strings should be normalized

Bug #100086 reported by Denis Moyogo Jacquerye
Affects: Launchpad itself
Status: Won't Fix
Importance: Low
Assigned to: Unassigned

Bug Description

Unicode strings should be normalized in some form, probably NFC for better legacy compatibility.

Right now translators can type with decomposed or composed characters, but Launchpad doesn't normalize the strings when saving or when searching.

For example, one translator might use a keyboard that produces precomposed characters such as 'é' (U+00E9), and another a keyboard that produces decomposed characters such as "é" (U+0065 followed by U+0301). Launchpad doesn't consider these two to be the same, yet Unicode defines them as canonically equivalent.
Another example is search: searching for the precomposed "é" or the decomposed "é" gives different results when both should give the same results.

NFC is strongly suggested since it is the form used by the W3C Character Model. See http://www.w3.org/TR/charmod-norm/#sec-NormalizationMotivation

See http://www.unicode.org/reports/tr15/ for more info on normalization of equivalent strings.

Changed in launchpad:
importance: Undecided → Medium
Jeroen T. Vermeulen (jtv) wrote :

See here for a function that can do this for us: http://www.python.org/doc/2.4/lib/module-unicodedata.html
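
A minimal sketch of what that module offers, assuming Python 3 (the linked page documents Python 2.4, where the same `unicodedata.normalize(form, unistr)` call takes a `unicode` object). The helper name `to_nfc` is hypothetical and not part of Launchpad.

```python
import unicodedata


def to_nfc(text):
    """Return text in Normalization Form C (canonical composition).

    Hypothetical helper for illustration; not Launchpad code.
    """
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"     # 'é' as a single precomposed code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# The raw strings differ, but their NFC forms are identical.
print(composed == decomposed)                  # False
print(to_nfc(composed) == to_nfc(decomposed))  # True
```

Applying a helper like this when saving would store every string in one form, which is what the bug description asks for.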

Changed in rosetta:
status: New → Confirmed
Данило Шеган (danilo) wrote :

I am convinced we should not do it. We are getting translations from multiple sources and it's best if we keep them verbatim so we can better track their history and origin.

Changed in rosetta:
importance: Medium → Low
status: Triaged → Won't Fix
Denis Moyogo Jacquerye (moyogo) wrote :

> I am convinced we should not do it. We are getting translations from
> multiple sources and it's best if we keep them verbatim so we can
> better track their history and origin.

I don't understand how not normalizing helps, or how normalizing prevents tracking history.
In any case the issue of normalization remains in Launchpad, notably with searches.

If one user translates using one form and another searches that translation using the other form, it won't match.
For example, user A translated using the word "é" (NFC).
User B wants to look for translations using the word "é" (NFD) but doesn't find that of user A.

Strings should be normalized during searches for matching.
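
One way to picture this, as a sketch only: normalize both sides at comparison time and leave the stored strings untouched, so the translations are still kept verbatim. The function names `normalized_contains` and `search_translations` and the sample data are assumptions for illustration, not Launchpad code.

```python
import unicodedata


def normalized_contains(haystack, needle, form="NFC"):
    """Return True if needle occurs in haystack after normalizing both."""
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return n in h


def search_translations(translations, query):
    """Return the stored strings, unchanged, that match the query."""
    return [t for t in translations if normalized_contains(t, query)]


# User A stored a translation in NFC; user B searches in NFD.
stored = ["caf\u00e9 noir", "the vert"]
query = "cafe\u0301"

matches = search_translations(stored, query)
print(matches == ["caf\u00e9 noir"])  # True
```

Because only the comparison is normalized, the stored strings keep whatever form their source used.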

Данило Шеган (danilo) wrote :

Yes, searches should work better. However, we should rely on our infrastructure to provide that (i.e., Postgres and/or Postgres full-text search once we start using it).
