sort -u erases some UTF-8 characters

Bug #821951 reported by An Yang
This bug affects 1 person
Affects                    Status     Importance  Assigned to  Milestone
eglibc                     Confirmed  Critical
langpack-locales (Ubuntu)  Triaged    High        Unassigned

Bug Description

sort -u erases some UTF-8 characters: distinct lines are missing from its output.

See the attachment for the test data. To reproduce:
sort -u x.sorted.utf8 > x.sorted.uniq.utf8
diff x.sorted.uniq.utf8 x.sorted.utf8 > x.diff
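
A plain diff of the two files also lists genuine duplicates that sort -u removed correctly. The sketch below is one way to count only the lines that vanished entirely, that is, distinct lines of the input with no copy left in the output. The function names are hypothetical, and the file names in the comment are the ones used in this report.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def lost_lines(original, deduplicated):
    """Return the distinct lines of `original` that do not appear in `deduplicated`.

    Comparison is by exact string equality (code points), not by locale
    collation, so two different characters are never treated as duplicates.
    """
    kept = set(deduplicated)
    # dict.fromkeys() keeps one copy of each line in first-seen order.
    return [line for line in dict.fromkeys(original) if line not in kept]


# In-memory demonstration; with the attachments one would instead call
#   lost_lines(read_lines("x.sorted.utf8"), read_lines("x.sorted.uniq.utf8"))
demo_input = ["\u4e00", "\u4e00", "\u3400", "\u3401"]
demo_output = ["\u4e00", "\u3401"]
print(len(lost_lines(demo_input, demo_output)), "distinct line(s) lost")
```

A correct sort -u should leave this list empty, because removing duplicates never removes the last copy of a line.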

Tags: patch
An Yang (euroford) wrote :

My result of sort -u x.sorted.utf8 > x.sorted.uniq.utf8

I ran this on both lucid and natty and got the same problem.

An Yang (euroford) wrote :

My x.diff file: sort -u erases 686 Chinese characters.

An Yang (euroford) wrote :

My locale is:

LANG=zh_CN.utf8
LANGUAGE=zh_CN:zh
LC_CTYPE="zh_CN.utf8"
LC_NUMERIC="zh_CN.utf8"
LC_TIME="zh_CN.utf8"
LC_COLLATE="zh_CN.utf8"
LC_MONETARY="zh_CN.utf8"
LC_MESSAGES="zh_CN.utf8"
LC_PAPER="zh_CN.utf8"
LC_NAME="zh_CN.utf8"
LC_ADDRESS="zh_CN.utf8"
LC_TELEPHONE="zh_CN.utf8"
LC_MEASUREMENT="zh_CN.utf8"
LC_IDENTIFICATION="zh_CN.utf8"
LC_ALL=

An Yang (euroford) wrote :

If I set LANG to en_US.utf8, sort -u erases 2716 Chinese characters.
Please see the attachment.

An Yang (euroford) wrote :

The reason is that eglibc/glibc only supports the CJK UNIFIED IDEOGRAPH range (<U4E00>-<U9FA5>) defined in ISO 10646:1993.
eglibc/glibc lacks support for CJK UNIFIED IDEOGRAPH A/B/C/D defined in ISO 10646:2011.

An Yang (euroford) wrote :

Sorry, I left out a word.
eglibc/glibc lacks support for CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D defined in ISO 10646:2011.
CJK UNIFIED IDEOGRAPH EXTENSION A is included in GB18030:2005, and GB18030:2005 is the standard for the China locale.

affects: coreutils (Ubuntu) → eglibc (Ubuntu)
Changed in eglibc (Ubuntu):
status: New → Confirmed
An Yang (euroford) wrote :

All 686 of the lost Chinese characters are in the CJK UNIFIED IDEOGRAPH EXTENSION A block.
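
A claim like this can be checked by classifying each lost character by code point. The sketch below assumes the two ranges discussed in this report: <U4E00>-<U9FA5> for the 1993 unified ideographs and U+3400-U+4DBF for the Extension A block. The function names and labels are hypothetical.

```python
# Code point ranges taken from this report (inclusive on both ends).
URO_1993 = (0x4E00, 0x9FA5)   # CJK UNIFIED IDEOGRAPH range cited for ISO 10646:1993
EXT_A = (0x3400, 0x4DBF)      # CJK UNIFIED IDEOGRAPH EXTENSION A block


def block_of(ch):
    """Return a coarse label for the block a single character belongs to."""
    cp = ord(ch)
    if EXT_A[0] <= cp <= EXT_A[1]:
        return "extension A"
    if URO_1993[0] <= cp <= URO_1993[1]:
        return "unified ideographs (1993 range)"
    return "other"


def count_by_block(chars):
    """Count how many characters of an iterable fall into each block label."""
    counts = {}
    for ch in chars:
        label = block_of(ch)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Two Extension A characters and one from the 1993 range.
print(count_by_block("\u3400\u4dbf\u4e00"))
```

Feeding the characters from x.diff through count_by_block would show whether every lost character really lands under "extension A".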

In , An Yang (euroford) wrote :

Hi,

According to glibc/localedata/locales/zh_CN and iso14651_t1_pinyin or
iso14651_t1, glibc only supports Unicode 3.0.

The current version of Unicode is 6.0. It extends CJK UNIFIED IDEOGRAPH with
extensions A/B/C/D, and extension A is included in GB18030:2005 (the China
locale charset standard).

So, at the very least, glibc should sort all Chinese characters in CJK UNIFIED IDEOGRAPH and EXTENSION A (U+3400-U+4DBF).

The practical effect shows up in sort -u.
If you run sort -u examples_CJK_extensionA.txt (see attachment), you
will get only one Chinese character, "㑗".

Regards,
An Yang

In , An Yang (euroford) wrote :

Created attachment 5880
example characters in CJK extension A.

Changed in eglibc:
importance: Unknown → Critical
status: Unknown → Confirmed
In , An Yang (euroford) wrote :

I'm not sure whether this bug has anything to do with charmaps; it may or may not.
But the value of LC_COLLATE in zh_CN is:

% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

I'm sure something is wrong in this table.

None of the erased Chinese characters has an entry in iso14651_t1_pinyin, but they are all included in CJK Unified Ideographs or Extensions A/B/C/D.
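
The link between a missing table entry and a missing output line can be illustrated with a toy model. It assumes that a character with no collation entry contributes no weight to the comparison key, so any two such characters compare as equal and a collation-based "unique" step keeps only one of them. This is a simplified sketch with hypothetical weights and function names, not glibc's actual comparison code.

```python
# Toy collation table: only these two characters have weights (made-up values).
WEIGHTS = {"\u4e00": 1, "\u4e01": 2}


def collation_key(text):
    """Build a comparison key; characters without an entry add nothing (assumed)."""
    return tuple(WEIGHTS[ch] for ch in text if ch in WEIGHTS)


def sort_unique(lines):
    """Sort by collation key and drop every line whose key equals the previous one."""
    result = []
    for line in sorted(lines, key=collation_key):
        if not result or collation_key(result[-1]) != collation_key(line):
            result.append(line)
    return result


lines = ["\u3400", "\u3401", "\u4e00", "\u4e01", "\u4e00"]
# Both Extension A characters get the empty key, so only the first survives.
print(sort_unique(lines))
# Comparing by code point instead keeps all four distinct characters.
print(sorted(set(lines)))
```

Under this model, a file that contains only characters without entries collapses to a single line, which matches the behaviour reported for examples_CJK_extensionA.txt.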

An Yang (euroford) wrote :

Something is wrong in iso14651_t1_pinyin and iso14651_t1.

affects: eglibc (Ubuntu) → langpack-locales (Ubuntu)
An Yang (euroford) wrote :

This patch fixes the bug when sort -u is run under any LANG except zh_CN.

In , An Yang (euroford) wrote :

There are 25496 Chinese characters in iso14651_t1_pinyin; most of them fall within CJK unified ideographs and CJK unified ideographs extension A.

But there are 27552 Chinese characters in CJK unified ideographs and extension A, so more than 2000 Chinese characters without pinyin are missing.

So my suggestion is simply to append the missing characters at the end of iso14651_t1_pinyin, in Unicode order.

Could you give me any feedback?
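
The suggestion above amounts to listing, in code point order, every character of the two blocks that the table does not yet contain. A minimal sketch follows, assuming the set of code points already present has been extracted from iso14651_t1_pinyin by some other means. The function name is hypothetical, the output uses the <UXXXX> symbol style quoted earlier in this report, and the weights each new entry would need are left out.

```python
def missing_symbols(present, ranges):
    """List <UXXXX> names for code points in `ranges` that are not in `present`.

    `present` is a set of integer code points already in the table.
    `ranges` is a list of inclusive (first, last) pairs; they are walked in
    ascending order so the result is in Unicode order.
    """
    missing = []
    for first, last in sorted(ranges):
        for cp in range(first, last + 1):
            if cp not in present:
                missing.append("<U%04X>" % cp)
    return missing


# Tiny demonstration on a four-code-point range.
print(missing_symbols({0x3400, 0x3402}, [(0x3400, 0x3403)]))

# For the real table the ranges would be the two blocks named in this report:
#   [(0x3400, 0x4DBF), (0x4E00, 0x9FA5)]
```

Walking a whole block this way also emits any code points that are unassigned in a given Unicode version, so the list would need filtering against the character database before use.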

Martin Pitt (pitti)
Changed in langpack-locales (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "iso14651_t1.diff" of this bug report has been identified as being a patch. The ubuntu-reviewers team has been subscribed to the bug report so that they can review the patch. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-sponsors please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch