sort -u erases some UTF-8 characters

Bug #821951 reported by An Yang on 2011-08-06

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	eglibc	Confirmed	Critical	sourceware-bugs #13063
	langpack-locales (Ubuntu)	Triaged	High	Unassigned

Bug Description

sort -u erases some UTF-8 characters.

See the attachments for detailed data.
sort -u x.sorted.utf8 > x.sorted.uniq.utf8
diff x.sorted.uniq.utf8 x.sorted.utf8 > x.diff

An Yang (euroford) wrote on 2011-08-06:

x.sorted.utf8 (78.2 KiB, text/plain)

An Yang (euroford) wrote on 2011-08-06:

x.sorted.uniq.utf8 (75.5 KiB, text/plain)

This is my result of sort -u x.sorted.utf8 > x.sorted.uniq.utf8.

I ran this on both lucid and natty and got the same problem.

An Yang (euroford) wrote on 2011-08-06:

x.diff (4.0 KiB, text/plain)

This is my x.diff file; sort -u erases 686 Chinese characters.

An Yang (euroford) wrote on 2011-08-06:

My locale is:

LANG=zh_CN.utf8
LANGUAGE=zh_CN:zh
LC_CTYPE="zh_CN.utf8"
LC_NUMERIC="zh_CN.utf8"
LC_TIME="zh_CN.utf8"
LC_COLLATE="zh_CN.utf8"
LC_MONETARY="zh_CN.utf8"
LC_MESSAGES="zh_CN.utf8"
LC_PAPER="zh_CN.utf8"
LC_NAME="zh_CN.utf8"
LC_ADDRESS="zh_CN.utf8"
LC_TELEPHONE="zh_CN.utf8"
LC_MEASUREMENT="zh_CN.utf8"
LC_IDENTIFICATION="zh_CN.utf8"
LC_ALL=

An Yang (euroford) wrote on 2011-08-06:

x.diff (210.5 KiB, text/plain)

If I set LANG to en_US.utf8, sort -u erases 2716 Chinese characters.
Please see the attachment.

An Yang (euroford) wrote on 2011-08-06:

The reason is that eglibc/glibc only supports the CJK UNIFIED IDEOGRAPH range (<U4E00>-<U9FA5>) defined in ISO 10646:1993.
eglibc/glibc lacks support for CJK UNIFIED IDEOGRAPH A/B/C/D defined in ISO 10646:2011.

An Yang (euroford) wrote on 2011-08-06:

Sorry, I left out a word.
eglibc/glibc lacks support for CJK UNIFIED IDEOGRAPH EXTENSION A/B/C/D defined in ISO 10646:2011.
CJK UNIFIED IDEOGRAPH EXTENSION A is included in GB18030:2005, and GB18030:2005 is the character set standard for the China locale.

affects:	coreutils (Ubuntu) → eglibc (Ubuntu)
Changed in eglibc (Ubuntu):
status:	New → Confirmed

An Yang (euroford) wrote on 2011-08-06:

All of the 686 lost Chinese characters are located in the CJK UNIFIED IDEOGRAPH EXTENSION A block.
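
A claim like this can be checked by classifying each lost character by Unicode block. The following is a minimal Python sketch and not part of the original report: the helper names `block_of` and `count_by_block` are hypothetical, and the ranges are the ones quoted in this thread (U+4E00 to U+9FA5 for the ISO 10646:1993 repertoire, U+3400 to U+4DBF for Extension A). It counts every non-whitespace character, so it makes no assumption about how the attachments are laid out.

```python
from collections import Counter

# Ranges as quoted in this bug report; the labels are illustrative only.
BLOCKS = [
    ("CJK Unified Ideographs (1993 repertoire)", 0x4E00, 0x9FA5),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
]


def block_of(ch):
    """Return the label of the range containing ch, or 'other'."""
    cp = ord(ch)
    for name, lo, hi in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


def count_by_block(lines):
    """Count every non-whitespace character in lines, grouped by range."""
    return Counter(
        block_of(ch) for line in lines for ch in line if not ch.isspace()
    )
```

Feeding it the lines that appear only in x.sorted.utf8 would show whether all of them fall in the Extension A range.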

In Sourceware.org Bugzilla #13063, An Yang (euroford) wrote on 2011-08-06:

Hi,

According to glibc/localedata/locales/zh_CN and iso14651_t1_pinyin or
iso14651_t1, glibc only supports Unicode 3.0.

The current version of Unicode is 6.0. It extends CJK UNIFIED IDEOGRAPH with
Extensions A/B/C/D, and Extension A is included in GB18030:2005 (the China
locale charset standard).

So, at the very least, glibc should sort all Chinese characters in CJK UNIFIED IDEOGRAPH and EXTENSION A (U+3400-U+4DBF).

The practical effect shows up in sort -u.
If you execute sort -u examples_CJK_extensionA.txt (see attachment), you
will get only one Chinese character, "㑗".

Regards,
An Yang
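
The collapse to a single character is what one would expect if characters without collation weights are ignored during comparison, so that every such line compares equal and sort -u keeps only one of them. The sketch below is a simplified Python model of that idea and not glibc's actual algorithm: the `WEIGHTS` table, its values, and the function names are hypothetical, and real collation uses multi-level weights.

```python
# Toy collation table: only these two characters have weights.
# The values are made up for illustration.
WEIGHTS = {"\u4e00": 1, "\u4e01": 2}


def collation_key(line):
    """Map a line to its weights; characters without an entry are ignored."""
    return tuple(WEIGHTS[ch] for ch in line if ch in WEIGHTS)


def sort_unique(lines):
    """Sort by collation key and keep one line per run of equal keys."""
    result = []
    last_key = None
    for line in sorted(lines, key=collation_key):
        key = collation_key(line)
        if not result or key != last_key:
            result.append(line)
            last_key = key
    return result
```

In this model, any number of lines made up only of unweighted characters share the empty key, so a single one survives. Which one survives here depends on input order and is not meant to predict what GNU sort prints.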

In Sourceware.org Bugzilla #13063, An Yang (euroford) wrote on 2011-08-06:

#10

Created attachment 5880
example characters in CJK extension A.

Bug Watch Updater (bug-watch-updater) on 2011-08-06

Changed in eglibc:
importance:	Unknown → Critical
status:	Unknown → Confirmed

In Sourceware.org Bugzilla #13063, An Yang (euroford) wrote on 2011-08-07:

#13

I'm not sure whether this bug has any relationship with charmaps; it may or may not.
But the value of LC_COLLATE in zh_CN is:

% ISO 14651 collation sequence
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE

I'm sure something is wrong in this table.

None of the erased Chinese characters has a record in iso14651_t1_pinyin, but they are all included in CJK Unified Ideographs/ExtA/B/C/D.

An Yang (euroford) wrote on 2011-08-07:

#11

Something is wrong in iso14651_t1_pinyin and iso14651_t1

affects:	eglibc (Ubuntu) → langpack-locales (Ubuntu)

An Yang (euroford) wrote on 2011-08-07:

#12

iso14651_t1.diff (894 bytes, text/plain)

This patch fixes the bug when sort -u is executed under any LANG except zh_CN.

In Sourceware.org Bugzilla #13063, An Yang (euroford) wrote on 2011-08-08:

#14

There are 25496 Chinese characters in iso14651_t1_pinyin; most of them are distributed over CJK Unified Ideographs and CJK Unified Ideographs Extension A.

But there are 27552 Chinese characters in CJK Unified Ideographs and Extension A, so more than 2000 Chinese characters without pinyin are missing.

So my suggestion is simply to add the missing characters at the end of iso14651_t1_pinyin, in Unicode order.

Could you give me any feedback?
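
The suggestion above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `missing_symbols` is hypothetical, the set of code points already present would have to be extracted from iso14651_t1_pinyin by some other means, and the output is just the `<UXXXX>` symbol names in code point order, not complete collation table lines.

```python
def missing_symbols(present, ranges):
    """List <UXXXX> symbols for code points in ranges that are not in present.

    present: set of integer code points already in the collation table.
    ranges:  iterable of inclusive (low, high) code point pairs.
    The result is in ascending code point order, assuming the ranges
    do not overlap.
    """
    out = []
    for lo, hi in sorted(ranges):
        for cp in range(lo, hi + 1):
            if cp not in present:
                out.append("<U%04X>" % cp)
    return out
```

For example, calling it with the ranges (0x3400, 0x4DBF) and (0x4E00, 0x9FA5) quoted in this thread would list Extension A gaps first, followed by gaps in the main block.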

Martin Pitt (pitti) on 2011-11-16

Changed in langpack-locales (Ubuntu):
importance:	Undecided → High
status:	Confirmed → Triaged

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2011-11-16:

#15

The attachment "iso14651_t1.diff" of this bug report has been identified as being a patch. The ubuntu-reviewers team has been subscribed to the bug report so that they can review the patch. In the event that this is in fact not a patch, you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are a member of ubuntu-sponsors, please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags:	added: patch