diff is confused by Japanese filenames

Bug #130553 reported by Fergal Daly
Affects: diffutils (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: diff

Using diff 2.8.1-11ubuntu

Here's some output from a diff -qr run:

root@vivo:/media/other/100g/mungo/backuppc# diff -qr . /media/mirrored/100g-2/backuppc/
File ./log/BackupPC.sock is a socket while file /media/mirrored/100g-2/backuppc/log/BackupPC.sock is a socket
Files ./pc/lap.local/32/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ セント・ラファエラ LC2.doc and /media/mirrored/100g-2/backuppc/pc/lap.local/32/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ ホーリーファミリー LC2.doc differ
Files ./pc/lap.local/32/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ ホーリーファミリー LC2.doc and /media/mirrored/100g-2/backuppc/pc/lap.local/32/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ セント・ラファエラ LC2.doc differ
Files ./pc/lap.local/34/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ セント・ラファエラ LC2.doc and /media/mirrored/100g-2/backuppc/pc/lap.local/34/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ ホーリーファミリー LC2.doc differ
Files ./pc/lap.local/34/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ ホーリーファミリー LC2.doc and /media/mirrored/100g-2/backuppc/pc/lap.local/34/f%2f/fhome/fmidori/fDesktop/f産休関係/f産休引き継ぎ セント・ラファエラ LC2.doc differ

Four of these lines say that files differ. They do differ, because each line pairs two different files. For some reason diff -r has decided to compare

"f産休関係/f産休引き継ぎ セント・ラファエラ LC2.doc" with
"f産休関係/f産休引き継ぎ ホーリーファミリー LC2.doc"

I have been able to reproduce this on a second run of diff.

The whole directory tree that I'm diffing is 20G, so it takes a while. I'm also going on vacation tomorrow, so I don't have time to try to produce a minimal test case.

Fergal Daly (fergal) wrote :

By the way, comparing the files individually shows there are no diffs.

diff -qr on just the directory with those files gives the same problem.

Creating dummy files with just those names in another directory does not reproduce the problem. Nor does recreating the whole directory with dummy files.

I can't do anything for about a week but if you have any test you'd like me to run, let me know.

Daniel T Chen (crimsun) wrote :

Is this symptom still reproducible in 8.10 beta or later?

Changed in diff:
status: New → Incomplete
Shawn Ligocki (sligocki) wrote :

Yeah, it appears that diff does not handle Unicode filenames correctly. I tested this quite a bit and found that diff can tell two filenames apart if

* they are different lengths or
* they have ASCII characters that are different (or in different locations).

Otherwise it fails. Simple case:
$ ls -R
.:
1 2

./1:
セ

./2:
ホ

$ diff 1 2
diff 1/セ 2/ホ
1,2c1,2
< File in dir 1 called:
< セ
---
> File in dir 2 called:
> ホ

Whereas it should behave the same way as it does here:
$ ls -R
.:
1 2

./1:
セ1

./2:
ホ2

$ diff 1 2
Only in 1: セ1
Only in 2: ホ2
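
The failing case above can be rebuilt from scratch with a short script. This is only a sketch: the scratch directory and the `out` variable are arbitrary choices, and the file contents are copied from the diff output shown earlier.

```shell
#!/bin/sh
# Sketch of the reduced test case: two directories, each holding one file
# whose name is a single, different katakana character.
set -eu

workdir=$(mktemp -d)   # scratch directory; the location is arbitrary
cd "$workdir"
mkdir 1 2

printf 'File in dir 1 called:\nセ\n' > '1/セ'
printf 'File in dir 2 called:\nホ\n' > '2/ホ'

# If the names are treated as distinct, diff prints two "Only in" lines.
# The behaviour reported here is a content diff of 1/セ against 2/ホ instead.
# diff exits non-zero whenever it finds differences, hence "|| true".
out=$(diff 1 2 || true)
printf '%s\n' "$out"
```

Running the same script with `LC_ALL=C` set in the environment is one way to check whether the locale plays a role. That the locale is involved at all is a guess, not something this report establishes.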

Changed in diffutils (Ubuntu):
status: Incomplete → Confirmed
Shawn Ligocki (sligocki) wrote :

By the way, this is on Ubuntu 8.10, fully updated.

Shawn Ligocki (sligocki) wrote :

I checked on a few other systems I have access to and it works correctly on all of them:

* Red Hat Enterprise 5.3
* Debian 5.0
* SunOS 5.8

However I don't know configuration details about these.
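
For anyone comparing an affected system with an unaffected one, something like the following could collect the missing configuration details. It is a rough sketch and assumes GNU diff for the version line; on systems with a different diff (SunOS, for example) that line simply reports the version as unknown.

```shell
#!/bin/sh
# Gather details that may differ between an affected and an unaffected system.
diff_version=$(diff --version 2>/dev/null | head -n 1)   # GNU diff only
echo "diff:   ${diff_version:-unknown (not GNU diff?)}"
echo "system: $(uname -sr)"
echo "locale settings:"
locale
```

Attaching this output from each machine would show whether the diff version or the locale settings differ between them.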
