add: accepts aliases due to case insensitivity

Bug #172861 reported by David Roberts
Affects   Status     Importance  Assigned to  Milestone
Bazaar    Confirmed  Medium      Unassigned
Breezy    Triaged    Medium      Unassigned

Bug Description

The following appears to demonstrate that on Win32
there is a bug in add that will allow filenames which
are aliases due to case-insensitivity to be added.

G:\bzrtempco>bzr --version
Bazaar (bzr) 0.91.0 [...... ]

G:\bzrtempco>mkdir bzr91

G:\bzrtempco>cd bzr91

G:\bzrtempco\bzr91>bzr init

G:\bzrtempco\bzr91>echo content > Bzr

G:\bzrtempco\bzr91>bzr add
added Bzr

G:\bzrtempco\bzr91>bzr commit -m"initial checkin"
Committing revision 1 to "G:/bzrtempco/bzr91/".
added Bzr
Committed revision 1.

G:\bzrtempco\bzr91>rename Bzr bZR

G:\bzrtempco\bzr91>bzr add
added bZR

G:\bzrtempco\bzr91>

[Related to #34057, #77744]

John A Meinel (jameinel) wrote :

I think you might be able to trigger this without the 'rename'.
Just doing:

bzr init test
cd test
touch Foo
bzr add Foo
bzr commit -m "Foo"
bzr add foO
bzr commit -m "foO"

At the end I have:
bzr log --short --verbose:
    1 John Arbash Meinel 2007-11-29
      Foo
added:
  Foo

    2 John Arbash Meinel 2007-11-29
      foO
added:
  foO

and

% bzr status
removed:
  foO

But
% bzr commit -m 'test'
Committing revision 3 to ".../test/".
bzr: ERROR: no changes to commit. use --unchanged to commit anyhow

This happens because the "bzr status" code hasn't been unified with the "bzr commit" code. Commit iterates through the recorded files and stats each one to see if it exists on disk; on a case-insensitive filesystem both names stat successfully, so both show as present. 'bzr status' instead iterates through a listdir of what exists on disk while it is iterating through the recorded list, so it has already consumed Foo by the time it sees foO, and it reports foO as removed.
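
A minimal sketch of that discrepancy, simulated in pure Python so that it runs on any filesystem. The function names are hypothetical and are not bzr code; the two loops only approximate the behaviour described above, under the assumption that the filesystem matches names without regard to case.

```python
def exists_case_insensitive(name, on_disk):
    """Mimic os.stat on a case-insensitive filesystem: any case variant matches."""
    return name.lower() in {n.lower() for n in on_disk}


def commit_style_missing(recorded, on_disk):
    """Stat each recorded name on its own; every alias appears to be present."""
    return [n for n in recorded if not exists_case_insensitive(n, on_disk)]


def status_style_missing(recorded, on_disk):
    """Pair recorded names with a directory listing, consuming each on-disk
    entry at most once, so a second alias finds nothing left to match."""
    remaining = {n.lower() for n in on_disk}
    missing = []
    for name in recorded:
        key = name.lower()
        if key in remaining:
            remaining.remove(key)
        else:
            missing.append(name)
    return missing


recorded = ["Foo", "foO"]  # both names were added and committed
on_disk = ["Foo"]          # only one file really exists

print(commit_style_missing(recorded, on_disk))  # []
print(status_style_missing(recorded, on_disk))  # ['foO']
```

The first result corresponds to "no changes to commit", the second to status listing foO as removed.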

The other problem is that I don't know of any way to get the real name of a file. You could use "os.listdir(os.path.dirname(path))" and then search through it for possible names, but you don't know if foO is there because of a rename, or if it is actually a different file.
It would be nice if you could do:

st = os.stat('foo')
and have "st" have a st_name or some other property that gives you the exact name on disk.

Does anyone know if that is possible?
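
For reference, the os.listdir approach mentioned above could look roughly like the sketch below. `names_on_disk` is a hypothetical helper, not a bzr function, and it has exactly the limitation described: it reports which spellings exist, not whether a differing spelling came from a rename.

```python
import os


def names_on_disk(path):
    """Return every entry in path's directory whose name matches ignoring case."""
    directory, wanted = os.path.split(path)
    wanted_key = wanted.lower()
    return [entry for entry in os.listdir(directory or ".")
            if entry.lower() == wanted_key]
```

On a case-insensitive filesystem the result has at most one element, which is the spelling stored on disk. On a case-sensitive filesystem it may contain several distinct files.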

Changed in bzr:
importance: Undecided → Medium
status: New → Triaged
Alexander Belchenko (bialix) wrote : Re: [Bug 172861] Re: add: accepts aliases due to case insensitivity

John A Meinel wrote:
>
> The other problem, is that I don't know of any way to get the real name of a file. You could use "os.listdir(os.path.dirname(path))" and then search through it for possible names, but you don't know if foO is there because of a rename, or if it is actually a different file.
> It would be nice if you could do:
>
> st = os.stat('foo')
> and have "st" have a st_name or some other property that gives you the exact name on disk.
>
> Does anyone know if that is possible?

I do. I recently dug through MSDN and found a way to determine the real filename without
using os.listdir (which is too expensive in Python). You need the pywin32 library for this, or
you have to write a C extension. Using ctypes is also possible, but it is too verbose in Python.

Here is the actual code:

import win32file
# FindFilesW returns a list of WIN32_FIND_DATA tuples.
# If path_in_question contains the wildcard characters * or ?, we get a list of all
# matching files, as with the glob function.
names_list = win32file.FindFilesW(path_in_question)
real_name = names_list[0][8]  # element 8 is the file name as stored on disk

Here is an example run in the bzr.dev tree:

In [4]: import win32file

In [5]: names_list = win32file.FindFilesW('BzR')

In [6]: names_list[0][8]
Out[6]: u'bzr'

In [7]: len(names_list)
Out[7]: 1

John A Meinel (jameinel) wrote :

Alexander Belchenko wrote:
> John A Meinel wrote:
>> The other problem, is that I don't know of any way to get the real name of a file. You could use "os.listdir(os.path.dirname(path))" and then search through it for possible names, but you don't know if foO is there because of a rename, or if it is actually a different file.
>> It would be nice if you could do:
>
>> st = os.stat('foo')
>> and have "st" have a st_name or some other property that gives you the exact name on disk.
>
>> Does anyone know if that is possible?
>
> I know. Recently I dig through MSDN and found this way to determine real filename without
> using os.listdir (it's too expensive in Python). You need pywin32 library for this, or
> writing C-extension. Using ctypes is also possible but it's too verbose in Python.
>
> Here the actual code:
>
> import win32file
> names_list = win32file.FindFilesW(path_in_question) # returns list of WIN32_FIND_DATA structs
> # if path_in_question contains wildcard characters * or ? then we get list of al matching files,
> # like with glob function
> real_name = names_list[0][8]
>
> Here is example running in bzr.dev tree:
>
> In [4]: import win32file
>
> In [5]: names_list = win32file.FindFilesW('BzR')
>
> In [6]: names_list[0][8]
> Out[6]: u'bzr'
>
> In [7]: len(names_list)
> Out[7]: 1

We might think about using something like this. But it pains me to think about
making 50k calls to FindFilesW just to make sure that when we stat to see if
'X' exists that it isn't really called 'x'....

Maybe once we get commit to use an _iter_changes() style api, it will detect
these in a more straightforward manner.

And then we could use something like FindFilesW for things like misses, to see
if they really missed, or if it is just a case-insensitivity thing.

We certainly could use it for sanitizing user parameters (for things like bzr
add, or maybe even bzr status), because those are usually limited in number.
(Obviously, if someone is scripting you may get a bunch, but there is still
the limit on the number of characters on the command line.)
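
A rough sketch of what such sanitizing could look like for "bzr add", assuming the list of already versioned names is at hand. `find_case_aliases` is a hypothetical helper, not part of bzr, and `str.lower()` only approximates how a real filesystem folds case.

```python
def find_case_aliases(candidates, versioned):
    """Map each candidate that differs only by case from a versioned name
    to the versioned spelling it collides with."""
    by_key = {name.lower(): name for name in versioned}
    aliases = {}
    for name in candidates:
        existing = by_key.get(name.lower())
        if existing is not None and existing != name:
            aliases[name] = existing
    return aliases


print(find_case_aliases(["bZR", "README"], ["Bzr", "README"]))
# {'bZR': 'Bzr'}
```

Because the check only runs over names given by the user, its cost scales with the command line rather than with the size of the tree.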

Do we have any similar function for Mac OS X? Especially since Mac likes to
rename files based on unicode normalization. (And as someone commented it isn't
pure NFC, so you really need to ask Mac what the name is.)
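
To illustrate why normalization matters here, a small sketch using Python's unicodedata module. It only shows that two spellings of the same visible name compare unequal until they are normalized; as noted above, the filesystem's own rules may not be plain NFC, so this is not a substitute for asking the filesystem. The helper name is made up for the example.

```python
import unicodedata


def same_after_normalization(a, b, form="NFC"):
    """Compare two names after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9"     # 'e with acute' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

print(composed == decomposed)                          # False
print(same_after_normalization(composed, decomposed))  # True
```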

John
=:->

Alexander Belchenko (bialix) wrote :

John Arbash Meinel wrote:
> We might think about using something like this. But it pains me to think about
> making 50k calls to FindFilesW just to make sure that when we stat to see if
> 'X' exists that it isn't really called 'x'....

Looking at the internal implementation of os.listdir in CPython, I'm sure that one day we
should rewrite the win32 walkdirs code and throw away os.listdir *completely*, because:
1) os.listdir uses the FindFiles API internally.
2) WIN32_FIND_DATA contains *all* the stat info, so the additional os.lstat in the walkdirs
code is simply redundant! On win32 we are able to produce os.listdir and os.lstat for each
item in the os.listdir output in one pass! If you remember, the additional os.lstat costs
too much on FAT32, as I discovered with my fake symlinks code.
3) IMO the walkdirs generator on win32 should emit a pair of filenames: the real name on disk
and a normalized name (lowercased for win32). Again, we will be able to produce it
in one pass and therefore get a performance win.
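
A sketch of the one-pass idea in point 3, written with os.scandir, which was added to Python (3.5) long after this discussion and is therefore an assumption rather than anything bzr used at the time. `list_entries` is a hypothetical name. os.scandir returns the file type together with each directory entry, so on most platforms the type checks below need no extra stat call.

```python
import os


def list_entries(top):
    """Yield (real_name, normalized_name, kind) for each entry directly in top."""
    with os.scandir(top) as entries:
        for entry in entries:
            if entry.is_symlink():
                kind = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kind = "directory"
            else:
                kind = "file"
            # Real name as stored on disk, plus a lowercased name for matching.
            yield entry.name, entry.name.lower(), kind
```

The sketch lists a single directory; a full walkdirs replacement would also recurse into the entries reported as directories.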

>
> Maybe once we get commit to use an _iter_changes() style api, it will detect
> these in a more straightforward manner.
>
> And then we could use something like FindFilesW for things like misses, to see
> if they really missed, or if it is just a case-insensitivity thing.
>
> We certainly could use it for sanitizing user parameters (for things like bzr
> add, or maybe even bzr status).
>
> Because those are usually limited (obviously if someone is scripting you may
> get a bunch, but there is still the limit of number of characters on the
> command line).
>
> Do we have any similar function for Mac OS X? Especially since Mac likes to
> rename files based on unicode normalization. (And as someone commented it isn't
> pure NFC, so you really need to ask Mac what the name is.)
>
> John
> =:->

Robert Collins (lifeless) wrote :

On Thu, 2007-11-29 at 19:11 +0000, Alexander Belchenko wrote:
>
> Looking at internal realization of os.listdir in CPython I'm sure one day we should
> rewrite win32 walkdirs code and throw away os.listdir *completely*.
> Because:
> 1) os.listdir use FindFiles API internally
> 2) WIN32_FIND_DATA contains *all* stat info, so additional os.lstat in walkdirs code
> is simply redundant! On win32 we are able produce os.listdir and os.lstat for each
> item in os.listdir output in one pass! If you're remember additional os.lstat costs
> too much on FAT32 as I discover with my fake symlinks code.
> 3) IMO walkdirs generator on win32 should emit pair of filenames: real name on disk
> and normalized name (lowercased for win32). Again we will be able to produce it
> in one pass and therefore get performance win.

+1 from me - os.listdir is not a good interface on unix either. Many
filesystems can give us the file kind directly, saving us from statting
directories and symlinks.

-Rob

description: updated
Martin Pool (mbp)
Changed in bzr:
status: Triaged → Confirmed
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
Jelmer Vernooij (jelmer)
tags: added: case-sensitivity
removed: check-for-breezy
Changed in brz:
status: New → Triaged
importance: Undecided → Medium