Annotate with --show-ids has encoding problem
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Bazaar | Fix Released | Medium | Unassigned |
Bug Description
With latest bzr.dev
./bzr ann --show-ids NEWS
fails with
bzr: ERROR: exceptions.
Traceback (most recent call last):
File "/home/
return run_bzr(argv)
File "/home/
ret = run(*run_argv)
File "/home/
return self.run(
File "/home/
result = func(*args, **kwargs)
File "/home/
show_
File "/home/
to_
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
bzr 0.14.0dev0 on python 2.4.4.final.0 (linux2)
arguments: ['./bzr', 'ann', '--show-ids', 'NEWS']
** please send this report to <email address hidden>
Changed in bzr:
importance: Undecided → Medium
status: Unconfirmed → Confirmed
Changed in bzr:
status: Confirmed → Fix Released
I can confirm this, and I know where the problems are...
Specifically the line:
to_file.write('%*s | %s' % (max_origin_len, this, text))
is failing because 'this' (being the revision id) is a unicode object, while 'text' is whatever text is in the file.
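We *don't* want to decode the text of the file, because it can be anything the user entered. But when you format a plain byte string together with a Unicode object, as above, Python automatically upcasts the byte string to Unicode using the default ascii codec, and that fails on the first non-ascii byte.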
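To make the failure concrete, here is a minimal sketch that runs on a current Python 3, where the decode has to be spelled out because bytes and text no longer mix implicitly. Python 2 performed the same ascii decode silently inside the `%` formatting. The revision id value and the helper name `format_like_python2` are made up for illustration.

```python
# Python 3 sketch of what Python 2 did implicitly inside
#   '%*s | %s' % (max_origin_len, this, text)
# when 'this' was unicode and 'text' was a plain byte string.

# Hypothetical revision id (a text object, like the cached ids in bzr).
revision_id = "erik@example.com-20070101000000-abcdef"

# Raw file content: bytes exactly as the user stored them (utf-8 here).
text = "bår\n".encode("utf-8")


def format_like_python2(width, this, text):
    # Python 2 promoted the result to unicode and decoded the byte string
    # with the default 'ascii' codec. The decode is explicit here.
    return "%*s | %s" % (width, this, text.decode("ascii"))


try:
    format_like_python2(len(revision_id), revision_id, text)
except UnicodeDecodeError as exc:
    # Same error class as in the traceback in the bug description.
    print(exc)
```

With pure-ascii file content the same call succeeds, which is why the problem only shows up on lines containing a non-ascii byte.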
It should actually be failing for 'bzr annotate --long' too, but it seems the API of cElementTree has changed a little bit: it now returns a plain string if it can, rather than always returning a Unicode string. That means that, when possible, the Author (committer) field is returned as a plain string rather than Unicode. (And Erik Bågfors hasn't made a change to NEWS that includes a line with a non-ascii character.)
The difference is that our unpacking code always makes sure that revision ids are unicode, because it uses cache_utf8.get_cached_unicode(revision_id). We do that because it means each revision id is a single shared object (which saves some memory, and usually improves performance a bit).
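The "single shared object" idea can be sketched roughly as below. This is not the bzrlib cache_utf8 code. `get_cached_text` and `_text_cache` are hypothetical names, and the sketch only shows why a cache lookup hands back the same text object for equal ids.

```python
# Hypothetical cache: utf-8 bytes -> one shared decoded text object.
_text_cache = {}


def get_cached_text(utf8_bytes):
    """Return the same text object for every equal utf-8 byte string."""
    try:
        return _text_cache[utf8_bytes]
    except KeyError:
        decoded = utf8_bytes.decode("utf-8")
        _text_cache[utf8_bytes] = decoded
        return decoded


first = get_cached_text(b"rev-1")
# Equal content, but built separately so it is a distinct bytes object.
second = get_cached_text(bytes(bytearray(b"rev-1")))

# Both lookups return the identical object, so thousands of annotated
# lines pointing at one revision share a single string in memory.
print(first is second)
```

The side effect relevant to this bug is that every revision id coming out of such a cache is a text (unicode) object, never a plain byte string.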
So there are a couple easy ways to reproduce:
$ bzr init foo
$ cd foo
$ bzr whoami --branch "Erik Bågfors <email address hidden>"
$ echo -e "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --long a
$ bzr annotate --show-ids a
# even without the whoami, you can do
$ bzr init foo
$ cd foo
$ echo "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --show-ids a
How to fix...
I think the correct fix is to declare that all annotation information is in either terminal encoding or utf-8 encoding. (I would *really* like to say utf-8, but that might be difficult for someone like Alexander.) Actually, because of terminal differences, it might be best to use the file encoding (yet another value on Windows), which I think is closest to bzrlib.user_encoding.
The reason is that it has the best chance of being the same encoding as the actual contents of the file. That means that doing:
bzr annotate foo.txt > foo.ann.txt; notepad foo.ann.txt
is the most likely to give you something readable.
(One small note: the actual annotations we add are going to be ascii, because to date revision ids have been 100% ascii, and we only include the email portion of the author, which is also ascii...)
Attached is a patch which uses user encoding.
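The patch itself is not reproduced here. As a rough sketch of the approach, assuming the output stream accepts bytes: encode only the annotation prefix in the user encoding, then append the file text untouched. `write_annotation` is a hypothetical helper, and `locale.getpreferredencoding` stands in for bzrlib.user_encoding.

```python
import io
import locale


def write_annotation(to_file, width, origin, text, encoding):
    """Write one annotated line to a binary file object.

    origin: the annotation (revision id or committer) as text.
    text: the raw bytes of the file line, passed through unchanged.
    """
    # Pad on the text first, then encode only the annotation part.
    prefix = ("%*s | " % (width, origin)).encode(encoding, "replace")
    # The file content is never decoded, so it cannot raise here.
    to_file.write(prefix + text)


# Stand-in for bzrlib.user_encoding (an assumption for this sketch).
user_encoding = locale.getpreferredencoding(False) or "utf-8"

out = io.BytesIO()
write_annotation(out, 8, "rev-1", "bår\n".encode("utf-8"), user_encoding)
print(out.getvalue())
```

Because the file bytes are written as-is, the output is readable whenever the file content and the user encoding agree, which is the case argued for above.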