Annotate with --show-ids has encoding problem

Bug #79084 reported by James Westby
2
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
Medium
Unassigned

Bug Description

With latest bzr.dev

./bzr ann --show-ids NEWS

fails with

bzr: ERROR: exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)

Traceback (most recent call last):
  File "/home/jw2328/devel/bzr/bzr.dev.new/bzrlib/commands.py", line 650, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/home/jw2328/devel/bzr/bzr.dev.new/bzrlib/commands.py", line 612, in run_bzr
    ret = run(*run_argv)
  File "/home/jw2328/devel/bzr/bzr.dev.new/bzrlib/commands.py", line 304, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/home/jw2328/devel/bzr/bzr.dev.new/bzrlib/commands.py", line 622, in ignore_pipe
    result = func(*args, **kwargs)
  File "/home/jw2328/devel/bzr/bzr.dev.new/bzrlib/builtins.py", line 2745, in run
    show_ids=show_ids)
  File "/home/jw2328/devel/bzr/bzr.dev.new/bzrlib/annotate.py", line 56, in annotate_file
    to_file.write('%*s | %s' % (max_origin_len, this, text))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)

bzr 0.14.0dev0 on python 2.4.4.final.0 (linux2)
arguments: ['./bzr', 'ann', '--show-ids', 'NEWS']

** please send this report to <email address hidden>

Changed in bzr:
importance: Undecided → Medium
status: Unconfirmed → Confirmed
Revision history for this message
John A Meinel (jameinel) wrote :

I can confirm this, and I know where the problems are...

Specifically the line:

to_file.write('%*s | %s' % (max_origin_len, this, text))

is failing because 'this' (being the revision id) is a unicode object, while 'text' is whatever text is in the file.

We *don't* want to decode the text of the file, because it can be anything the user entered. But when you have a structure like the above, it automatically upcasts the second string to being Unicode.

It should actually be failing for 'bzr annotate --long' but it seems like the API of cElementTree has changed a little bit. It now returns a plain string if it can, rather than always returning a Unicode string. Which means that when possible, the Author (committer) field is returned as a plain string, rather than Unicode. (And Erik Bågfors hasn't made a change to NEWS that includes a line with a non-ascii character)

The difference is that our unpacking code actually always makes sure that revision ids are unicode, because it uses cache_utf8.get_cached_unicode(revision_id) We do that, because it means our revision ids are single objects (which saves some memory, and usually improves performance a bit)

So there are a couple easy ways to reproduce:

$ bzr init foo
$ cd foo
$ bzr whoami --branch "Erik Bågfors <email address hidden>"
$ echo -e "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --long a
$ bzr annotate --show-ids a

# even without the whoami, you can do

$ bzr init foo
$ cd foo
$ echo "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --show-ids a

How to fix...

I think the correct fix is to declare that all annotation information is in either terminal encoding or utf-8 encoding. (I would *really* like to say utf-8, but that might be difficult for someone like Alexander). Actually, because of terminal differences, it might be best to be in file-encoding (yet another value on windows), which I think is closest to bzrlib.user_encoding.

The reason is it has the best chance to be in the same encoding as the actual contents of the file. Which means that doing: bzr annotate foo.txt > foo.ann.txt; notepad foo.ann.txt
Is the most likely to give you something readable.

(One small note, is that the actual annotations we add are going to be ascii, because to date revision ids have been 100% ascii, and we only include the email portion of the author, which is also ascii...)

Attached is a patch which uses user encoding.

Revision history for this message
Alexander Belchenko (bialix) wrote : Re: [Bug 79084] Re: Annotate with --show-ids has encoding problem

John A Meinel пишет:
> I can confirm this, and I know where the problems are...
<skip>
>
> (One small note, is that the actual annotations we add are going to be
> ascii, because to date revision ids have been 100% ascii, and we only
> include the email portion of the author, which is also ascii...)

I should point to the fact that email portion of the author
could be non-ascii: if I login on Windows with russian name
and *don't* set my user-id for bzr, then bzr create auto-id
based on my login and computer name. And in this case we
easy can have e-mail like string with russian characters, like:

<Саша@home>

Alexander

Revision history for this message
John A Meinel (jameinel) wrote :

fair enough. So the question becomes, what is more useful to you, Alexander. osutils.terminal_encoding() or get_user_encoding()

I realize you may not use 'bzr annotate' much. I'm just trying to figure out if 'terminal_encoding()' is going to be better because people typically only have ascii text, and are trying to annotate on a terminal. Or if user encoding is better, because they will tend to have ascii annotations anyway, and if they have non-ascii text, they have the best chance of being able to read it by redirecting to a file, and reading it there.

My gut feeling is that 'get_user_encoding()' is better in this instance, which is why it is what my patch does.

Revision history for this message
Alexander Belchenko (bialix) wrote :

John A Meinel пишет:
> fair enough. So the question becomes, what is more useful to you,
> Alexander. osutils.terminal_encoding() or get_user_encoding()

I think the right solution is to use get_user_encoding()
because on Windows it equivalent to system default ANSI encoding.

If people don't satisfied with this encoding they *always*
can switch encoding of dos terminal with chcp command.
I use this command when I want to see in dos terminal
a diff for my non-ascii files.

>
> I realize you may not use 'bzr annotate' much. I'm just trying to figure
> out if 'terminal_encoding()' is going to be better because people
> typically only have ascii text, and are trying to annotate on a
> terminal. Or if user encoding is better, because they will tend to have
> ascii annotations anyway, and if they have non-ascii text, they have the
> best chance of being able to read it by redirecting to a file, and
> reading it there.
>
> My gut feeling is that 'get_user_encoding()' is better in this instance,
> which is why it is what my patch does.
>

Revision history for this message
Alexander Belchenko (bialix) wrote :

John A Meinel пишет:
> fair enough. So the question becomes, what is more useful to you,
> Alexander. osutils.terminal_encoding() or get_user_encoding()
>
> I realize you may not use 'bzr annotate' much. I'm just trying to figure
> out if 'terminal_encoding()' is going to be better because people
> typically only have ascii text, and are trying to annotate on a
> terminal. Or if user encoding is better, because they will tend to have
> ascii annotations anyway, and if they have non-ascii text, they have the
> best chance of being able to read it by redirecting to a file, and
> reading it there.
>
> My gut feeling is that 'get_user_encoding()' is better in this instance,
> which is why it is what my patch does.

Thinking more about this I want to say that encoding problems is very frustrating.
On Windows users typically edit their files in GUI editors and therefore
use ANSI encoding (i.e. get_user_encoding). When bzr provide good GUI
for bzr it's easy to say that get_user_encoding is best guess.

But for some old-fashioned boys who remember DOS days and still use OEM encodings
for their files -- int this case using user encoding is wrong.

And because user can control encoding of terminal with chcp command
it's better to use terminal encoding.

It's really hard to say exactly what's better... Sigh...

--
Alexander

Jelmer Vernooij (jelmer)
Changed in bzr:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.