Bug #79084 “Annotate with --show-ids has encoding problem” : Bugs : Bazaar

Wouter van Heyst (larstiq) on 2007-01-13

Changed in bzr:
importance:	Undecided → Medium
status:	Unconfirmed → Confirmed

Revision history for this message

John A Meinel (jameinel) wrote on 2007-01-16:

#1

Properly encode author and revision id. Edit (1.3 KiB, text/plain)

I can confirm this, and I know where the problems are...

Specifically the line:

to_file.write('%*s | %s' % (max_origin_len, this, text))

is failing because 'this' (being the revision id) is a unicode object, while 'text' is whatever text is in the file.

We *don't* want to decode the text of the file, because it can be anything the user entered. But when you have a structure like the above, it automatically upcasts the second string to being Unicode.

It should actually be failing for 'bzr annotate --long' but it seems like the API of cElementTree has changed a little bit. It now returns a plain string if it can, rather than always returning a Unicode string. Which means that when possible, the Author (committer) field is returned as a plain string, rather than Unicode. (And Erik Bågfors hasn't made a change to NEWS that includes a line with a non-ascii character)

The difference is that our unpacking code actually always makes sure that revision ids are unicode, because it uses cache_utf8.get_cached_unicode(revision_id) We do that, because it means our revision ids are single objects (which saves some memory, and usually improves performance a bit)

So there are a couple easy ways to reproduce:

$ bzr init foo
$ cd foo
$ bzr whoami --branch "Erik Bågfors <email address hidden>"
$ echo -e "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --long a
$ bzr annotate --show-ids a

# even without the whoami, you can do

$ bzr init foo
$ cd foo
$ echo "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --show-ids a

How to fix...

I think the correct fix is to declare that all annotation information is in either terminal encoding or utf-8 encoding. (I would *really* like to say utf-8, but that might be difficult for someone like Alexander). Actually, because of terminal differences, it might be best to be in file-encoding (yet another value on windows), which I think is closest to bzrlib.user_encoding.

The reason is it has the best chance to be in the same encoding as the actual contents of the file. Which means that doing: bzr annotate foo.txt > foo.ann.txt; notepad foo.ann.txt
Is the most likely to give you something readable.

(One small note, is that the actual annotations we add are going to be ascii, because to date revision ids have been 100% ascii, and we only include the email portion of the author, which is also ascii...)

Attached is a patch which uses user encoding.

I can confirm this, and I know where the problems are...

Specifically the line:

to_file.write('%*s | %s' % (max_origin_len, this, text))

is failing because 'this' (being the revision id) is a unicode object, while 'text' is whatever text is in the file.

We *don't* want to decode the text of the file, because it can be anything the user entered. But when you have a structure like the above, it automatically upcasts the second string to being Unicode.

It should actually be failing for 'bzr annotate --long' but it seems like the API of cElementTree has changed a little bit. It now returns a plain string if it can, rather than always returning a Unicode string. Which means that when possible, the Author (committer) field is returned as a plain string, rather than Unicode. (And Erik Bågfors hasn't made a change to NEWS that includes a line with a non-ascii character)

The difference is that our unpacking code actually always makes sure that revision ids are unicode, because it uses cache_utf8.get_cached_unicode(revision_id) We do that, because it means our revision ids are single objects (which saves some memory, and usually improves performance a bit)

So there are a couple easy ways to reproduce:

$ bzr init foo
$ cd foo
$ bzr whoami --branch "Erik Bågfors <nobody@nowhere.com>"
$ echo -e "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --long a
$ bzr annotate --show-ids a

# even without the whoami, you can do

$ bzr init foo
$ cd foo
$ echo "bår" > a
$ bzr add a
$ bzr commit -m a
$ bzr annotate --show-ids a

How to fix...

I think the correct fix is to declare that all annotation information is in either terminal encoding or utf-8 encoding. (I would *really* like to say utf-8, but that might be difficult for someone like Alexander). Actually, because of terminal differences, it might be best to be in file-encoding (yet another value on windows), which I think is closest to bzrlib.user_encoding.

The reason is it has the best chance to be in the same encoding as the actual contents of the file. Which means that doing: bzr annotate foo.txt > foo.ann.txt; notepad foo.ann.txt
Is the most likely to give you something readable.

(One small note, is that the actual annotations we add are going to be ascii, because to date revision ids have been 100% ascii, and we only include the email portion of the author, which is also ascii...)

Attached is a patch which uses user encoding.

Revision history for this message

Alexander Belchenko (bialix) wrote on 2007-01-16: Re: [Bug 79084] Re: Annotate with --show-ids has encoding problem

#2

John A Meinel пишет:
> I can confirm this, and I know where the problems are...
<skip>
>
> (One small note, is that the actual annotations we add are going to be
> ascii, because to date revision ids have been 100% ascii, and we only
> include the email portion of the author, which is also ascii...)

I should point to the fact that email portion of the author
could be non-ascii: if I login on Windows with russian name
and *don't* set my user-id for bzr, then bzr create auto-id
based on my login and computer name. And in this case we
easy can have e-mail like string with russian characters, like:

<Саша@home>

Alexander

Revision history for this message

John A Meinel (jameinel) wrote on 2007-01-16:

#3

fair enough. So the question becomes, what is more useful to you, Alexander. osutils.terminal_encoding() or get_user_encoding()

I realize you may not use 'bzr annotate' much. I'm just trying to figure out if 'terminal_encoding()' is going to be better because people typically only have ascii text, and are trying to annotate on a terminal. Or if user encoding is better, because they will tend to have ascii annotations anyway, and if they have non-ascii text, they have the best chance of being able to read it by redirecting to a file, and reading it there.

My gut feeling is that 'get_user_encoding()' is better in this instance, which is why it is what my patch does.

Revision history for this message

Alexander Belchenko (bialix) wrote on 2007-01-16:

#4

John A Meinel пишет:
> fair enough. So the question becomes, what is more useful to you,
> Alexander. osutils.terminal_encoding() or get_user_encoding()

I think the right solution is to use get_user_encoding()
because on Windows it equivalent to system default ANSI encoding.

If people don't satisfied with this encoding they *always*
can switch encoding of dos terminal with chcp command.
I use this command when I want to see in dos terminal
a diff for my non-ascii files.

>
> I realize you may not use 'bzr annotate' much. I'm just trying to figure
> out if 'terminal_encoding()' is going to be better because people
> typically only have ascii text, and are trying to annotate on a
> terminal. Or if user encoding is better, because they will tend to have
> ascii annotations anyway, and if they have non-ascii text, they have the
> best chance of being able to read it by redirecting to a file, and
> reading it there.
>
> My gut feeling is that 'get_user_encoding()' is better in this instance,
> which is why it is what my patch does.
>

Revision history for this message

Alexander Belchenko (bialix) wrote on 2007-01-16:

#5

John A Meinel пишет:
> fair enough. So the question becomes, what is more useful to you,
> Alexander. osutils.terminal_encoding() or get_user_encoding()
>
> I realize you may not use 'bzr annotate' much. I'm just trying to figure
> out if 'terminal_encoding()' is going to be better because people
> typically only have ascii text, and are trying to annotate on a
> terminal. Or if user encoding is better, because they will tend to have
> ascii annotations anyway, and if they have non-ascii text, they have the
> best chance of being able to read it by redirecting to a file, and
> reading it there.
>
> My gut feeling is that 'get_user_encoding()' is better in this instance,
> which is why it is what my patch does.

Thinking more about this I want to say that encoding problems is very frustrating.
On Windows users typically edit their files in GUI editors and therefore
use ANSI encoding (i.e. get_user_encoding). When bzr provide good GUI
for bzr it's easy to say that get_user_encoding is best guess.

But for some old-fashioned boys who remember DOS days and still use OEM encodings
for their files -- int this case using user encoding is wrong.

And because user can control encoding of terminal with chcp command
it's better to use terminal encoding.

It's really hard to say exactly what's better... Sigh...

--
Alexander

Jelmer Vernooij (jelmer) on 2011-02-01

Changed in bzr:
status:	Confirmed → Fix Released

Bazaar

Annotate with --show-ids has encoding problem

Bug Description

Other bug subscribers

Patches

Remote bug watches