Comment 3 for bug 320842

Colin Watson (cjwatson) wrote :

Thanks for your report. This is primarily a bug in the col program (in the bsdmainutils package), which man uses to filter some special characters out of groff output when writing to a file. Unfortunately col does not deal correctly with UTF-8, and the result is a file containing invalid UTF-8, which editors will quite reasonably refuse to treat as UTF-8, let alone detect as UTF-8 automatically (although some editors provide a way to force the issue). This is another symptom of the same problem reported in Debian as http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=319952.

In this case, groff outputs the UTF-8-encoded sequence of Unicode codepoints U+2010 U+0008 U+2010 to represent an overstruck (i.e. bold) continuation hyphen. col mangles that into the byte sequence E2 80 E2 80 90, constructed by removing the last byte from the UTF-8 representation of U+2010 and then appending the full representation of that same character. Correct behaviour would be for the U+0008 (backspace) character to backspace over the whole first character, not just part of it.
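
To make the byte arithmetic above concrete, here is a small Python sketch. It is not col's actual algorithm, only a toy model that assumes the backspace is applied to single bytes; the function names backspace_bytewise and backspace_charwise are made up for illustration. It reproduces the mangled sequence described above and contrasts it with backspacing over a whole character.

```python
# Toy model only: col's real logic is more involved. The assumption here is
# simply that a backspace removes one byte instead of one whole character.

HYPHEN = "\u2010"  # U+2010 HYPHEN, encoded in UTF-8 as E2 80 90

# What groff emits for an overstruck (bold) continuation hyphen.
groff_output = (HYPHEN + "\b" + HYPHEN).encode("utf-8")


def backspace_bytewise(data: bytes) -> bytes:
    """Hypothetical byte-oriented handling: U+0008 erases one byte."""
    out = bytearray()
    for b in data:
        if b == 0x08:
            if out:
                out.pop()  # removes only the trailing 0x90
        else:
            out.append(b)
    return bytes(out)


def backspace_charwise(data: bytes) -> bytes:
    """Hypothetical character-aware handling: U+0008 erases one codepoint."""
    out = []
    for ch in data.decode("utf-8"):
        if ch == "\b":
            if out:
                out.pop()  # removes the whole first U+2010
        else:
            out.append(ch)
    return "".join(out).encode("utf-8")


mangled = backspace_bytewise(groff_output)
correct = backspace_charwise(groff_output)

print(mangled.hex(" "))  # e2 80 e2 80 90
print(correct.hex(" "))  # e2 80 90
```

Trying mangled.decode("utf-8") raises UnicodeDecodeError, which is why editors decline to treat the file as UTF-8, whereas the character-aware result decodes to a single U+2010.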

I'm leaving a bug task open on man-db at a lower importance because I do think man-db bears some responsibility for the tools it uses, even if they're clearly buggy. Given the historical problems with col, I have been wondering for a while if I shouldn't produce a miniature implementation of it and embed it into man. Normally duplication is bad, and it makes me feel uncomfortable, but in this case col's implementation is pretty stable and unlikely to need to vary significantly among systems.