"man man > man.txt" produces invalid characters

Bug #320842 reported by robbbert
This bug affects 1 person
Affects                Status        Importance  Assigned to  Milestone
bsdmainutils (Debian)  Fix Released  Unknown
bsdmainutils (Ubuntu)  Fix Released  High        Unassigned
man-db (Ubuntu)        Won't Fix     Low         Unassigned

Bug Description

Binary package hint: man-db

The file man.txt, produced by the command "man man > man.txt" (other man pages are affected, too), displays invalid characters in several text editors (gedit, nano, abiword). These invalid characters include the continuation hyphen, and examining the file in a hex editor shows that they are Unicode characters.

This does not happen when using man's "--encoding" option (tested with UTF-8 and ISO-8859-1).
gedit will auto-detect the file's character set as ISO-8859-15, but will refuse to open the file when the UTF-8 character set is explicitly selected in the File Open dialog.
In gnome-terminal, the man page is displayed correctly (with no "--encoding" option, or converted to UTF-8).
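
For anyone wanting to confirm the symptom without a hex editor, the following is a minimal sketch that reports where a byte string stops being valid UTF-8. The function name first_invalid_utf8_offset is made up for this illustration and is not part of man-db or any other tool mentioned here.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration only: it relies on Python's strict
    UTF-8 decoder, which raises UnicodeDecodeError and records where the
    offending sequence starts.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    # Assumes the file from the report exists in the current directory.
    with open("man.txt", "rb") as fh:
        offset = first_invalid_utf8_offset(fh.read())
    if offset is None:
        print("man.txt is valid UTF-8")
    else:
        print(f"man.txt: invalid UTF-8 starting at byte offset {offset}")
```

A result of None means the whole input decodes cleanly; any integer result would be consistent with gedit refusing to open the file as UTF-8.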

Ubuntu 8.10
man-db 2.5.2-2

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Colin Watson (cjwatson) wrote :

Thanks for your report. This is primarily a bug in the col program (in the bsdmainutils package), which man uses to filter some special characters out of groff output when writing to a file. Unfortunately col does not handle UTF-8 correctly, and the result is a file containing invalid UTF-8, which editors will quite reasonably refuse to treat as UTF-8 and will certainly not auto-detect as UTF-8 (although some editors provide a way to force the issue). This is another symptom of the same problem reported in Debian as http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=319952.

In this case, groff outputs the UTF-8-encoded sequence of Unicode codepoints U+2010 U+0008 U+2010 to represent an overstruck (i.e. bold) continuation hyphen. col mangles that into the byte sequence E2 80 E2 80 90, constructed by removing the last byte from the UTF-8 representation of U+2010 and then appending the full representation of that same character. Correct behaviour would be for the U+0008 (backspace) character to backspace over the whole first character, not just part of it.
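
The byte sequence described above can be reproduced with a small sketch. This is not col's actual code (col also tracks columns and line motions); it only assumes a filter in which a backspace removes one byte, and contrasts that with one in which a backspace removes one whole character. Both function names are invented for this illustration.

```python
HYPHEN = "\u2010"   # continuation hyphen; E2 80 90 in UTF-8
BACKSPACE = "\b"    # U+0008


def backspace_bytewise(data: bytes) -> bytes:
    """Simplified model of the buggy behaviour: backspace drops one byte."""
    out = bytearray()
    for byte in data:
        if byte == 0x08:
            if out:
                out.pop()
        else:
            out.append(byte)
    return bytes(out)


def backspace_charwise(data: bytes) -> bytes:
    """Simplified model of the correct behaviour: backspace drops one character."""
    out = []
    for ch in data.decode("utf-8"):
        if ch == BACKSPACE:
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out).encode("utf-8")


# The overstruck hyphen as groff emits it: U+2010 U+0008 U+2010.
groff_output = (HYPHEN + BACKSPACE + HYPHEN).encode("utf-8")

mangled = backspace_bytewise(groff_output)
fixed = backspace_charwise(groff_output)

print(mangled.hex(" "))  # e2 80 e2 80 90
print(fixed.hex(" "))    # e2 80 90
```

The byte-wise result is the truncated hyphen followed by a complete one, which no strict UTF-8 decoder will accept; the character-wise result is a single valid U+2010.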

I'm leaving a bug task open on man-db at a lower importance because I do think man-db bears some responsibility for the tools it uses, even if they're clearly buggy. Given the historical problems with col, I have been wondering for a while if I shouldn't produce a miniature implementation of it and embed it into man. Normally duplication is bad, and it makes me feel uncomfortable, but in this case col's implementation is pretty stable and unlikely to need to vary significantly among systems.

Changed in bsdmainutils:
importance: Undecided → High
status: New → Triaged
Changed in man-db:
importance: Undecided → Low
status: New → Triaged
Changed in bsdmainutils:
status: Unknown → New
Changed in bsdmainutils (Debian):
status: New → Fix Committed
Changed in bsdmainutils (Debian):
status: Fix Committed → Fix Released
Colin Watson (cjwatson) wrote :

bsdmainutils 8.0.1 in Debian unstable fixes this, so I no longer expect to put the effort in to work around it in man-db. We should get bsdmainutils 8.0.1 in Lucid in the near future.

Changed in man-db (Ubuntu):
status: Triaged → Won't Fix
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bsdmainutils - 8.0.1ubuntu1

---------------
bsdmainutils (8.0.1ubuntu1) lucid; urgency=low

  * Resynchronise with Debian (col multibyte update fixes LP: #320842).
    Remaining changes:
    - calendar.ubuntu:
      + Update to contain 8.04 LTS, 8.10, 9.04, and 9.10.
      + Note that 6.06 was an LTS release.
    - Depend on cpp.

bsdmainutils (8.0.1) unstable; urgency=low

  * Added patch system to be able to track upstream's source more
    directly.
  * Updated ncal to the latest upstream version. (Closes: #415057)
  * New maintainers. (Closes: #543833)
  * Bumped Standards version to 3.8.3, no update needed.
  * Made ncal find the first day of the week automatically.
    (Closes: #472355)
  * Added -3 option to cal and ncal. (Closes: #497014)
  * Made ncal cope with longer month names. (Closes: #528657)
  * Document that cal always prints 8 lines. (Closes: #367299)
  * Made ncal catch incorrect year parameter. (Closes: #431930)
  * Made ncal use locale information for knowing how to display week.
    (Closes: #361223)
  * Updated col to the latest FreeBSD version.
    (Closes: #319952, #348032, #484579)
  * Patched col to recognize a single non-empty line without newline.
    (Closes: #335087)
  * Updated colrm to the latest FreeBSD version. (Closes: #516271)
  * Updated column to latest FreeBSD version. (Closes: #368384)
  * Made column react more gracefully upon reading empty fields.
    (Closes: #382638)
  * Re-added '-n' option to column and documented it as a Debian
    extension. (Closes: #485809)
  * Updated colcrt to the latest FreeBSD version.
  * Updated banner to latest FreeBSD version.
  * Renamed banner to printerbanner. (Closes: #315664)
  * Updated hexdump to its latest FreeBSD version.
  * Bumped debhelper compat level.
  * Added patch to prevent segfault in case an empty repetition is
    given. (Closes: #498232)
  * Added patch to make hd ignore -C option. (Closes: #487985)
  * Updated from to latest FreeBSD version.
  * Updated ul to latest FreeBSD version but kept our changes
  * Updated lorder to latest version from OpenBSD.
  * Re-added patch that makes lorder use signal names instead of
    numbers.
  * Updated look to latest version from FreeBSD. (Closes: #547622)
  * Reimplemented and documented -b option for look. (Closes: #264996)
  * Update write to the latest FreeBSD version.
  * Allow writing from a terminal that has mesg n set. (Closes: #455248)
  * Updated calendar binary to latest version from OpenBSD. (Closes: #503276)
  * Patched new calendar sources to use wide-character functions.
  * Created patch to put Debian specific options back into calendar
    tool. (Closes: #293689)
  * Updated all calendars from their FreeBSD sources.
    (Closes: #388153, #413900, #446547, #542229)
  * Added/Fixed some calendar entries as reported by Debian users.
    (Closes: #554561, #337311, #525925, #493759, #381114, #280176)
  * Added Kazakhstan holiday calendar. (Closes: #358609)
  * Updated Ubuntu calendar. (Closes: #474600)
  * Removed double word in manpage. (Closes: #401091)
  * Changed patch of source.data.gz mentioned in README. (Closes: #507602)
  * Added lintian override for ...

Changed in bsdmainutils (Ubuntu):
status: Triaged → Fix Released