mawk counts characters incorrectly

Bug #1462737 reported by Jarno Suni
This bug affects 1 person
Affects: mawk (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

$ echo ä | mawk '{print length($0)}'
outputs 2. I expect 1.

$ echo äo | mawk '{print match($0,"o")}'
outputs 3. I expect 2.

This is probably due to how mawk handles UTF-8 input: it counts bytes rather than characters, and ä is encoded as two bytes in UTF-8. gawk behaves the same way when the -b option is used.
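To illustrate the byte/character difference independently of mawk, here is a minimal sketch (not part of the original report). The helper name byte_len is made up for this example, and the gawk comparison only runs if gawk happens to be installed; its character count also assumes a UTF-8 locale is active.

```shell
#!/bin/sh
# "äo" written with explicit octal escapes: ä is the two UTF-8 bytes
# 0xC3 0xA4, so the script does not depend on this file's encoding.
s=$(printf '\303\244o')

# Hypothetical helper: wc -c counts bytes regardless of locale.
# tr strips the padding some wc implementations add.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

echo "bytes: $(byte_len "$s")"

# Optional comparison, only if gawk is available.
# Without -b the result depends on the current locale;
# with -b gawk treats every byte as one character.
if command -v gawk >/dev/null 2>&1; then
    printf '%s\n' "$s" | gawk '{ print "gawk:    " length($0) }'
    printf '%s\n' "$s" | gawk -b '{ print "gawk -b: " length($0) }'
fi
```

The byte count is 3 for this string. In a UTF-8 locale, gawk without -b should report 2 and gawk -b should report 3, which matches what mawk prints.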

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: mawk 1.3.3-17ubuntu2
ProcVersionSignature: Ubuntu 3.13.0-53.89-lowlatency 3.13.11-ckt19
Uname: Linux 3.13.0-53-lowlatency x86_64
ApportVersion: 2.14.1-0ubuntu3.11
Architecture: amd64
CurrentDesktop: XFCE
Date: Sun Jun 7 15:52:26 2015
Dependencies:
 gcc-4.9-base 4.9.1-0ubuntu1
 libc6 2.19-0ubuntu6.6
 libgcc1 1:4.9.1-0ubuntu1
 multiarch-support 2.19-0ubuntu6.6
EcryptfsInUse: Yes
InstallationDate: Installed on 2014-09-21 (259 days ago)
InstallationMedia: Ubuntu-Studio 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.1)
SourcePackage: mawk
UpgradeStatus: No upgrade log present (probably fresh install)

Jarno Suni (jarnos) wrote :

I guess this is by design. I think some operations are faster if you count bytes instead of characters. There could be an option that makes mawk count characters, though.

description: updated
Jarno Suni (jarnos) wrote :

Or better, it should work the same way as gawk, i.e. treat all input data as single-byte characters only if the -b or --characters-as-bytes option is used.
