misinterpreted chars with file:// links in iso8859-x encoded docs

Bug #50213 reported by Daniel Musketa
Affects           Status        Importance  Assigned to  Milestone
KDE Base          Fix Released  Medium
kdebase (Ubuntu)  Fix Released  Medium      Unassigned

Bug Description

Binary package hint: konqueror

Encoding the following as iso8859-15 and loading with kubuntu 6.06's Konqueror 3.5.2 leads to wrong links:

<meta http-equiv="Content-Type" content="text/html; charset=iso8859-15" />
 <a href="file://Müller">Link with "&uuml;" as 0xFC</a><br />
 <a href="file://M&uuml;ller">Link with "&uuml;" as &amp;uuml;</a><br />
 <a href="file://M%FCller">Link with "&uuml;" as %FC</a>
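For reference, the three spellings in the test case are meant to name the same character. A minimal Python sketch, assuming the page is decoded as iso-8859-15 and that the percent-escape is read under the page charset (an assumption made only for this illustration; the variable names are hypothetical):

```python
import html
from urllib.parse import unquote

CHARSET = "iso-8859-15"  # charset declared by the test page (dashed spelling)

# 1. Raw byte 0xFC, as it sits in the saved file.
raw = b"M\xfcller".decode(CHARSET)

# 2. HTML character reference, independent of the page encoding.
entity = html.unescape("M&uuml;ller")

# 3. Percent-escape, decoded under the page charset (an assumption).
escaped = unquote("M%FCller", encoding=CHARSET)

# If all three agree, every link should point at the same file name.
print(raw == entity == escaped)
```

A browser that resolves all three links to the same target is effectively doing what this sketch does.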

Revision history for this message
Julian Olivien (julianolivien) wrote :

On Edgy with Konqueror 3.3.5, only the first link fails. The other two links point to the correct file.

the first one links to:
file://mã?ller

Note: Ubuntu was configured to use German at install time.

Revision history for this message
Frank Siegert (fsiegert) wrote :

Interestingly, there is a difference whether you serve the file via Apache or view it directly from your hard drive:

Apache 2.0.55 + Konqueror 3.5.5 ==> the first two links work, while the third doesn't (file://m??ller)
Directly + Konqueror 3.5.5 ==> Same as Titus mentions.

Can you confirm?

Revision history for this message
Micah Cowan (micahcowan) wrote :

What Content-Type header is Apache giving in that case? If the charset parameter is unset, or is set to something other than iso-8859-15, there's small wonder.

Also, note that charset=iso8859-15 is wrong; it should have a dash between iso and 8859.

Marking as Confirmed, since Titus & Frank have indicated that they've both observed the described behavior.

Changed in kdebase:
status: Unconfirmed → Confirmed
Revision history for this message
Frank Siegert (fsiegert) wrote :

Ah, that's interesting. With the dash, i.e. charset=iso-8859-15 it works fine viewed directly (file://) but using Apache to serve it, only the middle one works.

But actually, file links with _two_ slashes are not correct: Konqueror (at least version 3.5.5) expects three slashes, just like Firefox. And if you use three slashes (i.e. file:///müller), you get exactly the same behaviour in Konqueror as in Firefox:

view local file: all three links work
view it through Apache: only the middle one works.

So you seem to be right, that it is the Apache header which somehow messes it up.
So this doesn't seem to be a Konqueror bug, if a bug at all. Can you confirm, Daniel?

Revision history for this message
Micah Cowan (micahcowan) wrote :

Does it work right if you configure Apache with something like «AddDefaultCharset iso-8859-15»?

Revision history for this message
Frank Siegert (fsiegert) wrote :

I have to take my statement back; I messed something up in my previous comment.

When using the real file links, i.e. file:/// the encoding doesn't work in Konqueror:
- from a local file, all three links show up scrambled as file:///MÃŒller (this works fine in Firefox)
- from the Apache server, only the middle one works (same problem in Firefox)

Adding "AddDefaultCharset iso-8859-15" made all three links bad in Konqueror, while in Firefox nothing changed (i.e. only the middle one works).

Can somebody confirm these, to make sure I didn't mess anything up again?

Revision history for this message
Micah Cowan (micahcowan) wrote :

I strongly suspect you've accidentally saved that file in UTF-8 rather than ISO-8859-15; those are /exactly/ the characters that should result from interpreting UTF-8 text as ISO-8859-15.

Using the appropriately-dashed version, saved with ISO-8859-15 encoding, here are the results I get: all three show correctly on Firefox (v2.0.0.3) via both Apache (serving it with a charset param of utf-8; the http-equiv overrides it for Firefox) and file:///home/micah/Desktop/test.html. Ditto for Konqueror 3.5.5. (I'm on Edgy Eft).

I'm not sure how two-slash file:// ought to work, as the specs require that file:// be followed by an absolute pathname, IIRC.
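To illustrate why the slash count matters, here is a small sketch using Python's generic URL splitter. It is not what either browser runs, only a demonstration of how the text after two slashes falls into the authority (host) position, while a third slash starts the path:

```python
from urllib.parse import urlsplit

two = urlsplit("file://M\u00fcller")     # file://Müller
three = urlsplit("file:///M\u00fcller")  # file:///Müller

# With two slashes, "Müller" is parsed as the host and the path is empty.
print(repr(two.netloc), repr(two.path))

# With three slashes, the host is empty and "/Müller" is the path.
print(repr(three.netloc), repr(three.path))
```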

Revision history for this message
Micah Cowan (micahcowan) wrote :

Disregard that: that's absolutely incorrect. I was being stupid. Results with UTF-8 in the headers are:

Konqueror: mller, müller & m??ller
Firefox: m�ller, xn--mller-kva & m%fcller

From file:///home/micah/Desktop/test.html:

Konqueror: all good.
Firefox: xn--mller-kva, xn--mller-kva, müller

With ISO-8859-1 in the Apache headers (has ü in the same code position as ISO-8859-15, so equivalent):

Konqueror: all good.
Firefox: xn--mller-kva, xn--mller-kva, müller

And finally, changing the file:// links to use three slashes, for both Apache (with correct headers) and file:///home/micah/Desktop/test.html:

Konqueror: MÃŒller, MÃŒller, MÃŒller (as Frank reported).
Firefox: all good.

It appears that Konqueror is correctly recognizing the ü via ISO-8859, and then transliterating it as UTF-8 internally, and then transliterating /that/ back out to ISO-8859-15 for the links. Definitely screwy. I also tested Konqueror with fully-correct HTML tags (including <html> and <head>), and got the same results.
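The suspected double conversion can be reproduced outside the browser. A minimal sketch, assuming only that the link text is encoded as UTF-8 at some point and then read back as ISO-8859-15 (this shows the character-level effect, not Konqueror's actual code path):

```python
name = "M\u00fcller"  # "Müller"

# Step 1: the correctly recognized name is stored as UTF-8 (two bytes for ü).
utf8_bytes = name.encode("utf-8")

# Step 2: those bytes are misread as ISO-8859-15, one character per byte.
garbled = utf8_bytes.decode("iso-8859-15")

# Show the garbled name and the code points that replaced the single ü.
print(garbled, [hex(ord(c)) for c in garbled[1:3]])

# Reversing the two steps recovers the original, which supports the diagnosis.
print(garbled.encode("iso-8859-15").decode("utf-8") == name)
```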

Revision history for this message
Micah Cowan (micahcowan) wrote :

> Firefox: xn--mller-kva, xn--mller-kva, müller

(Note that Firefox apparently has its own special understanding of two-slash file://; that's probably fine, since file:// isn't legal anyway; it works fine with three slashes.)

Micah Cowan (micahcowan)
Changed in kdebase:
importance: Undecided → Medium
Revision history for this message
Micah Cowan (micahcowan) wrote :

(that special understanding being to interpret it as a hostname, and encode it via punycode.)
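The punycode form can be checked with Python's built-in `idna` codec (which implements the older IDNA 2003 rules; that is sufficient for this name). This is only a sketch of the hostname encoding, not of Firefox's internals:

```python
host = "m\u00fcller"  # "müller", treated as a hostname

# Encode to the ASCII-compatible form used for internationalized hostnames.
ace = host.encode("idna")
print(ace)

# Decoding the ASCII form gives the original name back.
print(ace.decode("idna") == host)
```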

Revision history for this message
Micah Cowan (micahcowan) wrote :

Further research has revealed that, in any case, that use of %FC in a URI to represent ü is inappropriate. HTTP URIs are only defined when the percent-encodings represent US-ASCII characters. If they do not, it is not specified how they should be interpreted; however, modern specifications such as RFC 3987, Internationalized Resource Identifiers, dictate that they be interpreted as UTF-8. RFC 3986, defining URIs, also recommends that future URI schemes that allow for character sets beyond ASCII, use percent-encoded UTF-8.

Firefox, it turns out, is smart enough to treat both %FC by itself (which is not a valid UTF-8 encoding), and %C3%BC (the UTF-8 encoding of ü) the same (which is apparently what the W3C recommends). Konqueror, unfortunately, currently treats the latter even worse than the first; both are broken.
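A rough sketch of the two percent-encodings and of a tolerant decoder. The helper `decode_escapes` is hypothetical: it tries UTF-8 first and falls back to ISO-8859-1, which is one plausible way to treat both spellings alike, not a description of Firefox's actual algorithm:

```python
from urllib.parse import quote, unquote_to_bytes


def decode_escapes(text: str) -> str:
    """Hypothetical helper: decode percent-escapes, preferring UTF-8."""
    raw = unquote_to_bytes(text)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0xFC is not valid UTF-8, so fall back to a single-byte charset.
        return raw.decode("iso-8859-1")


utf8_form = quote("\u00fc")                           # UTF-8 percent-encoding of ü
legacy_form = quote("\u00fc", encoding="iso-8859-1")  # single-byte percent-encoding
print(utf8_form, legacy_form)

# Both spellings decode to the same name under the fallback rule.
print(decode_escapes("M%FCller") == decode_escapes("M%C3%BCller"))
```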

Revision history for this message
Micah Cowan (micahcowan) wrote :

Linked with upstream bug. While this bug doesn't necessarily look like it's related to this one, close inspection reveals that it is; several bugs very similar to this one have been marked as duplicates of upstream bug #55177.

Revision history for this message
Micah Cowan (micahcowan) wrote :

(Disregard the link that launchpad automatically generated for the upstream bug, above. It is invalid [links to LP bug].)

Revision history for this message
Micah Cowan (micahcowan) wrote :

Upstream says they believe the problem is fixed in KDE 4 (currently in alpha, I think?), but *cannot* be fixed in KDE 3.

This means we're unlikely to see a resolution for this problem until KDE 4 is released, and available under Ubuntu. :-/

Changed in kdebase:
status: Unknown → Confirmed
Revision history for this message
Jonathan Thomas (echidnaman) wrote :

Fixed with Konqueror 4.1, released in Intrepid

Changed in kdebase:
status: Confirmed → Fix Released
Changed in kdebase:
importance: Unknown → Medium
Changed in kde-baseapps:
status: Confirmed → Fix Released