The wrong encoding is chosen if '*' is in the Accept-Charset header, but 'utf-8' isn't

Bug #40329 reported by Björn Tillenius
Affects: Launchpad itself
Status: Fix Released
Importance: Medium
Assigned to: Björn Tillenius

Bug Description

If '*' is present in the Accept-Charset header, but 'utf-8' isn't, Launchpad might oops, since the wrong encoding is used to encode the page. 'utf-8' should be chosen, but isn't. See OOPS-109B542 for an example.

This has been reported upstream as http://www.zope.org/Collectors/Zope3-dev/587
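
The intended behaviour described above can be sketched as follows. This is only an illustration of the expected negotiation result, not Zope's or Launchpad's actual code; the function names `parse_accept_charset` and `choose_charset` are hypothetical.

```python
def parse_accept_charset(header):
    """Return a {charset: q} mapping from an Accept-Charset header value."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a charset without an explicit q value defaults to 1.0
        for param in params.split(";"):
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        prefs[name.strip().lower()] = q
    return prefs


def choose_charset(header):
    """Pick the response charset, preferring UTF-8 whenever it is acceptable."""
    prefs = parse_accept_charset(header)
    # An explicit 'utf-8' entry wins; otherwise '*' covers utf-8 as well.
    utf8_q = prefs.get("utf-8", prefs.get("*"))
    if utf8_q is not None and utf8_q > 0:
        return "utf-8"
    # UTF-8 is not acceptable: fall back to the best explicitly listed charset.
    candidates = [(q, name) for name, q in prefs.items() if name != "*" and q > 0]
    if candidates:
        return max(candidates, key=lambda c: c[0])[1]
    return "utf-8"  # nothing usable was listed, so default to UTF-8
```

With this sketch, a header such as `ISO-8859-1, *;q=0.5` yields `utf-8`, which is the case this bug is about, while a header of just `ISO-8859-1` yields `iso-8859-1`.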

Diogo Matsubara (matsubara) wrote :

OOPS reports never lie.

Changed in launchpad:
status: Unconfirmed → Confirmed
James Henstridge (jamesh) wrote :

We've also got OOPS-109C83:

HTTP_USER_AGENT=Mozilla/5.0 (compatible; Konqueror/3.0.0-10; Linux)
HTTP_ACCEPT_CHARSET=ISO-8859-1

Here, the user's browser has reported that it can only handle ISO-8859-1 (and nothing else), so getting Zope to serve UTF-8 if "*" is included in the accepted charsets wouldn't fix the problem for this person.

I wonder what should be done for situations like this?

James Henstridge (jamesh) wrote :

One other thing to consider here is the encoding of returned data.

Pretty much none of our forms use the accept-charset attribute on the <form> element, so the web browser will generally use the encoding of the page when sending back data.

While we are serving pages as UTF-8, everything works fine. If we sometimes serve pages as ISO-8859-1, then users may sometimes send us ISO-8859-1 data in form posts.
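
A small sketch of why mixed page encodings matter for form data, assuming the server decodes every submitted value as UTF-8. The helper `decode_form_value` is hypothetical and is not part of Zope or Launchpad.

```python
name = "Björn"

# Bytes a browser would send for the same text, depending on the page encoding.
latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")


def decode_form_value(raw, charset="utf-8"):
    """Decode a submitted form value, returning None if the bytes do not fit."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


# ISO-8859-1 bytes are not valid UTF-8 here, so decoding fails.
failed = decode_form_value(latin1_bytes)

# The opposite mix-up does not fail; it silently produces corrupted text.
mojibake = utf8_bytes.decode("iso-8859-1")
```

In this sketch `failed` is `None`, and `mojibake` is a garbled string rather than the original name, which is the kind of corrupted data that could end up stored.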

Björn Tillenius (bjornt) wrote : Re: [Bug 40329] Re: Oops if 'utf-8' isn't in HTTP_ACCEPT_CHARSET

On Fri, Apr 21, 2006 at 05:41:32AM -0000, James Henstridge wrote:
> While we are serving pages as UTF-8, everything works fine. If we
> sometimes serve pages as ISO-8859-1, then users may sometimes send us
> ISO-8859-1 data in form posts.

Yes, this is true. Zope deals badly with encoding/decoding errors like
this. But I guess the reason no one has cared much about it is that I
have not yet seen a client that does not accept UTF-8. And if a client
does accept UTF-8, Zope will choose it to encode the page.

description: updated
Björn Tillenius (bjornt) wrote :

On Fri, Apr 21, 2006 at 01:21:14AM -0000, James Henstridge wrote:
> We've also got OOPS-109C83:
>
> HTTP_USER_AGENT=Mozilla/5.0 (compatible; Konqueror/3.0.0-10; Linux)
> HTTP_ACCEPT_CHARSET=ISO-8859-1
>
> I wonder what should be done for situations like this?

This is a different problem, so I edited the description of this bug to
reflect the problem in OOPS-109B542, and filed bug 40494 about this
problem.

Changed in launchpad:
assignee: nobody → bjornt
status: Confirmed → In Progress
Changed in launchpad:
status: In Progress → Fix Committed
Stuart Bishop (stub) wrote :

I think we need to ignore the header entirely, always emitting UTF-8. If we emit pages in a charset other than UTF-8, then the client will submit forms with an encoding other than UTF-8 which will generate exceptions or cause us to store corrupted data.

We *could* emit a custom error page informing the user that their client must accept UTF-8, but I think that this would be counterproductive as it is most likely that the browser is simply lying and UTF-8 encoded data is perfectly acceptable.
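
The "ignore the header entirely" approach could look roughly like the sketch below. The function `render_response` and its parameters are hypothetical and only illustrate the idea; this is not the fix that was actually committed.

```python
def render_response(body_text, accept_charset=None):
    """Encode the page as UTF-8 no matter what the client advertised.

    accept_charset is accepted only to make the point explicit: its value
    is deliberately ignored.
    """
    body = body_text.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body


# Even a client that only advertises ISO-8859-1 receives UTF-8.
headers, body = render_response("Björn", accept_charset="ISO-8859-1")
```

Because the page always declares `charset=utf-8`, a browser that follows the page encoding when submitting forms should send UTF-8 back as well.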

Björn Tillenius (bjornt) wrote : Re: [Bug 40329] Re: The wrong encoding is chosen if '*' is in the Accept-Charset header, but 'utf-8' isn't

On Mon, Apr 24, 2006 at 02:16:07PM -0000, Stuart Bishop wrote:
> I think we need to ignore the header entirely, always emitting UTF-8.
> If we emit pages in a charset other than UTF-8, then the client will
> submit forms with an encoding other than UTF-8 which will generate
> exceptions or cause us to store corrupted data.

Well, if we send the page as ISO-8859-1 the user will send us data
encoded as ISO-8859-1, and everything will be fine. We decode the POST
data using the same encoding we will send that page in. I'm not sure what
will happen if the user tries to POST non-ISO-8859-1 characters, though.

Anyway, I agree that we should ignore the header and always send UTF-8.
I think that will cause us and the users the least amount of problems.
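
On the open question of posting non-ISO-8859-1 characters: such characters simply cannot be represented in that charset. The sketch below assumes the client replaces them with numeric character references, which is a commonly described browser behaviour but is an assumption here, not something confirmed in this bug. The helper `encode_for_form` is hypothetical.

```python
def encode_for_form(value, charset):
    """Encode a form value, replacing unencodable characters with &#NNNN;."""
    return value.encode(charset, errors="xmlcharrefreplace")


# Characters inside ISO-8859-1 encode normally.
plain = encode_for_form("Björn", "iso-8859-1")

# Characters outside ISO-8859-1 become numeric character references,
# so the server receives markup-like text instead of the original characters.
replaced = encode_for_form("日本", "iso-8859-1")

# Without a replacement strategy, a strict encode would fail outright.
try:
    "日本".encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Either outcome, an error or silently altered text, is avoided if pages and form data are always UTF-8.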

Stuart Bishop (stub) wrote : Re: [Bug 40329] Re: [Bug 40329] Re: The wrong encoding is chosen if '*' is in the Accept-Charset header, but 'utf-8' isn't

Björn Tillenius wrote:

> Well, if we send the page as ISO-8859-1 the user will send us data
> encoded as ISO-8859-1, and everything will be fine. We decode the POST
> data using the same encoding we will send that page in. I'm not sure what
> will happen if the user tries to POST non-ISO-8859-1 characters, though.

Zope3 can only decode if this is a POST request. Our search forms use GET
(to make them bookmarkable) - I suspect this data is assumed to be UTF-8.
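
A minimal sketch of the GET case, assuming the query string is always decoded as UTF-8 on the server side (as suspected above, not confirmed). It uses only the standard library `urllib.parse`; the helper `decode_query_value` is hypothetical.

```python
from urllib.parse import quote, unquote

# What a browser would put in the URL if the form page was ISO-8859-1.
encoded = quote("Björn", encoding="iso-8859-1")


def decode_query_value(value):
    """Decode a percent-encoded query value as UTF-8, or return None."""
    try:
        return unquote(value, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# The ISO-8859-1 percent-encoding is not valid UTF-8, so strict decoding fails.
result = decode_query_value(encoded)
```

A bookmarked search URL carries no record of the charset it was encoded with, which is one more reason to keep everything UTF-8.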

> Anyway, I agree that we should ignore the header and always send UTF-8.
> I think that will cause us and the users the least amount of problems.

--
Stuart Bishop <email address hidden> http://www.canonical.com/
Canonical Ltd. http://www.ubuntu.com/

Changed in launchpad:
status: Fix Committed → Fix Released