i18n: Header Filter Rules (& fix) - rules don't match if header characters aren't representable in cset of list's preferred language.

Bug #558155 reported by hatukanezumi
Affects: GNU Mailman
Status: Fix Released
Importance: Low
Assigned to: Mark Sapiro

Bug Description

- Decode headers to be matched.

- Normalize header & pattern so that compatibility characters
  (fullwidth forms of ASCII, Compatibility Ideographs etc.)
  will be matched. Normalization Form KC (NFKC) is used.
  Note: This feature is available on Python >= 2.3.

- Fix: Ignore empty lines in pattern so that they do not
  match every string.
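
The three points above can be sketched roughly as follows. This is a minimal illustration in Python 3, not the code from the attached patch (which targets Mailman 2.1 on Python 2). The names `decode_value`, `normalize` and `header_matches` are hypothetical, and case-insensitive matching is an assumption.

```python
import re
import unicodedata
from email.header import decode_header, make_header


def decode_value(raw):
    """Decode RFC 2047 encoded words in a raw header value to text."""
    return str(make_header(decode_header(raw)))


def normalize(text):
    """Fold compatibility characters (e.g. fullwidth ASCII) using NFKC."""
    return unicodedata.normalize("NFKC", text)


def header_matches(raw_header, pattern_block):
    """Return True if any non-empty pattern line matches the header.

    Hypothetical helper: pattern_block holds one regular expression
    per line, and matching is assumed to be case-insensitive.
    """
    value = normalize(decode_value(raw_header))
    for line in pattern_block.splitlines():
        line = line.strip()
        if not line:
            # An empty regex matches every string, so skip it.
            continue
        if re.search(normalize(line), value, re.IGNORECASE):
            return True
    return False
```

Note that NFKC also folds fullwidth punctuation in the pattern itself, so a fullwidth asterisk in a rule would become a regex metacharacter; a real implementation may need to account for that.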

hatukanezumi (hatukanezumi-users-sf) wrote :

Error handling has been added.

hatukanezumi (hatukanezumi-users-sf) wrote :

The file mailman-2.1.5-unicode_headermatch.patch was added: for 2.1.5-release

Mark Sapiro (msapiro)
summary: - i18n: Header Filter Rules (& fix)
+ i18n: Header Filter Rules (& fix) - rules don't match if header
+ characters aren't representable in cset of list's preferred language.
Mark Sapiro (msapiro) wrote :

Portions of this patch, but not the Unicode normalization, have been applied or otherwise addressed in MM versions 2.1.6 through 2.1.15.

I intend to deal with the spirit of the rest by converting the headers to the cset of the list's preferred language using encode(errors='backslashreplace') instead of encode(errors='replace'). In this way, these characters will be converted to '\uxxxx' escapes rather than '?', and header_filter_rules patterns can be constructed to match them.
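
To illustrate the difference between the two error handlers, here is a small Python 3 sketch (Mailman 2.1 itself runs on Python 2, so this is not the committed code). The helper `to_list_charset`, the sample subject, and the pattern `CJK_ESCAPE` are assumptions made up for this example; the pattern only roughly covers code points U+4000 through U+9FFF.

```python
import re


def to_list_charset(value, charset, errors):
    """Round-trip text through the list's charset with the given error handler."""
    return value.encode(charset, errors).decode(charset)


subject = "\u4f60\u597d sale"  # two Chinese characters followed by ASCII text

# errors='replace' loses the characters: each becomes '?'.
lossy = to_list_charset(subject, "us-ascii", "replace")

# errors='backslashreplace' keeps them as literal \uxxxx escapes.
escaped = to_list_charset(subject, "us-ascii", "backslashreplace")

# A rule can then match the escapes; this rough pattern targets
# escapes for code points U+4000 through U+9FFF.
CJK_ESCAPE = re.compile(r"\\u[4-9][0-9a-f]{3}")

print(lossy)
print(escaped)
print(bool(CJK_ESCAPE.search(escaped)))
```

With `replace`, a rule has nothing but question marks to match against; with `backslashreplace`, the original code points remain visible to the regular expression.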

Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
importance: Undecided → Low
milestone: none → 2.1.23
status: New → In Progress
Mark Sapiro (msapiro) wrote :

The committed fix, together with prior changes, implements a few of the things in this patch. It does not do the Unicode normalization portion of this patch. My main goal was to make it possible to recognize Chinese spam by detecting Chinese characters in message headers.

I understand that the normalization can be important for actually matching specific things in subjects in, say, Japanese on a Japanese-language list. If that is still desired, please submit a new patch against the current code base.

Changed in mailman:
status: In Progress → Fix Committed
Mark Sapiro (msapiro) wrote :

I have committed another change at http://bazaar.launchpad.net/~mailman-coders/mailman/2.1/revision/1664 which does the conversion to Unicode and the Unicode normalization, so everything in this patch has now been committed, albeit in a somewhat different way. Refer to the NEWS item in rev 1664 for more details.

Mark Sapiro (msapiro)
Changed in mailman:
status: Fix Committed → Fix Released