xmlns attribute issue with XML parser with lxml 4.4.0

Bug #1840141 reported by Florian Schieder
Affects                  Status        Importance  Assigned to       Milestone
Beautiful Soup           Fix Released  Undecided   Unassigned
beautifulsoup4 (Ubuntu)  Fix Released  High        Andreas Hasenack

Bug Description

I imported an SVG file with the BeautifulSoup XML parser in order to adjust some attribute values, which worked just fine. But when I opened the processed SVG in Firefox, it complained about a "not well-formed XML/SVG file". The reason: after processing with BeautifulSoup, the <svg> root element contained an attribute named "xmlns:" instead of "xmlns". This already occurs when calling

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(open('file.svg'), 'xml')

in my code (tested in the interactive Python shell).
There seems to be a bug in the XML parser that appends a colon to the xmlns attribute for some reason. In my opinion this can be reproduced by parsing any SVG file that contains an xmlns attribute with the XML parser.
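
A self-contained reproduction along these lines shows the effect (the SVG markup below is a minimal made-up example, not the actual file):

from bs4 import BeautifulSoup

svg = '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"/>'
soup = BeautifulSoup(svg, 'xml')
print(soup.svg.attrs)
# Expected key: 'xmlns'. With lxml 4.4.0 the key comes out as 'xmlns:' instead,
# and the serialized document is no longer well-formed XML.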


Revision history for this message
Isaac Muse (facelessuser) wrote :

This looks to be due to a change in the latest lxml, 4.4.0. If you downgrade lxml to 4.3.5, the problem goes away.

So this is either a bug in the latest lxml, or we need to adjust to how the new lxml 4.4.0 does things. This will require some more investigation.
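
For anyone checking whether they are affected, the installed lxml version can be read directly from lxml itself:

from lxml import etree

# lxml reports its own version; 4.4.0 and later exhibit the behaviour described
# in this report, while 4.3.5 does not.
print(etree.__version__)
print(etree.LXML_VERSION)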

Revision history for this message
Isaac Muse (facelessuser) wrote :

After some investigation, I think this is a bug in BeautifulSoup that is caused by changes in the latest lxml 4.4.0.

I'll have to dig into the BeautifulSoup side, but I imagine it is unconditionally joining the attribute prefix with the attribute name. I suspect that before, if an element specified an attribute as xmlns, lxml treated it as an attribute name, but now it is recognized as a prefix and reported as such. This leaves us with a namespace prefix and no attribute name.

I imagine that BeautifulSoup processes the information just as it is given: it stores the prefix and then nothing for the name. On output, the prefix and the empty name are joined with `:`, and the output is malformed. There should probably be a conditional that checks whether the attribute name is empty and, if so, does *not* join the two and instead just outputs the prefix. Either that, or if BeautifulSoup sees a prefix with no name, treat the prefix as the name; I personally prefer the former over the latter as it makes more sense.
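
A toy sketch of the joining behaviour described here (the helper below is illustrative only, not bs4's actual code):

def join_attribute(prefix, name):
    # The prefix and name are joined unconditionally, so an empty name
    # still produces a trailing colon.
    return prefix + ':' + name if name is not None else prefix

print(join_attribute('xmlns', None))  # 'xmlns'  -- how older lxml reported a default namespace
print(join_attribute('xmlns', ''))    # 'xmlns:' -- lxml 4.4.0 hands back an empty string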

Revision history for this message
Isaac Muse (facelessuser) wrote :

Looking into BeautifulSoup, it seems that when an attribute is stored, it checks whether the name is None; if it is, it stores the attribute properly, using the prefix alone as the key. But now lxml returns the name as an empty string, which causes bs4 to create a key of the form "prefix:". The best course of action seems to be to simply check "if name:" instead of "if name is None:". This restores sane logic and resolves this breaking bug.
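
A hedged sketch of that change (the function below mirrors the description; it is not bs4's actual code):

def attribute_key(prefix, name):
    # "if name:" treats both None and '' as "no attribute name", so the
    # prefix alone becomes the key and no stray colon is emitted.
    if name:
        return prefix + ':' + name if prefix else name
    return prefix

assert attribute_key('xmlns', None) == 'xmlns'
assert attribute_key('xmlns', '') == 'xmlns'        # the lxml 4.4.0 case
assert attribute_key('xmlns', 'xlink') == 'xmlns:xlink'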

Revision history for this message
Brian Murray (brian-murray) wrote :

This is also causing the following autopkgtest failures in both Ubuntu 19.10 and Debian:

======================================================================
FAIL: test_nested_namespaces (bs4.tests.test_lxml.LXMLXMLTreeBuilderSmokeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/bs4/testing.py", line 835, in test_nested_namespaces
    self.assertEqual(doc, soup.encode())
AssertionError: b'<?x[144 chars]xmlns="http://ns1/">\n<child xmlns="http://ns2[96 chars]ent>' != b'<?x[144 chars]xmlns:="http://ns1/">\n<child xmlns:="http://n[99 chars]ent>'

======================================================================
FAIL: A real XHTML document should come out *exactly* the same as it went in.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/bs4/testing.py", line 824, in test_real_xhtml_document
    soup.encode("utf-8"), markup)
AssertionError: b'<?x[108 chars]xmlns:="http://www.w3.org/1999/xhtml">\n<head>[56 chars]tml>' != b'<?x[108 chars]xmlns="http://www.w3.org/1999/xhtml">\n<head><[55 chars]tml>'

----------------------------------------------------------------------
Ran 489 tests in 0.463s

FAILED (failures=2)

Changed in beautifulsoup:
status: New → Confirmed
Changed in beautifulsoup4 (Ubuntu):
status: New → Triaged
importance: Undecided → High
tags: added: rls-ee-incoming
summary: - xmlns attribute issue with XML parser
+ xmlns attribute issue with XML parser with lxml 4.4.0
Revision history for this message
Leonard Richardson (leonardr) wrote :

I imagine this is a side effect of this change in lxml 4.4.0:

 When using Element.find*() with prefix-namespace mappings, the empty string is now accepted to define
 a default namespace, in addition to the previously supported None prefix. Empty strings are more
 convenient since they keep all prefix keys in a namespace dict as strings, which simplifies sorting etc.
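
For illustration, the quoted change affects calls like the following (this snippet assumes lxml 4.4.0 or later for the empty-string form):

from lxml import etree

root = etree.fromstring('<doc xmlns="http://ns1/"><child/></doc>')

# Mapping the default namespace: None was already accepted as a prefix key;
# lxml 4.4.0 additionally accepts the empty string.
print(root.find('child', namespaces={None: 'http://ns1/'}))
print(root.find('child', namespaces={'': 'http://ns1/'}))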

Revision history for this message
Leonard Richardson (leonardr) wrote :

Revision 524 includes a fix based on Isaac's pull request.

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup4 (Ubuntu):
assignee: nobody → Andreas Hasenack (ahasenack)
status: Triaged → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package beautifulsoup4 - 4.8.0-1ubuntu1

---------------
beautifulsoup4 (4.8.0-1ubuntu1) eoan; urgency=medium

  * d/p/fix-definition-default-xml-namespace.patch: fixed the definition of the
    default XML namespace with lxml 4.4 (LP: #1840141)

 -- Andreas Hasenack <email address hidden> Tue, 27 Aug 2019 14:36:51 -0300

Changed in beautifulsoup4 (Ubuntu):
status: In Progress → Fix Released
Changed in beautifulsoup:
status: Fix Committed → Fix Released