xmlns attribute issue with XML parser with lxml 4.4.0

Bug #1840141 reported by Florian Schieder
Affects                  Status        Importance  Assigned to       Milestone
Beautiful Soup           Fix Released  Undecided   Unassigned
beautifulsoup4 (Ubuntu)  Fix Released  High        Andreas Hasenack

Bug Description

I imported an SVG file with the BeautifulSoup XML parser in order to adjust some attribute values, which worked just fine. But when I opened the processed SVG in Firefox, it complained about a "not well-formed XML/SVG file". The reason: after processing with BeautifulSoup, the <svg> root element contained an attribute named "xmlns:" instead of "xmlns". This already occurs when calling

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(open('file.svg'), 'xml')

in my code (tested in the interactive Python shell).
There seems to be a bug in the XML parser that appends a colon to the xmlns attribute for some reason. In my opinion this can be reproduced by parsing any SVG file that contains an xmlns attribute with the XML parser.
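
A self-contained reproduction along these lines shows the effect (the SVG markup below is a minimal made-up example, not the actual file):

from bs4 import BeautifulSoup

svg = '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"/>'
soup = BeautifulSoup(svg, 'xml')
print(soup.svg.attrs)
# Expected key: 'xmlns'. With lxml 4.4.0 the key comes out as 'xmlns:' instead,
# and the serialized document is no longer well-formed XML.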


Revision history for this message
Isaac Muse (facelessuser) wrote :

This looks to be due to a change in the latest lxml, 4.4.0. If you downgrade lxml to 4.3.5, the problem goes away.

So this is either a bug in the latest lxml, or we need to adjust to how the new lxml 4.4.0 does things. This will require some more investigation.
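
For anyone checking whether they are affected, the installed lxml version can be read directly from lxml itself:

from lxml import etree

# lxml reports its own version; 4.4.0 and later exhibit the behaviour described
# in this report, while 4.3.5 does not.
print(etree.__version__)
print(etree.LXML_VERSION)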

Revision history for this message
Isaac Muse (facelessuser) wrote :

After some investigation, I think this is a bug in BeautifulSoup that is caused by changes in the latest lxml 4.4.0.

I'll have to dig into the BeautifulSoup side, but I imagine it is unconditionally joining the attribute prefix with the attribute name. I suspect that before, if an element specified an attribute as xmlns, lxml treated it as an attribute name, but now it is recognized as a prefix and reported as such. This leaves us with a namespace prefix and no attribute name.

I imagine that BeautifulSoup processes the information just as it is given: it stores the prefix and then nothing for the name. On output, the prefix and the empty name are joined with `:`, and the output is malformed. There should probably be a conditional that checks whether the attribute name is empty and, if so, does *not* join the two and instead just outputs the prefix. Either that, or if BeautifulSoup sees a prefix with no name, treat the prefix as the name; I personally prefer the former over the latter as it makes more sense.
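
A toy sketch of the joining behaviour described here (the helper below is illustrative only, not bs4's actual code):

def join_attribute(prefix, name):
    # The prefix and name are joined unconditionally, so an empty name
    # still produces a trailing colon.
    return prefix + ':' + name if name is not None else prefix

print(join_attribute('xmlns', None))  # 'xmlns'  -- how older lxml reported a default namespace
print(join_attribute('xmlns', ''))    # 'xmlns:' -- lxml 4.4.0 hands back an empty string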

Revision history for this message
Isaac Muse (facelessuser) wrote :

Looking into BeautifulSoup, it seems that when an attribute is stored, it checks whether the name is None; if it is, it stores the attribute properly, using the prefix alone as the key. But now lxml returns the name as an empty string, which causes bs4 to create a key of the form "prefix:". The best course of action seems to be to simply check "if name:" instead of "if name is None:". This restores sane logic and resolves this breaking bug.
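
A hedged sketch of that change (the function below mirrors the description; it is not bs4's actual code):

def attribute_key(prefix, name):
    # "if name:" treats both None and '' as "no attribute name", so the
    # prefix alone becomes the key and no stray colon is emitted.
    if name:
        return prefix + ':' + name if prefix else name
    return prefix

assert attribute_key('xmlns', None) == 'xmlns'
assert attribute_key('xmlns', '') == 'xmlns'        # the lxml 4.4.0 case
assert attribute_key('xmlns', 'xlink') == 'xmlns:xlink'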

Revision history for this message
Brian Murray (brian-murray) wrote :

This is also causing the following autopkgtest failures in both Ubuntu 19.10 and Debian:

======================================================================
FAIL: test_nested_namespaces (bs4.tests.test_lxml.LXMLXMLTreeBuilderSmokeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/bs4/testing.py", line 835, in test_nested_namespaces
    self.assertEqual(doc, soup.encode())
AssertionError: b'<?x[144 chars]xmlns="http://ns1/">\n<child xmlns="http://ns2[96 chars]ent>' != b'<?x[144 chars]xmlns:="http://ns1/">\n<child xmlns:="http://n[99 chars]ent>'

======================================================================
FAIL: A real XHTML document should come out *exactly* the same as it went in.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/bs4/testing.py", line 824, in test_real_xhtml_document
    soup.encode("utf-8"), markup)
AssertionError: b'<?x[108 chars]xmlns:="http://www.w3.org/1999/xhtml">\n<head>[56 chars]tml>' != b'<?x[108 chars]xmlns="http://www.w3.org/1999/xhtml">\n<head><[55 chars]tml>'

----------------------------------------------------------------------
Ran 489 tests in 0.463s

FAILED (failures=2)

Changed in beautifulsoup:
status: New → Confirmed
Changed in beautifulsoup4 (Ubuntu):
status: New → Triaged
importance: Undecided → High
tags: added: rls-ee-incoming
summary: - xmlns attribute issue with XML parser
+ xmlns attribute issue with XML parser with lxml 4.4.0
Revision history for this message
Leonard Richardson (leonardr) wrote :

I imagine this is a side effect of this change in lxml 4.4.0:

 When using Element.find*() with prefix-namespace mappings, the empty string is now accepted to define
 a default namespace, in addition to the previously supported None prefix. Empty strings are more
 convenient since they keep all prefix keys in a namespace dict as strings, which simplifies sorting etc.
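
For illustration, the quoted change affects calls like the following (this snippet assumes lxml 4.4.0 or later for the empty-string form):

from lxml import etree

root = etree.fromstring('<doc xmlns="http://ns1/"><child/></doc>')

# Mapping the default namespace: None was already accepted as a prefix key;
# lxml 4.4.0 additionally accepts the empty string.
print(root.find('child', namespaces={None: 'http://ns1/'}))
print(root.find('child', namespaces={'': 'http://ns1/'}))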

Revision history for this message
Leonard Richardson (leonardr) wrote :

Revision 524 includes a fix based on Isaac's pull request.

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Changed in beautifulsoup4 (Ubuntu):
assignee: nobody → Andreas Hasenack (ahasenack)
status: Triaged → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package beautifulsoup4 - 4.8.0-1ubuntu1

---------------
beautifulsoup4 (4.8.0-1ubuntu1) eoan; urgency=medium

  * d/p/fix-definition-default-xml-namespace.patch: fixed the definition of the
    default XML namespace with lxml 4.4 (LP: #1840141)

 -- Andreas Hasenack <email address hidden> Tue, 27 Aug 2019 14:36:51 -0300

Changed in beautifulsoup4 (Ubuntu):
status: In Progress → Fix Released
Changed in beautifulsoup:
status: Fix Committed → Fix Released