Leap2A import problem: "simplexml_load_file()... parser error : PCDATA invalid Char value..."

Bug #1482410 reported by Aaron Wells
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
Mahara    Fix Released   Medium       Aaron Wells
15.04     Fix Released   Medium       Unassigned
15.10     Fix Released   Medium       Aaron Wells

Bug Description

We had a report of a Mahara-generated Leap2a file that caused this crash stack upon attempting to import it:

[WAR] 38 (import/leap/lib.php:126) simplexml_load_file(): /home/aaronw/dataroot/mahara/temp/import/admin-1438900425/extract/leap2a.xml:1808: parser error : PCDATA invalid Char value 11
Call stack (most recent first):

    log_message("simplexml_load_file(): /home/aaronw/dataroot/mahar...", 8, true, true, "/home/aaronw/www/mahara/htdocs/import/leap/lib.php", 126) at /home/aaronw/www/mahara/htdocs/lib/errors.php:441
    error(2, "simplexml_load_file(): /home/aaronw/dataroot/mahar...", "/home/aaronw/www/mahara/htdocs/import/leap/lib.php", 126, array(size 2)) at Unknown:0
    simplexml_load_file("/home/aaronw/dataroot/mahara/temp/import/admin-143...", "SimpleXMLElement", 67584) at /home/aaronw/www/mahara/htdocs/import/leap/lib.php:126
    PluginImportLeap->read_leap2a_xml_file() at /home/aaronw/www/mahara/htdocs/import/leap/lib.php:147
    PluginImportLeap->build_default_load_mapping() at /home/aaronw/www/mahara/htdocs/import/leap/lib.php:164
    PluginImportLeap->process(1) at /home/aaronw/www/mahara/htdocs/import/index.php:245
    import_submit(object(Pieform), array(size 3)) at Unknown:0
    call_user_func_array("import_submit", array(size 2)) at /home/aaronw/www/mahara/htdocs/lib/pieforms/pieform.php:537
    Pieform->__construct(array(size 6)) at /home/aaronw/www/mahara/htdocs/lib/pieforms/pieform.php:164
    Pieform::process(array(size 6)) at /home/aaronw/www/mahara/htdocs/lib/pieforms/pieform.php:71
    pieform(array(size 6)) at /home/aaronw/www/mahara/htdocs/import/index.php:171
    print_upload_form() at /home/aaronw/www/mahara/htdocs/import/index.php:61

Upon investigation it turned out that the Leap2a XML file had a Vertical Tab character (ASCII 0x0B, decimal 11, matching the "Char value 11" in the parser error) in one of the page titles. There is a whole range of ASCII control characters that cause a parser error in SimpleXML. If any of them is placed in a Mahara page title, it is written unchanged into the exported Leap2a file, and Mahara then crashes when it attempts to import that file.

Tags: leap2a xml
Aaron Wells (u-aaronw) wrote :

Using a for-loop and the PHP chr() function, I individually tested each ASCII character in the middle of an otherwise acceptable XML file. I tested each one plain, after passing it through htmlspecialchars(), and after passing it through htmlentities(). Here is the list of decimal codes for the ASCII characters that cause SimpleXML to choke. None of them are escaped by htmlspecialchars() or htmlentities().

$baddies = array(0,1,2,3,4,5,6,7,8,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31);
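
For anyone who wants to repeat the test, here is a minimal sketch of that kind of loop. It is not the script I originally ran: find_bad_xml_chars() is a hypothetical helper, it uses simplexml_load_string() on an in-memory document rather than a file, and it only covers the htmlspecialchars() variant.

```php
<?php
// Sketch only: find which single ASCII characters make SimpleXML reject a document.
// find_bad_xml_chars() is a hypothetical helper, not part of Mahara.
function find_bad_xml_chars() {
    // Collect libxml errors quietly instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $baddies = array();
    for ($i = 0; $i < 128; $i++) {
        // Escape markup characters such as "<" and "&" so that only
        // characters that survive escaping can trip the parser.
        $text = htmlspecialchars(chr($i), ENT_QUOTES, 'UTF-8');
        $xml = '<?xml version="1.0" encoding="UTF-8"?><bad>a' . $text . 'b</bad>';
        if (simplexml_load_string($xml) === false) {
            $baddies[] = $i;
        }
        libxml_clear_errors();
    }
    // Restore whatever error handling mode was active before.
    libxml_use_internal_errors($previous);
    return $baddies;
}

// Print the decimal codes of the characters that broke parsing.
echo implode(',', find_bad_xml_chars()), "\n";
```

The loop stops at 127 because a lone byte above that is not valid UTF-8 on its own and would fail for a different reason.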

To help with testing (because it's not always easy to generate these characters), I've attached a file, controlcharacters.txt, which contains all 29 of these characters in between <bad> tags. Depending on your text editor, you may see only "<bad></bad>" when you open it. But if you select the whole thing and paste it into a Mahara page title, you should be able to replicate the problem.

To replicate:

1. Create a Mahara page with one or more of the forbidden characters in its page title
2. Export the page to Leap2a
3. Import the Leap2a file back into Mahara

Expected result: You've imported a copy of the page
Actual result: You get an error stack with the SimpleXML parser error as part of it.

Aaron Wells (u-aaronw) wrote :

On further research I've decided to be more thorough and just whitelist the allowed characters in XML, listed here: https://en.wikipedia.org/wiki/Valid_characters_in_XML

I wound up using preg_replace() with the "/u" modifier to make it Unicode-safe. The downside to this is that we read the entire file into memory and then do preg_replace on it, but that shouldn't use too much more memory, because we're already reading the entire file into memory in order to use simplexml.
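
A minimal sketch of the whitelist approach, assuming the character ranges from the XML 1.0 "Char" production listed on that page. The helper name strip_invalid_xml_chars() and the exact pattern are illustrative and are not copied from the patch.

```php
<?php
// Hypothetical helper; the name and exact pattern are assumptions, not Mahara's code.
function strip_invalid_xml_chars($text) {
    // The character class is the set of characters XML 1.0 allows;
    // the leading ^ makes the pattern match anything outside that set.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $text);
    // With the /u modifier, preg_replace() returns null if $text is not valid UTF-8.
    return ($clean === null) ? false : $clean;
}

// A title containing a vertical tab (0x0B), as in the reported file.
$raw = "<page><title>My\x0Bpage</title></page>";
$xml = simplexml_load_string(strip_invalid_xml_chars($raw));
echo (string) $xml->title, "\n"; // prints "Mypage"
```

In a real import the input would come from something like file_get_contents() on the Leap2a file, and a false return would need to be handled before parsing.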

I also discovered that htmlentities() can get rid of these invalid characters if you pass it the flags ENT_XML1 | ENT_DISALLOWED. But those flags were only added in PHP 5.4, and we still aim to support PHP 5.3. Plus, the best they can do is replace each invalid character with the Unicode replacement character (U+FFFD), which will display as an unprintable character. So it's still better to just remove them entirely.
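
For comparison, a small sketch of the htmlentities() alternative, assuming PHP 5.4 or later. The PHP manual documents ENT_DISALLOWED as substituting the Unicode replacement character (U+FFFD) for code points that are invalid in the chosen document type. The sample title is made up for illustration.

```php
<?php
// ENT_XML1 and ENT_DISALLOWED require PHP 5.4 or later.
$title = "My\x0Bpage"; // illustrative title containing a vertical tab
$escaped = htmlentities($title, ENT_XML1 | ENT_DISALLOWED, 'UTF-8');

// Parse quietly so a failure shows up as false rather than as warnings.
$previous = libxml_use_internal_errors(true);
$xml = simplexml_load_string('<title>' . $escaped . '</title>');
libxml_use_internal_errors($previous);

var_dump(strpos($escaped, "\x0B") === false); // whether the vertical tab is gone
var_dump($xml !== false);                      // whether the result parses
```

The title still ends up with a substitute character in it, which is why removing the offending characters outright looks cleaner.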

Aaron Wells (u-aaronw) wrote :

The patches for this had the wrong bug number on them.

Patch for master (15.10dev): https://reviews.mahara.org/#/c/5061/
Patch for 15.04: https://reviews.mahara.org/#/c/5080/

Changed in mahara:
status: In Progress → Fix Committed
Robert Lyon (robertl-9)
Changed in mahara:
status: Fix Committed → Fix Released