Encode.pm 2.54 breaks database functions (naco_normalize, maintain_control_numbers, others)

Bug #1242999 reported by Dan Scott
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
High
Unassigned
2.5
Fix Released
Undecided
Unassigned

Bug Description

* Evergreen master
* PostgreSQL 9.2.4
* Fedora 19

Fedora recently upgraded its version of the Perl Encode.pm module to 2.54. Presumably other distributions will begin to run into this problem as well.

After that upgrade, I noticed that the --load-all-sample flag for eg_db_config started to fail with the error:

ERROR: Cannot decode string with wide characters at /usr/lib64/perl5/vendor_perl/Encode.pm line 216.

... when trying to load the authority sample records.

It turns out that the behaviour of Encode.pm changed with respect to the UTF8 flag for strings when decode_utf8() is called - probably https://github.com/dankogai/p5-encode/pull/11.

Changing the decode_utf8() calls to NFD() -- per Tom Christiansen's cookbook recipe that you should normally normalize unicode input to decomposed format on the way in (http://www.perl.com/pub/2012/05/perlunicookbook-unicode-normalization.html) -- enables eg_db_config to succeed. We have to repeat this for the other affected pl/Perl functions.

However, after that, "authority_control_fields.pl --all" fails with:

Record # 133 : Exception: OpenSRF::EX::ERROR 2013-10-21T22:35:48 main /openils/bin/authority_control_fields.pl:486 System ERROR: Exception: OpenSRF::DomainObject::oilsMethodException 2013-10-21T22:35:48 OpenSRF::AppRequest /usr/local/share/perl5/OpenSRF/AppSession.pm:1086 <500> *** Call to [open-ils.search.authority.validate.tag.id_list] failed for session [1382409348.630998625.26807665569], thread trace [1]:
Exception: OpenSRF::EX::ERROR 2013-10-21T22:35:48 OpenSRF::Application /usr/local/share/perl5/OpenSRF/Application.pm:233 System ERROR: Exception: OpenSRF::DomainObject::oilsMethodException 2013-10-21T22:35:48 OpenSRF::AppRequest /usr/local/share/perl5/OpenSRF/AppSession.pm:1086 <500> *** Call to [open-ils.storage.authority.validate.tag.id_list] failed for session [1382409348.6344521302.2617860606], thread trace [1]:
Cannot decode string with wide characters at /usr/lib64/perl5/vendor_perl/Encode.pm line 216.

Wide character in warn at /usr/local/share/perl5/MARC/Charset.pm line 283.
no mapping found at position 4 in [シュプリンガー・フェアラーク東京], at /usr/local/share/perl5/MARC/Charset.pm line 283.

MARC::Charset is currently packaged at version 1.33 on Fedora, so it is possible a newer version mitigates the problem.

My work in progress branch, for what it's worth, is at http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/encode-changed-behaviour

Revision history for this message
Dan Scott (denials) wrote :

Oh hey, noticed that "prove -l lib t" in the perlmods directory turns up:

t/14-OpenILS-Utils.t .......................... 14/30
# Failed test 'clean_marc: diacritics'
# at t/14-OpenILS-Utils.t line 89.
# got: '&#xFFFD;&#xFFFD;&#xFFFD;&#xFFFD;&#xFFFD;&#xFFFD;'
# expected: '&#xE8;&#xF6;&#xE7;&#xC7;&#xC8;&#xC0;'
# Looks like you failed 1 test of 30.
t/14-OpenILS-Utils.t .......................... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/30 subtests

So it looks like the functions outside of pl/Perl get hit by this too. Great!

Revision history for this message
Dan Scott (denials) wrote :

Pushed another commit onto the branch that adds tests for OpenILS::Application::AppUtils::entityize() and which fixes the clean_marc() test. Interestingly, requesting the NFD form for entityize() failed horribly, so I've told it to ignore the 'form' flag. Perhaps this was never really tested in the past, in which case, huzzah for unit tests.

authority_control_fields.pl --all finishes successfully with the eg_db_config --load-all sample set now, and all tests pass, so there's that. Of course, this really needs to be tested with Ubuntu and Debian releases of various vintages...

Revision history for this message
Dan Scott (denials) wrote :

FWIW, currently Ubuntu Saucy has Encode.pm 2.51 (http://packages.ubuntu.com/saucy/libencode-perl), so unless someone is running Debian Jessie (which has 2.55 currently - http://packages.debian.org/jessie/libencode-perl) or Fedora, or has manually upgraded their version of Encode, they're unlikely to hit the changed decode_utf8() behaviour.

Revision history for this message
Dan Scott (denials) wrote :

Thanks to Jason Stephenson for testing; my focus had been on the failing t/14-OpenILS-Utils.t but t/01-Application-AppUtils.t started failing once the fixes were in. Adding "use utf8" to the unit test resolves that, as well; I've pushed that tiny commit onto the branch. This has now been tested on Fedora 19 and Ubuntu 12.04.

Revision history for this message
Dan Wells (dbw2) wrote :

Thanks for all your work on this, Dan. As best I can understand from the change to Encode.pm highlighted in the bug description, we were relying on a hole in Encode which allowed our calls to decode_utf8() to simply do nothing and return. We are now getting these warnings because our code was always trying to decode something already decoded, but now it *really* tries, and fails.

Can we solve this problem by simply removing the unnecessary decode_utf8() without adding all the extra normalization routines? I understand Tom's suggestion to normalize at application "boundaries", but I am not sure all of these functions are, in fact, boundaries. In any case, I don't think we should ever be doing:

NFC(NFD($xml));

as that takes more work to produce exactly the same results as:

NFC($xml);

doesn't it?

I know we have vacillated a couple times in the past, but I think we settled on using NFC for Evergreen internals (going by memory; it might have been NFD). I don't think we are well served by decomposing and recomposing the data in so many places. We should (in my opinion) be doing NFC() at actual possible points of UTF8 data input and be done with it.

Revision history for this message
Dan Scott (denials) wrote : Re: [Bug 1242999] Re: Encode.pm 2.54 breaks database functions (naco_normalize, maintain_control_numbers, others)

Right, NFC(NFD()) is the artifact of late night silliness.

Dan - I take it you're going to post an alternative branch?

Revision history for this message
Dan Wells (dbw2) wrote :

Sure, I can post an alternative branch if it will be helpful, though I am not sure if I'm that far along in having a bona fide solution. I am more trying to brainstorm and make sure we collectively understand (as best we can) what is going on under the hood. It certainly isn't my intention to try and take over on this, but just to provide an extra set of eyes and whatever few brain cycles I can muster. I know for me personally, working on bugs like this can be difficult to wrap my head around, since there are many moving parts, particular at layers which are hard to see (since they are, by design, at a level below that which is generally displayed).

One issue in particular which concerns me is figuring out how dependent we might be on the old, broken, but "magic" behavior of decode_utf8(). I'll try to do some poking tomorrow with just the decode_utf8() calls removed, see what I can break, and report back if I learn anything of value. I will also post a branch at that point provided my tests don't fail completely :)

Revision history for this message
Dan Scott (denials) wrote :

Dan: Can I suggest building on my branch then, given that all of the actual unit tests & the test I added are currently passing on both Fedora 19 and Ubuntu Precise? We can always squash down the commits if it turns out that the NFD() / NFC() calls are unnecessary or whatever, but I doubt that we'll be want to remove the "use utf8;" bits from the unit tests or the added unit test.

Revision history for this message
Dan Scott (denials) wrote :

I went ahead and pushed another commit onto the branch that removes the NFC() / NFD() calls from the code I touched. It works.

Note that the remaining decode_utf8() calls at this point are in SIP and SuperCat; it's entirely possible that those are still required because the input is coming from outside, but hey, I haven't tested them either way.

Revision history for this message
Dan Wells (dbw2) wrote :

Dan, thanks for pushing the new commits. The simplified code in your latest branch is working for me so far. I have pushed a collab branch with two additional commits. From the commit messages:

1) Change NFD test to look for decomposed characters
If data is NFD, we should expect diacritics to be separate from the base characters, so let's update the test to look for valid NFD data.

2) Restore decomposed support in entityize()
Now that we have updated our test check, let's restore the missing code bits to pass the test.

One other thing I noticed is that we ultimately lost an NFC() call in both maintain_control_numbers() and maintain_901() when compared to current master. Overall, I believe removing those is a good thing, as these functions have no direct user input. However, we did lose a possible benefit, as these functions appear to have had a side effect of NFC normalizing any records inserted directly into the DB.

While in most cases overkill, I am in favor of ensuring NFC normalization on direct record INSERT, but I also think we should be more intentional and explicit about it. Rather than adding more Perl trigger overhead, perhaps we could rename maintain_901 to maintain_901_and_nfc(), and re-add the NFC call only there? What do others think? Or is it better to simply say "if you insert to the DB, normalization is your responsibility"? (Of course, if we did that, we would have to make sure we follow our own advise ;) )

Revision history for this message
Dan Wells (dbw2) wrote :
Revision history for this message
Dan Wells (dbw2) wrote :

Sorry again, (need to slow down!):

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dbwells/encode-changed-behaviour

working/collab/dbwells/encode-changed-behaviour

Revision history for this message
Dan Scott (denials) wrote :

Dan, your commits look great. I'm worried though that this bug is getting away from "Make it possible to install and run Evergreen on systems with current Encode.pm". We are absolutely broken today on an environment that will become common in the relatively near future, and it would be good to get that back up and running.

I absolutely do think we should focus on the refinement of all of the Unicode normalization, but I think that can and should follow "let's get master working again" in a separate but linked bug.

As well, in IRC a few days back Galen mentioned that we should put pgTAP to work for this problem. I think this makes sense, so that we can ensure that the records as they exist in biblio.record_entry.marc are unchanged by this change, at the very least.

Looks like the pgTAP tests are in Open-ILS/src/sql/Pg/make-pgtap-tests.pl. I think we should grab an MD5 hash of some of the more "interesting" (as in, lots of non-ASCII character) records as they are imported in a current Ubuntu Precise environment with current Evergreen master and older Encode.pm, so that we can compare that hash with the results of importing with this current branch, as well as importing with this current branch on a system with an updated Encode.pm.

Once we have a working master branch that does not change any results as confirmed by tests in our test suite, then I think we should commit that work to master (and backport to 2.5 perhaps), then look at potentially changing behaviour.

Revision history for this message
Dan Wells (dbw2) wrote :

Dan, I agree 100% with your assessment. The only thing I would add is that my NFC() related suggestion wasn't meant to change current behavior, but in fact the opposite, to preserve it. Current master already has an NFC() call in maintain_901(), and this effectively means NFD data is prevented at the trigger level. Through the commits on this bug, that NFC call got removed, so I am only recommending we add it back where it is in master, but also rename the function so future editors of this code know it was there intentionally (since it isn't necessary for simply maintaining the 901).

Maybe that was all understood, but I just wanted to be clear that my intention is to make fewer changes. I think of renaming the function more as "documentation", but if that's at all getting in the way, I will simply drop that part of the suggestion.

This would probably be a lot clearer if I just made a squashed branch so we can all more easily see what is really changing and what isn't. I will post one in a bit.

Revision history for this message
Dan Wells (dbw2) wrote :

Branch squashed and rebased at:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/dbwells/encode-changed-behaviour-rebased-squashed

working/collab/dbwells/encode-changed-behaviour-rebased-squashed

Dan, I maintained your authorship, but I only kept my signoff in these commits, since they are well mixed, mingled, and mangled. If you diff this with the other branch, you will see the only change is retaining the NFC() calls which already exist in master. I was previously arguing to keep just one, but to better follow the principle of fewest changes to fix the current bug, I left both.

Revision history for this message
Dan Scott (denials) wrote :

Heh, I had tried to post the following comment an hour or two ago and found that Launchpad had failed for some reason - I think we're on the same wavelength:

"""
Dan - right, I understood what your commits were doing, that's why I said they look great. I'm no QA specialist, but I do want to focus on having automated, repeatable tests (that is, actually adding unit tests via pgTAP) so that we are able to prove that what we think is happening is actually happening, and also catch any unanticipated changes in future behaviour.

I understand the sentiment of renaming the function, but that means altering the basic database schema for documentation purposes. Would adding a COMMENT ON noting that the function also enforces NFC be sufficient instead?
"""

So I'll try to add a pgTAP test or two accordingly to your squashed branch.

Revision history for this message
Dan Scott (denials) wrote :

Okay, signed off on the two commits from your branch and pushed a pgTAP test case to http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/lp1242999_encode_regression_pgtap

Not sure if you've been through the pgTAP process before, but in summary you:

1. Download, make, and install pgTAP on your database server (download from http://pgxn.org/dist/pgtap/ - instructions at pgtap.org)
2. Create the pgTAP extension in your Evergreen database (e.g. "CREATE EXTENSION pgtap;")
3. Run the pgTAP unit & regression tests (e.g. "pg_prove -U evergreen Open-ILS/src/sql/Pg/t Open-ILS/src/sql/Pg/t/regress")

I've tested this on Fedora 19, now I have to give it a shot on Ubuntu Precise...

Revision history for this message
Dan Scott (denials) wrote :

Tested successfully on Ubuntu Precise. Could use more bib records. Different incoming normalizations.

Dan Wells (dbw2)
Changed in evergreen:
milestone: none → 2.5.0
importance: Undecided → High
status: New → Confirmed
Dan Wells (dbw2)
Changed in evergreen:
milestone: 2.5.0 → none
Dan Wells (dbw2)
Changed in evergreen:
assignee: nobody → Dan Wells (dbw2)
Revision history for this message
Dan Wells (dbw2) wrote :

Well, I've spent more time on this today than I'm willing to admit. Everything looks fine, and most of that time was spent in a battle to the death with pgtap (which, as you can guess from the existence of this post, I eventually won).

I was about to push when I noticed we have no upgrade script yet. I am happy to create one, but I am completely out of steam for today. I will continue working on this first thing tomorrow!

Changed in evergreen:
milestone: none → 2.6.0-alpha1
Revision history for this message
Dan Wells (dbw2) wrote :

Upgrade script created, tested, and signed-off, and everything is pushed to master. Since 2.5.2 is getting packaged tomorrow, and I'd like to see this get a bit wider exposure before including it, I am going to target it for backport at 2.5.3.

Should we also backport to 2.4.x? I'd vote yes, but I can also see an argument for not doing so.

Thanks, Dan, for your work on this!

Changed in evergreen:
assignee: Dan Wells (dbw2) → nobody
status: Confirmed → Fix Committed
Revision history for this message
Dan Wells (dbw2) wrote :

This is now backported through rel_2_4. Thanks again, Dan.

Changed in evergreen:
status: Fix Committed → Fix Released
no longer affects: evergreen/2.4
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.