Swift Erasure Code fails with liberasurecode 1.4.0 on CentOS

Bug #1707220 reported by Andy McCrae
This bug affects 2 people
Affects            Status        Importance  Assigned to  Milestone
OpenStack-Ansible  Won't Fix     Low         Andy McCrae
liberasurecode     Fix Released  Undecided   Unassigned

Bug Description

Our Swift gate tests are failing intermittently on CentOS 7 in the "cross policy write" tests, which exercise both cross-policy writes and Erasure Code (the second policy in testing is an EC policy). Sample gate failure: http://logs.openstack.org/25/485225/5/check/gate-openstack-ansible-os_swift-ansible-func-centos-7/8ad31e6/console.html#_2017-07-24_19_24_22_216603

Manually trying to upload objects to Swift shows the following:

(swift-untagged) [root@swift-storage1 /]# swift post -H "X-Storage-Policy: ec-tests" ec_cont
(swift-untagged) [root@swift-storage1 /]# swift upload ec_cont test.file
('Connection aborted.', BadStatusLine("''",))
(swift-untagged) [root@swift-storage1 /]# swift post non_ec_cont
(swift-untagged) [root@swift-storage1 /]# swift upload non_ec_cont test.file
test.file

The non-ec container upload works fine, whereas the erasure code upload fails.

The version of liberasurecode deployed is:
(swift-untagged) [root@swift-storage1 /]# rpm -qa | grep liberasurecode
liberasurecode-1.4.0-1.el7.x86_64
liberasurecode-devel-1.4.0-1.el7.x86_64

Updating to 1.5.0 works though:
[root@swift-storage1 /]# wget http://cbs.centos.org/kojifiles/packages/liberasurecode/1.5.0/1.el7/x86_64/liberasurecode-1.5.0-1.el7.x86_64.rpm
[root@swift-storage1 /]# wget http://cbs.centos.org/kojifiles/packages/liberasurecode/1.5.0/1.el7/x86_64/liberasurecode-devel-1.5.0-1.el7.x86_64.rpm
[root@swift-storage1 /]# rpm -U liberasurecode-devel-1.5.0-1.el7.x86_64.rpm liberasurecode-1.5.0-1.el7.x86_64.rpm
[root@swift-storage1 /]# rpm -qa | grep liberasure
liberasurecode-devel-1.5.0-1.el7.x86_64
liberasurecode-1.5.0-1.el7.x86_64

Now after restarting swift services, the upload succeeds:
(swift-untagged) [root@swift-storage1 /]# swift upload ec_cont test.file
test.file
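
For anyone who wants to script the same check, here is a minimal Python sketch that compares the installed liberasurecode version against 1.5.0. The function names are made up for illustration, and the 1.5.0 threshold is simply the version that worked in this report, not a documented minimum.

```python
import subprocess

# First version observed to work in this report (an assumption, not an official minimum).
FIXED_VERSION = (1, 5, 0)


def parse_version(text):
    """Turn a dotted version string such as '1.4.0' into (1, 4, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(installed, fixed=FIXED_VERSION):
    """Return True if the installed version is older than the fixed one."""
    return parse_version(installed) < fixed


def installed_liberasurecode_version():
    """Ask rpm for the installed version string (RPM-based hosts only)."""
    result = subprocess.run(
        ["rpm", "-q", "--queryformat", "%{VERSION}", "liberasurecode"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On an affected host, `needs_upgrade(installed_liberasurecode_version())` should be true before the update and false afterwards.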

========================

Tested against stable/ocata and Master for Swift.
For reference the CentOS7 kernel being used is:
[root@swift-cent openstack-ansible-os_swift]# uname -r
3.10.0-514.26.2.el7.x86_64

David Moreau Simard (dmsimard) wrote :
clayg (clay-gerrard) wrote :

The information needed to debug this is in the proxy log lines.

If newer liberasurecode fixes the issue - isn't this bug already "Fix Released"?

Changed in openstack-ansible:
status: New → Incomplete
status: Incomplete → New
clayg (clay-gerrard) wrote :

Sorry, I thought this was filed as a libec bug - I don't think I have anything helpful to contribute here - sorry.

Tim Burke (1-tim-z) wrote :

It sounds like the proxy worker died trying to service the request, and each time the parent daemon spawned a new one... all the "Removing dead child <pid>" messages like http://logs.openstack.org/25/485225/5/check/gate-openstack-ansible-os_swift-ansible-func-centos-7/8ad31e6/logs/openstack/swift-proxy/swift/proxy-error.log.txt.gz#_Jul_24_19_20_58 seem to confirm that.
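
A quick way to see how often workers are being reaped is to count those messages. This is only a rough sketch: it matches just the "Removing dead child <pid>" text quoted above and assumes nothing else about the log format, and the sample lines are hypothetical placeholders rather than real log output.

```python
import re

# Matches the "Removing dead child <pid>" message quoted above.
DEAD_CHILD = re.compile(r"Removing dead child (\d+)")


def dead_children(lines):
    """Return the PIDs of reaped workers, in the order they were logged."""
    pids = []
    for line in lines:
        match = DEAD_CHILD.search(line)
        if match:
            pids.append(int(match.group(1)))
    return pids


# Hypothetical input, made up for illustration only.
sample = [
    "Removing dead child 1234",
    "some unrelated log line",
    "Removing dead child 1240",
]
print(len(dead_children(sample)), "workers reaped")
```

A steadily growing count around each failed upload would be consistent with a worker crashing on every EC request.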

Are there any core dumps that get produced?

What's the config for the EC policy? ec_type / ec_num_data_fragments / ec_num_parity_fragments

Andy McCrae (andrew-mccrae) wrote :

Thanks for the response Tim - I know it's not technically a "libec" or swift issue as such, but it would be cool to debug it further (I'm pretty sure we ran into a similar situation last cycle).

Here is the section for the ec-tests storage policy:
[storage-policy:1]
name = ec-tests
policy_type = erasure_coding

ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 3
ec_num_parity_fragments = 2
ec_object_segment_size = 1048576
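
To spell out what those numbers mean, here is a small Python sketch that parses the section with the standard `configparser` module and derives the fragment count and storage overhead. The `ec_summary` helper and its key names are made up for illustration; the "tolerated losses" figure assumes the Reed-Solomon scheme can rebuild a segment from any 3 of the 5 fragments.

```python
import configparser

# The policy section from swift.conf, copied from above.
POLICY_SECTION = """
[storage-policy:1]
name = ec-tests
policy_type = erasure_coding

ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 3
ec_num_parity_fragments = 2
ec_object_segment_size = 1048576
"""


def ec_summary(conf_text, section="storage-policy:1"):
    """Derive basic figures from an EC storage policy section."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    policy = parser[section]
    k = policy.getint("ec_num_data_fragments")
    m = policy.getint("ec_num_parity_fragments")
    return {
        "ec_type": policy["ec_type"],
        "fragments_per_segment": k + m,   # data + parity fragments written
        "tolerated_losses": m,            # assumes any k fragments suffice
        "storage_overhead": (k + m) / k,  # bytes stored per byte of data
    }


print(ec_summary(POLICY_SECTION))
```

With 3 data and 2 parity fragments, each 1 MiB segment is written as 5 fragments, for roughly 1.67x storage overhead.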

A couple of things to note: on a "not working" install, I can update to liberasurecode-1.5.0 and it works fine (after restarting the services). However, newer installs seem to be working with 1.4.0, and this only seems to impact CentOS 7 builds (Swift settings are the same).

I've done a package comparison between a working build from http://logs.openstack.org/07/488507/1/check/gate-openstack-ansible-os_swift-ansible-func-centos-7/7e6277f/console.html#_2017-07-28_16_35_33_056669 for which I added some debug tasks, and a failed build I have:

[root@swift-storage1 ~]# diff good_rpms.txt bad_rpms.txt
60d59
< gpg-pubkey-e451e5b5-54c22d60

So I don't think there is an issue with different installed packages.

Here is a coredump (or at least the first 10 lines from the backtrace): http://paste.openstack.org/show/616921/

I can get more if that'd help! (It's on liberasurecode-1.4.0-1)

Tim Burke (1-tim-z) wrote :

Perfect, that's *exactly* what I needed!

> Program terminated with signal 4, Illegal instruction.

... with the backtrace landing right on a call to ceill -- looks like it matches the problem solved by https://github.com/openstack/liberasurecode/commit/960cdd0 and (more broadly) https://github.com/openstack/liberasurecode/commit/0962144 exactly!
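
For reference, "signal 4" is SIGILL on Linux, and a crash like this can be spotted from a parent process by looking at the child's return code. The sketch below is illustrative only: the child here just sends SIGILL to itself as a stand-in, whereas a real smoke test would run a small EC encode in the child instead.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means "killed by signal -returncode".
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


# Stand-in child that raises SIGILL on itself (hypothetical test payload).
child = [
    sys.executable, "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGILL)",
]
result = subprocess.run(child)
print(describe_exit(result.returncode))
```

Running the risky call in a child process means an illegal-instruction crash is reported rather than taking down the caller, which is roughly what the proxy's parent daemon was doing when it logged the dead children.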

I don't think Zaitcev ever made a liberasurecode bug for it, so I think I'll go ahead and associate this bug but mark it "Fix Released".

Changed in liberasurecode:
status: New → Fix Released
Andy McCrae (andrew-mccrae) wrote :

Sweet! Thanks Tim, that should be enough to get a version bump inside of RDO - @dmsimard thoughts? :)

David Moreau Simard (dmsimard) wrote :

Just cross referencing the Bugzilla on our end: https://bugzilla.redhat.com/show_bug.cgi?id=1468002

I'm sure we'll update it, it's just a matter of time.

Haïkel Guémar (hguemar) wrote :

Updates submitted in RDO repos: https://review.rdoproject.org/r/#/c/8045/
Please pay attention, as an upstream developer told us to be careful with this update; report any issues you find ASAP.

Changed in openstack-ansible:
assignee: nobody → Andy McCrae (andrew-mccrae)
status: New → In Progress
importance: Undecided → Low
David Moreau Simard (dmsimard) wrote :

We'll be able to update eclib to 1.5.0 in RDO once upstream has bumped upper-constraints to 1.5.0 for Ocata. Tim proposed the bump here: https://review.openstack.org/#/c/498521/

Jonathan Rosser (jrosser) wrote :

OpenStack-Ansible now only tests ceph on Ubuntu, not CentOS, so marking this as Won't Fix.

Changed in openstack-ansible:
status: In Progress → Won't Fix