corosync hangs inside libqb

Bug #1341496 reported by Tomasz Kontusz
This bug affects 16 people
Affects         Status        Importance  Assigned to  Milestone
libqb (Ubuntu)  Fix Released  Undecided   Kick In
  Trusty        Fix Released  Medium      Unassigned
  Utopic        Won't Fix     Medium      Unassigned

Bug Description

$ lsb_release -rd
Description: Ubuntu 14.04 LTS
Release: 14.04

$ apt-cache policy libqb0
libqb0:
  Installed: 0.16.0.real-1ubuntu3
  Candidate: 0.16.0.real-1ubuntu3
  Version table:
 *** 0.16.0.real-1ubuntu3 0
        500 http://archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages
        100 /var/lib/dpkg/status

Corosync sometimes hangs inside libqb. I've looked at a hung process with gdb, and I think I've found the problem.
The problem is the loop here: https://github.com/ClusterLabs/libqb/blob/v0.16.0/lib/ringbuffer.c#L451
This was fixed in 0.17.0, see: https://github.com/ClusterLabs/libqb/blob/v0.17.0/lib/ringbuffer.c#L451

I think bumping to 0.17.0 should fix this (at least in backports? Please?)

--------------------------------------------------------------------------
[Impact]

 * libqb does not currently handle ring buffer alloc errors properly. The
   result of this is corosync frequently ending up in an infinite loop
   (consuming 100% cpu) as it continuously tries and fails to allocate
   space from the ringbuffer due to erroneous logic when an attempt to
   reclaim space fails. This patch ensures that when the reclaim fails the
   libqb library gracefully errors out and allows corosync to proceed with
   execution.

 * This is fixed by cherry-picking the following 2 commits:
   - https://github.com/ClusterLabs/libqb/commit/00082df49f045053d03bba7713bfff35d2448459
   - https://github.com/ClusterLabs/libqb/commit/47c690dbbc75957ac2354844b8fbf0a9f4791a87

[Test Case]

There is a test case in comment #2.
A simple way I found to recreate the problem (I used juju to replicate):

1. Deploy a 2 node percona-cluster w/ corosync and pacemaker.
2. Scale the number of units from 2 to 5 nodes.
3. Observe that one of the corosync instances reaches 100% CPU usage and becomes stuck.

e.g.
juju bootstrap
# install percona-cluster
juju deploy -n 2 cs:trusty/percona-cluster
juju deploy cs:trusty/hacluster

# configure corosync to use unicast for communication
juju set hacluster corosync_transport=udpu

# configure the virtual ip for corosync
juju set percona-cluster vip=<your-vip>

# cause juju to configure the corosync/pacemaker configuration with percona-cluster.
juju add-relation percona-cluster hacluster

# wait for juju debug-log to go quiet.
# then expand the cluster by 3 nodes.
juju add-unit -n 3 percona-cluster
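
A rough way to check step 3 on each node is sketched below. The helper name high_cpu_pids and the 90% threshold are assumptions made for illustration and are not part of the original test case. It filters "PID %CPU" pairs, such as those printed by ps, and keeps the PIDs at or above the threshold.

```shell
#!/bin/sh
# Print the PID of every process whose %CPU is at or above a threshold.
# Reads "PID %CPU" pairs on stdin. The helper name and the default
# threshold of 90 are hypothetical, chosen only for this sketch.
high_cpu_pids() {
    awk -v limit="${1:-90}" '$2 + 0 >= limit + 0 { print $1 }'
}

# Usage on a cluster node:
#   ps -C corosync -o pid= -o pcpu= | high_cpu_pids 90
```

Note that ps reports CPU usage averaged over the lifetime of the process, so a corosync that only recently started spinning may take a while to cross the threshold; top gives a current sample instead.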

[Regression Potential]

 * As a result of the changes, this may cause a blackbox log entry to be
   dropped or it may cause a ring to be discarded and a new ring to be
   created.

   - If a log entry is dropped, some information may be missing from the
     blackbox used later for analysis. However, upstream has determined
     that missing a log entry is preferable to hanging the corosync
     process.

   - Rings are discarded as part of the normal corosync communication
     process, and corosync already knows how to properly handle this
     situation, so the risk is small in this area.

Claudiu Popescu (claudiu-popescu) wrote :

I had the same issue and eventually I ended up with: https://launchpad.net/~claudiu-popescu/+archive/ubuntu/ppa/+packages
I installed it on a testing cluster and it is working ok for now.

I strongly advise against using this directly in production since it is my first library built for Ubuntu.

Maybe someone will make an official release soon, since v0.16.0 is not usable in production environments.

Claudiu Popescu (claudiu-popescu) wrote :

How I was able to reproduce the bug:
1. Install and configure a postgresql with streaming replication
2. Install and configure pacemaker and corosync (controlling the postgres cluster)
3. When both servers are operational, something like:
* Node psql1:
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000000F4000090
    + pgsql-status : PRI
* Node psql2:
    + master-pgsql : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status : HS:sync
    + pgsql-xlog-loc : 00000000F6030D60

Reboot one of the servers, slave or master.
4. Run corosync-cmapctl; it will freeze and never return.

I was able to reproduce this every single time with libqb 0.16.0.
With libqb 0.17.0 I was not able to reproduce this scenario.
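
Since the symptom is a command that never returns, the check can be wrapped in a time limit so it does not block a terminal or a script. This is a minimal sketch, assuming GNU coreutils timeout is available; the helper name probe_hang and the 10-second limit are hypothetical.

```shell
#!/bin/sh
# Run a command with a time limit and report whether it returned.
# probe_hang is a hypothetical helper name. 124 is the exit status that
# GNU coreutils `timeout` uses when the time limit expires.
probe_hang() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    if [ "$?" -eq 124 ]; then
        echo "hung"
    else
        echo "returned"
    fi
}

# Usage on a cluster node after rebooting the peer:
#   probe_hang 10 corosync-cmapctl
```

"returned" only means the command finished within the limit, whatever its exit status, so a failing but responsive corosync-cmapctl is still reported as "returned".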

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libqb (Ubuntu):
status: New → Confirmed
Robie Basak (racb)
Changed in libqb (Ubuntu):
assignee: nobody → Kick In (kick-d)
Daniel Dehennin (launchpad-baby-gnu) wrote :

Is there a deadline for this fix?

Maybe as a backport of the 0.17.0?

Regards.

annunaki2k2 (russell-knighton) wrote :

Just a bump request for this bug, so hopefully it doesn't sting others who wish to run high availability services on Ubuntu!

Patrick Domack (patrickdk) wrote :

I just backported the vivid 0.17.0 version to trusty. It runs without issues, and seems to have corrected the problems.

Merritt Krakowitzer (merritt) wrote :

Thanks @patrickdk

your backported package resolved this issue for me.

Xiang Hui (xianghui) wrote :

Hi guys,

  I couldn't find libqb0 in trusty-backports. Can anyone point me to the link, or is it still in progress?

Thanks.

Robbie Williamson (robbiew) wrote :

Doesn't appear to be in backports, however there is a backported version for Trusty in this PPA:
https://launchpad.net/~claudiu-popescu/+archive/ubuntu/ppa/+packages

Billy Olsen (billy-olsen) wrote :

Attaching debdiff for SRU proposal. The package is the same between trusty and utopic, with the commits already part of the 0.17.1 release in vivid. Please let me know if there's any additional information necessary.

description: updated
Chris J Arges (arges)
Changed in libqb (Ubuntu Trusty):
importance: Undecided → Medium
Changed in libqb (Ubuntu Utopic):
importance: Undecided → Medium
Changed in libqb (Ubuntu Trusty):
status: New → In Progress
Changed in libqb (Ubuntu Utopic):
status: New → In Progress
Chris J Arges (arges) wrote :

Sponsored for Trusty.

Brian Murray (brian-murray) wrote : Please test proposed package

Hello Tomasz, or anyone else affected,

Accepted libqb into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libqb/0.16.0.real-1ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
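
When reporting verification results, it helps to confirm which libqb0 build is actually installed. The following is a sketch only: has_ringbuffer_fix is a hypothetical helper name, and the comparison is meaningful only on Trusty, where 0.16.0.real-1ubuntu4 is the build accepted into -proposed.

```shell
#!/bin/sh
# Succeed if the given libqb0 version is at least the fixed Trusty build.
# has_ringbuffer_fix is a hypothetical helper name used for this sketch.
FIXED_VERSION="0.16.0.real-1ubuntu4"

has_ringbuffer_fix() {
    dpkg --compare-versions "$1" ge "$FIXED_VERSION"
}

# Usage: read the installed version from dpkg and compare it.
#   installed=$(dpkg-query -W -f='${Version}' libqb0)
#   if has_ringbuffer_fix "$installed"; then
#       echo "libqb0 $installed includes the fix"
#   else
#       echo "libqb0 $installed is still affected"
#   fi
```

dpkg --compare-versions applies Debian version ordering, which is safer than comparing the strings directly.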

Changed in libqb (Ubuntu):
status: Confirmed → Fix Released
Changed in libqb (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Brian Murray (brian-murray) wrote :

I set the development task to Fix Released based off the comments that this works with 0.17 which is in Vivid.

Matt Rae (mattrae) wrote :

Marking verification-done for trusty based on reports of no longer seeing the corosync 100% cpu issue after applying this update

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libqb - 0.16.0.real-1ubuntu4

---------------
libqb (0.16.0.real-1ubuntu4) trusty; urgency=medium

  [ Billy Olsen ]
  * debian/patches/ringbuffer-reclaim-fix.patch: infinite loop when
    attempting to reclaim space in the ringbuffer fails. (LP: #1341496)

 -- Billy Olsen <email address hidden> Tue, 28 Apr 2015 12:03:15 -0500

Changed in libqb (Ubuntu Trusty):
status: Fix Committed → Fix Released
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for libqb has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Rolf Leggewie (r0lf) wrote :

utopic has seen the end of its life and is no longer receiving any updates. Marking the utopic task for this ticket as "Won't Fix".

Changed in libqb (Ubuntu Utopic):
status: In Progress → Won't Fix