Pacemaker (stonith) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it

Bug #1412962 reported by Rafael David Tinoco
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
pacemaker (Ubuntu)
Fix Released
Undecided
Rafael David Tinoco
Trusty
Fix Released
Undecided
Rafael David Tinoco
Utopic
Fix Released
Undecided
Rafael David Tinoco

Bug Description

[IMPACT]

  - Pacemaker seg fault (stonith and lrmd) because:
      - Newer glib versions uses hash_table to find GSources
      - Glib can try to assert source being removed multiple times

[TEST CASE]

  - Described by user

[REGRESSION POTENTIAL]

  - Based on small fixes made by upstream commits
  - User reports problem has been fixed

[OTHER INFO]

It was brought to my attention the following situation:

"""
lrmd process crashed when repeating "crm node standby" and "crm node online"
It was brought to my attention that pacemaker could seg fault (stonith) on some conditions. This problem
was brought to me when solving the following bug:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/

So you can check the problem here:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/34
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/35
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/36
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/37
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/38

And possible explanation here:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/39
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/comments/40

(Copy and pasting here):

So the cherry-pick (for version trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on a upstream commit) seems ok since it makes lrmd (services, services_linux) to avoid repeating a timer when the source was already removed from glib main loop context:

example:

+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;

etc...

This actually solved lrmd crashes I was getting with the testcase (explained inside this bug summary).

===
Explanation:
g_source_remove -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
===

Analyzing your crash file (from stonith and not lrm), it looks like we have the following scenario:

==============

exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
    |_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> call g_resource_remove 2 times
        |_> stonith_action_clear_tracking_data(action);
            |_> g_source_remove(action->timer_sigterm);
                |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);

WHERE
==============

Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy
"Helper that presents a RHCS-style interface for Linux-HA stonith plugins"

This is the script responsible to monitor a stonith resource and it has returned (triggering monitor callback) with the following data on it:

------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------

OBS: This means that fence_legacy returned, after checking that
st_kjpnode2 was ok, and its cleanup operation (callback) caused
the problem we faced.

As soon as it dies, the callback for this process is called:

    if (child->callback) {
        child->callback(child, child->pid, core, signo, exitcode);

In our case, callback is:

0x7f1f6189cec0 <stonith_action_async_done> which calls
0x7f1f6189af10 <stonith_action_destroy> and then
0x7f1f6189ae60 <stonith_action_clear_tracking_data> generating the 2nd removal (g_source_remove)

with the 2nd call to g_source_remove, after glib2.0 change explained before this comment, we get a

g_critical ("Source ID %u was not found when attempting to remove it", tag);

and this generates the crash (since g_glob is called with a critical log_level causing crm_abort to be called).

POSSIBLE CAUSE:
==============

Under <stonith_action_async_done> we have:

stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
    }

Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50.
and a call to: stonith_action_clear_tracking_data(action);

Under stonith_action_clear_tracking_data(stonith_action_t * action) we have AGAIN:

stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;
    }

This logic probably triggered the same problem the cherry pick addressed for lrmd, but now for stonith (calling g_source_remove 2 times for the same source after glib2.0 was changed).

##############

commit 0326f05c9e26f39a394fa30830e31a76306f49c7
Author: Andrew Beekhof <email address hidden>
Date: Thu Aug 7 13:49:24 2014 +1000

    Fix: stonith-ng: Reset mainloop source IDs after removing them

diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
index 64bd8f3..2837682 100644
--- a/lib/fencing/st_client.c
+++ b/lib/fencing/st_client.c
@@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,

     if (action->timer_sigterm > 0) {
         g_source_remove(action->timer_sigterm);
+ action->timer_sigterm = 0;
     }
     if (action->timer_sigkill > 0) {
         g_source_remove(action->timer_sigkill);
+ action->timer_sigkill = 0;
     }

     if (action->last_timeout_signo) {

##############

under <stonith_action_async_done>.

Will provide you a hotfix with this fix and ask for feedback.

Changed in pacemaker (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
summary: - Stonith can seg fault in Trusty and Utopic after following message:
- Source ID XX was not found when attempting to remove it
+ Pacemaker (stonith) can seg fault in Trusty and Utopic after following
+ message: Source ID XX was not found when attempting to remove it
tags: added: cts
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Changed in pacemaker (Ubuntu):
status: New → In Progress
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Peter,

I have created one PPA to be tested:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1412962

# add-apt-repository ppa:inaddy/lp1412962
# apt-get update
# apt-get install pacemaker

The right package version, for now, will be:

1.1.10+git20130802-1ubuntu2.3~lp1412962~1 (for Trusty)

And they are going to be replaced by the appropriate version in case the
stable release update proposal is accepted into -updates repository.

Please provide me feedback regarding the fix (if it solved the problem
for you).

Thank you very much

Rafael Tinoco

Revision history for this message
Peter Parzer (peter-parzer) wrote :

Hi Rafael,

the PPA solved the stonith crashes on my testing system. I still get an lrmd crash (core attached) corresponding to the crm_abort error:

Jan 23 08:57:30 kjpnode1 lrmd[1363]: error: crm_abort: crm_glib_handler: Forked child 5341 to record non-fatal assert at logging.c:63 : Source ID 849 was not found when attempting to remove it

Maybe next week I get the possibility to test it on the production system.

Peter

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Peter,

Downloading the core dump to analyze it... tks for uploading it.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Peter,

Could you test version ~2 ? I'm uploading it to the PPA right now.
I backported some other fixes regarding the same issue.
Waiting on your feedback.
Thanks for reporting!!!

Rafael Tinoco

Revision history for this message
Peter Parzer (peter-parzer) wrote :

Hi Rafael,

with the last version I did not get any crashes. Also the error messages from crm_abort are gone. I tested it on both my testing cluster and the production system.

Many Thanks!

Peter

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Great, good to read that. I'll provide the SRU proposal then. Let me know here if you face any issue.

Thank you for reporting the problem and providing the cores so I could investigate.

;)

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Trusty and Utopic are affected and Precise is NOT.

 libglib2.0-0 | 2.24.0-0ubuntu4 | lucid | amd64, armel, i386, ia64, powerpc, sparc
 libglib2.0-0 | 2.24.1-0ubuntu2 | lucid-updates | amd64, armel, i386, ia64, powerpc, sparc
 libglib2.0-0 | 2.32.1-0ubuntu2 | precise | amd64, armel, armhf, i386, powerpc
 libglib2.0-0 | 2.32.4-0ubuntu1 | precise-updates | amd64, armel, armhf, i386, powerpc
 libglib2.0-0 | 2.40.0-2 | trusty | amd64, arm64, armhf, i386, powerpc, ppc64el
 libglib2.0-0 | 2.40.2-0ubuntu1 | trusty-updates | amd64, arm64, armhf, i386, powerpc, ppc64el
 libglib2.0-0 | 2.41.5-1 | ubuntu-rtm/14.09 | amd64, armhf, i386
 libglib2.0-0 | 2.42.0-2 | utopic | amd64, arm64, armhf, i386, powerpc, ppc64el
 libglib2.0-0 | 2.42.0-2 | vivid | amd64, arm64, armhf, i386, powerpc, ppc64el

since libglib2 started using the hashtable lookup in version 2.39.91 and precise is still
using libglib2 2.32.4.

Checking Vivid...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Attaching fix for Trusty

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Attaching fix for Utopic

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Attaching fix for Vivid (2 patches only, not 3)

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Subscribing sponsors-team. Asking sponsorship.

description: updated
Chris J Arges (arges)
Changed in pacemaker (Ubuntu Trusty):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in pacemaker (Ubuntu Utopic):
assignee: nobody → Rafael David Tinoco (inaddy)
Revision history for this message
Chris J Arges (arges) wrote :

Sponsored for Vivid/Utopic/Trusty.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.11-1ubuntu2

---------------
pacemaker (1.1.11-1ubuntu2) vivid; urgency=medium

  * Fix: stonith-ng: Reset mainloop source IDs after removing them (LP: #1412962)
  * Fix: services: Remove cancelled recurring ops from internal
    lists as early as possible (LP: #1412962)
 -- Rafael David Tinoco (Inaddy) <email address hidden> Wed, 04 Feb 2015 15:21:30 -0600

Changed in pacemaker (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Rafael, or anyone else affected,

Accepted pacemaker into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-4ubuntu3.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in pacemaker (Ubuntu Utopic):
status: New → Fix Committed
tags: added: verification-needed
Changed in pacemaker (Ubuntu Trusty):
status: New → Fix Committed
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Rafael, or anyone else affected,

Accepted pacemaker into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Peter,

Can you help verifying this package from -proposed ?

After installing the package from -proposed, could you check if your cluster is:

1) operational
2) resources are ok
3) fencing correctly
4) no more segfaults from lrmd and/or stonith

Waiting for feedback so we can change this to "verification-done".

Thank you

Rafael Tinoco

Revision history for this message
Peter Parzer (peter-parzer) wrote :

Rafael,

with trusty-proposed everything works fine on my testing system. Give me some days to check it also on our production system.

Peter

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Rafael,

Ran with this yesterday and overnight, problems seem to be resolved. Thanks!

- Billy

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Peter,

We are getting feedback from others saying that this indeed fixed their setups also.
Mind if I consider this "verification-done" since it would be good to push such a fix soon.

Tks

Tinoco

Revision history for this message
Peter Parzer (peter-parzer) wrote :

Rafael,

since the latest version from your PPA already solved the problem on my production system I think you could set this bug to "verification-done".

Peter

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Awesome Peter,

Thank you very much for reporting, testing and re-testing this for me.

Best Regards

Tinoco

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.10+git20130802-1ubuntu2.3

---------------
pacemaker (1.1.10+git20130802-1ubuntu2.3) trusty; urgency=medium

  * Fix: stonith-ng: Reset mainloop source IDs after removing them (LP: #1412962)
  * Fix: services: Fixes segfault associated with cancelling in-flight
    recurring operations (LP: #1412962)
  * Fix: services: Remove cancelled recurring ops from internal
    lists as early as possible (LP: #1412962)
 -- Rafael David Tinoco (Inaddy) <email address hidden> Wed, 04 Feb 2015 15:51:09 -0600

Changed in pacemaker (Ubuntu Trusty):
status: Fix Committed → Fix Released
Revision history for this message
Adam Conrad (adconrad) wrote : Update Released

The verification of the Stable Release Update for pacemaker has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.10+git20130802-4ubuntu3.2

---------------
pacemaker (1.1.10+git20130802-4ubuntu3.2) utopic; urgency=medium

  * Fix: stonith-ng: Reset mainloop source IDs after removing them (LP: #1412962)
  * Fix: services: Fixes segfault associated with cancelling in-flight
    recurring operations (LP: #1412962)
  * Fix: services: Remove cancelled recurring ops from internal
    lists as early as possible (LP: #1412962)
 -- Rafael David Tinoco (Inaddy) <email address hidden> Wed, 04 Feb 2015 15:44:28 -0600

Changed in pacemaker (Ubuntu Utopic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.