Comment 39 for bug 1368737

Rafael David Tinoco (rafaeldtinoco) wrote : Re: Pacemaker can seg fault on crm node online/standby

Okay,

So the cherry-pick (for version trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on an upstream commit) seems fine, since it makes lrmd (services, services_linux) avoid re-removing the repeat timer when its source was already removed from the glib main loop context:

example:

+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;

etc...

This actually solved the lrmd crashes I was getting with the test case (described in this bug's summary).

===
Explanation:
g_source_remove -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
===
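
To make the glib2.0 change concrete, here is a minimal standalone sketch (hypothetical code, not taken from the crash; the callback and tag are made up) that reproduces the critical message by removing the same source ID twice:

    /* Build: gcc demo.c $(pkg-config --cflags --libs glib-2.0) */
    #include <glib.h>

    static gboolean noop_cb(gpointer data)
    {
        return FALSE;           /* one-shot callback; never runs here */
    }

    int main(void)
    {
        guint tag = g_timeout_add(10, noop_cb, NULL);

        g_source_remove(tag);   /* 1st call: source found and removed */
        g_source_remove(tag);   /* 2nd call: source already gone; newer
                                 * glib logs "Source ID %u was not found
                                 * when attempting to remove it" at
                                 * critical level (older glib returned
                                 * FALSE quietly) */
        return 0;
    }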

Analyzing your crash file (which comes from stonith, not lrmd), it looks like we have the following scenario:

==============

exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
    |_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> calls g_source_remove twice
        |_> stonith_action_clear_tracking_data(action);
            |_> g_source_remove(action->timer_sigterm);
                |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);

WHERE
==============

Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy
"Helper that presents a RHCS-style interface for Linux-HA stonith plugins"

This is the script responsible for monitoring a stonith resource, and it returned (triggering the monitor
callback) with the following data:

------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------

Note: this means that fence_legacy returned, after checking that
st_kjpnode2 was OK, and its cleanup operation (callback) caused
the problem we faced.

As soon as it dies, the callback for this process is called:

    if (child->callback) {
        child->callback(child, child->pid, core, signo, exitcode);
    }

In our case, the callback is:

0x7f1f6189cec0 <stonith_action_async_done>, which calls
0x7f1f6189af10 <stonith_action_destroy>, and then
0x7f1f6189ae60 <stonith_action_clear_tracking_data>, generating the second removal (g_source_remove)

With the second call to g_source_remove, after the glib2.0 change explained earlier in this comment, we get a

g_critical ("Source ID %u was not found when attempting to remove it", tag);

and this generates the crash (since g_log is invoked with a critical log level, causing crm_abort to be called).
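
That critical message alone would not kill the daemon; the crash comes from the daemon's glib log handler treating criticals as fatal. A minimal sketch of that mechanism, assuming a Pacemaker-style handler (the handler below is a hypothetical stand-in for crm_glib_handler/crm_abort):

    #include <glib.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for Pacemaker's glib log handler:
     * escalate any critical-level message to an abort (core dump),
     * the way crm_glib_handler() ends up calling crm_abort(). */
    static void fatal_on_critical(const gchar *domain, GLogLevelFlags level,
                                  const gchar *message, gpointer user_data)
    {
        g_printerr("%s: %s\n", domain ? domain : "default", message);
        if (level & G_LOG_LEVEL_CRITICAL) {
            abort();
        }
    }

    int main(void)
    {
        g_log_set_default_handler(fatal_on_critical, NULL);

        /* Removing a nonexistent source ID triggers the critical
         * message, which the handler escalates to an abort -- the
         * same path this stonith crash followed. */
        g_source_remove(12345);
        return 0;
    }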

POSSIBLE CAUSE:
==============

Under <stonith_action_async_done> we have:

stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
    }

Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50
and a call to stonith_action_clear_tracking_data(action);

Under stonith_action_clear_tracking_data(stonith_action_t * action) we have AGAIN:

stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;
    }

This logic probably triggered the same problem the cherry-pick addressed for lrmd,
but now for stonith: g_source_remove is called twice for the same source after
the glib2.0 change.
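
If upstream confirms this, the fix would presumably mirror the lrmd cherry-pick: zero each timer tag right where its source is first removed, so the later cleanup in stonith_action_clear_tracking_data() sees 0 and skips the second g_source_remove(). A hypothetical sketch (not a confirmed upstream patch) of what stonith_action_async_done could do:

    /* Hypothetical fix sketch, mirroring the lrmd cherry-pick:
     * zero the tags after removal so the later call to
     * stonith_action_clear_tracking_data() skips its removals. */
    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
        action->timer_sigterm = 0;      /* mark as already removed */
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
        action->timer_sigkill = 0;      /* mark as already removed */
    }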

I'll try to discuss this with upstream and confirm that this is happening.

Thank you

Rafael Tinoco