Comment 8 for bug 1368737

Rafael David Tinoco (rafaeldtinoco) wrote : Re: Pacemaker can seg fault on crm node online/standby

Analyzing the stacktrace for stonithd:

(gdb) bt
#0 0x00007fed094febb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fed09501fc8 in __GI_abort () at abort.c:89
#2 0x00007fed0a15a6c9 in crm_abort (file=0x7fed0a17e4bb "logging.c",
    function=0x7fed0a17f790 <__PRETTY_FUNCTION__.22958> "crm_glib_handler", line=63,
    assert_condition=0x7fed0af9f2c0 "Source ID 21 was not found when attempting to remove it",
    do_core=<optimized out>, do_fork=<optimized out>) at utils.c:1118
#3 0x00007fed0920fae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4 0x00007fed0920fd72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007fed09207c5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6 0x00007fed09d23ef5 in stonith_action_clear_tracking_data (action=action@entry=0x7fed0afc6b00)
    at st_client.c:536
#7 0x00007fed09d23f2d in stonith_action_destroy (action=0x7fed0afc6b00) at st_client.c:557
#8 0x00007fed0a172cd9 in child_waitpid (child=child@entry=0x7fed0afded70, flags=flags@entry=1)
    at mainloop.c:948
#9 0x00007fed0a172fce in child_death_dispatch (signal=<optimized out>) at mainloop.c:962
#10 0x00007fed0a171de7 in crm_signal_dispatch (source=0x7fed0afb0920, callback=<optimized out>,
    userdata=<optimized out>) at mainloop.c:275
#11 0x00007fed09208e04 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#12 0x00007fed09209048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#13 0x00007fed0920930a in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#14 0x00007fed0a5bd2a9 in main (argc=<optimized out>, argv=<optimized out>) at main.c:1136

Based on this stack trace:

crm_glib_handler -> crm_abort -> abort

I found one upstream fix that addresses exactly this problem (pacemaker mailing list):

http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html

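It explains that this change in glib:

https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
(first seen in version 2.39.91; the Trusty version is 2.40.2-0ubuntu1)

caused g_source_remove() (frame #5 in the stack trace, part of libglib) to misbehave from pacemaker's point of view.
glib now uses a hash table lookup to find sources instead of an iterator, and it returns NULL
if the source was already destroyed. Removing a source ID that no longer exists therefore logs the
"Source ID ... was not found" message, which crm_glib_handler turns into the abort seen in frames #0 to #2.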

The corosync log reports the following errors on these occasions:

"""
lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to
record non-fatal assert at logging.c:73 : Source ID 51 was not found when
attempting to remove it
lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found
when attempting to remove it
"""

This happens because the same source (a timer) is removed from the mainloop twice,
which newer glib versions no longer tolerate silently.
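
To illustrate the double-removal problem, here is a minimal sketch in plain C with no glib dependency. Every toy_* name is hypothetical and only stands in for the mainloop source table and the stonith/lrmd action structure; this is not the upstream patch, just the general idea of guarding a removal by clearing the stored ID after its first use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_SOURCES 8

/* Hypothetical stand-in for the mainloop's set of live source IDs (0 = free slot). */
static unsigned int live_sources[MAX_SOURCES];
static unsigned int next_id = 1;

/* Register a new source and return its nonzero ID, or 0 if the table is full. */
static unsigned int toy_source_add(void)
{
    for (int i = 0; i < MAX_SOURCES; i++) {
        if (live_sources[i] == 0) {
            live_sources[i] = next_id;
            return next_id++;
        }
    }
    return 0;
}

/* Remove a source by ID. An unknown ID is reported as an error, mimicking
 * the message seen in the trace (assumption: simplified behaviour). */
static bool toy_source_remove(unsigned int id)
{
    if (id != 0) {
        for (int i = 0; i < MAX_SOURCES; i++) {
            if (live_sources[i] == id) {
                live_sources[i] = 0;
                return true;
            }
        }
    }
    fprintf(stderr, "Source ID %u was not found when attempting to remove it\n", id);
    return false;
}

/* Hypothetical action that owns one timer source. */
struct toy_action {
    unsigned int timer_id;
};

/* Guarded cleanup: remove the timer at most once, then forget its ID so a
 * second cleanup call becomes a no-op instead of a failed removal. */
static bool toy_action_clear(struct toy_action *action)
{
    bool ok = true;

    if (action->timer_id != 0) {
        ok = toy_source_remove(action->timer_id);
        action->timer_id = 0;
    }
    return ok;
}

/* Usage: the guarded path survives two cleanups, the unguarded one does not. */
static void toy_demo(void)
{
    struct toy_action action = { .timer_id = toy_source_add() };
    unsigned int stale_id = action.timer_id;

    assert(toy_action_clear(&action));     /* first cleanup removes the timer */
    assert(toy_action_clear(&action));     /* second cleanup is a no-op */
    assert(!toy_source_remove(stale_id));  /* removing the stale ID again fails */
}
```

Calling toy_demo() from a main() exercises both paths: the guarded cleanup can run twice without complaint, while removing the stale ID directly fails and prints the "was not found" message to stderr.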

The following upstream fix handles this problem:

From 568e41db929a34106c8c2ff7c48716ab5c13ef49 Mon Sep 17 00:00:00 2001
From: Andrew Beekhof <email address hidden>
Date: Mon, 13 Oct 2014 13:30:58 +1100
Subject: [PATCH] Fix: lrmd: Prevent glib assert triggered by timers being removed from mainloop more than once

I'll be providing a PPA (soon) with this fix so I can get users/community feedback on the resolution.

Thank you

Rafael Tinoco