Pacemaker (lrmd) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it

Bug #1368737 reported by Rafael David Tinoco
32
This bug affects 4 people
Affects Status Importance Assigned to Milestone
pacemaker (Ubuntu)
Fix Released
High
Rafael David Tinoco
Trusty
Fix Released
High
Unassigned
Utopic
Fix Released
High
Unassigned
Vivid
Fix Released
High
Rafael David Tinoco

Bug Description

[IMPACT]

  - Pacemaker seg fault on repeated crm node online/standy because:
      - Newer glib versions uses hash_table to find GSources
      - Glib can try to assert source being removed multiple times

[TEST CASE]

  - Using same configuration as attached cib.xml :

        #!/bin/bash

        while true; do
            crm node standby clustertrusty01
            sleep 7
            crm node online clustertrusty01
            sleep 7
            crm node standby clustertrusty02
            sleep 7
            crm node online clustertrusty02
            sleep 7
            crm node standby clustertrusty03
            sleep 7
            crm node online clustertrusty03
            sleep 7
        done

[REGRESSION POTENTIAL]

  - Based on upstream commit 568e41d
  - Test case ran for more than 7 hours with no problems

[OTHER INFO]

It was brought to my attention the following situation:

"""
[Issue]

lrmd process crashed when repeating "crm node standby" and "crm node online"

----------------
# grep pacemakerd ha-log.k1pm101 | grep core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1)
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1)
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1)
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1)
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1)
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1)
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1)
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1)
----------------

----------------
# grep pacemakerd ha-log.k1pm102 | grep core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1)
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1)
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1)
"""

Analyzing core file with dbgsyms I could see that:

#0 0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at services.c:434
434 crm_trace(" > stdout: %s", op->stdout_data);

Is responsible for the core.

I've checked upstream code and there might be 2 important commits that could be cherry-picked to fix this behavior:

commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
Author: Andrew Beekhof <email address hidden>
Date: Fri Sep 20 12:20:36 2013 +1000

    Fix: services: Prevent use-of-NULL when executing service actions

commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
Author: Gao,Yan <email address hidden>
Date: Sun Sep 29 12:40:18 2013 +0800

    Fix: services: Fix the executing of synchronous actions

The core can be caused by things such as this missing code:

if (op == NULL) {
crm_trace("No operation to execute");
return FALSE;

on the beginning of "services_action_sync(svc_action_t * op)" function.

And improved by commit #11473a5.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

There is already a Fix Released for Utopic:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1353473

And Trusty's fix is waiting to get released.

This way I'm working on the patch on the topic of another suggested SRU:

pacemaker (1.1.10+git20130802-1ubuntu3) trusty; urgency=medium

  * Fix: services: Do not allow duplicate recurring op entries - 1/3 (LP: #1353473)
  * High: lrmd: Merge duplicate recurring monitor operations - 2/3 (LP: #1353473)
  * Fix: lrmd: Cancel recurring operations before stop action is executed - 3/3 (LP: #1353473)

 -- Rafael David Tinoco <email address hidden> Wed, 06 Aug 2014 09:24:13 -0300

Revision history for this message
Nobuto Murata (nobuto) wrote :

Hmm, I can get a crash file even with the PPA. I will upload a crash file with retrace and procedure I tried. If you think it's an another issue, I will open a new bug.

Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Nobuto Murata (nobuto) wrote :

Also, I used an attached juju bundle to create the testbed.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in pacemaker (Ubuntu):
status: New → Confirmed
tags: added: cts
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Okay,

Finally I was able to reproduce the problem using:

#!/bin/bash

while true; do

 crm node standby clustertrusty01
 sleep 7
 crm node online clustertrusty01
 sleep 7

 crm node standby clustertrusty02
 sleep 7
 crm node online clustertrusty02
 sleep 7

 crm node standby clustertrusty03
 sleep 7
 crm node online clustertrusty03
 sleep 7

done

And the attached CIB (cib.xml).

 could generate the following crashes:

root@clustertrusty01:/var/crash# ls -lah -1 | grep usr
-rw-r----- 1 root root 257K Oct 30 01:18 _usr_lib_pacemaker_lrmd.0.crash
-rw-r----- 1 root root 434K Oct 30 01:18 _usr_lib_pacemaker_stonithd.0.crash

root@clustertrusty02:/var/crash# ls -lah -1 | grep usr
-rw-r----- 1 root root 257K Oct 30 01:20 _usr_lib_pacemaker_lrmd.0.crash
-rw-r----- 1 root root 441K Oct 30 01:21 _usr_lib_pacemaker_stonithd.0.crash

root@clustertrusty03:/var/crash# ls -lah -1 | grep usr
-rw-r----- 1 root root 258K Oct 30 01:21 _usr_lib_pacemaker_lrmd.0.crash
-rw-r----- 1 root root 434K Oct 30 01:18 _usr_lib_pacemaker_stonithd.0.crash

root@clustertrusty04:/var/crash# ls -lah -1 | grep usr
-rw-r----- 1 root root 443K Oct 30 01:21 _usr_lib_pacemaker_stonithd.0.crash

Changed in pacemaker (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (3.1 KiB)

Analyzing the stacktrace for stonithd:

(gdb) bt
#0 0x00007fed094febb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fed09501fc8 in __GI_abort () at abort.c:89
#2 0x00007fed0a15a6c9 in crm_abort (file=0x7fed0a17e4bb "logging.c",
    function=0x7fed0a17f790 <__PRETTY_FUNCTION__.22958> "crm_glib_handler", line=63,
    assert_condition=0x7fed0af9f2c0 "Source ID 21 was not found when attempting to remove it",
    do_core=<optimized out>, do_fork=<optimized out>) at utils.c:1118
#3 0x00007fed0920fae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4 0x00007fed0920fd72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007fed09207c5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6 0x00007fed09d23ef5 in stonith_action_clear_tracking_data (action=action@entry=0x7fed0afc6b00)
    at st_client.c:536
#7 0x00007fed09d23f2d in stonith_action_destroy (action=0x7fed0afc6b00) at st_client.c:557
#8 0x00007fed0a172cd9 in child_waitpid (child=child@entry=0x7fed0afded70, flags=flags@entry=1)
    at mainloop.c:948
#9 0x00007fed0a172fce in child_death_dispatch (signal=<optimized out>) at mainloop.c:962
#10 0x00007fed0a171de7 in crm_signal_dispatch (source=0x7fed0afb0920, callback=<optimized out>,
    userdata=<optimized out>) at mainloop.c:275
#11 0x00007fed09208e04 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#12 0x00007fed09209048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#13 0x00007fed0920930a in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#14 0x00007fed0a5bd2a9 in main (argc=<optimized out>, argv=<optimized out>) at main.c:1136

Based on this stack trace:

crm_glib_handler -> crm_abort -> abort

I could see one upstream fix that is exactly about this problem (pacemaker mailing list):

http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html

Explaining that this change (in glib):

https://github.com/GNOME/glib/commit/393503ba5bdc7c09cd46b716aaf3d2c63a6c7f9c
(seen first at version 2.39.91 - Trusty version is 2.40.2-0ubuntu1)

Caused g_source_remove() (frame #5 in the stacktrace, part of libglib) to misbehave.
(glib is using a hash table lookup to find sources, and not an iterator.. and it is also
returning NULL if source was destroyed)

corosync reports the following error on this occasions:

"""
lrmd[1632]: error: crm_abort: crm_glib_handler: Forked child 1840 to
record non-fatal assert at logging.c:73 : Source ID 51 was not found when
attempting to remove it
lrmd[1632]: crit: crm_glib_handler: GLib: Source ID 51 was not found
when attempting to remove it
"""

this is happening because one resource is being removed twice and this
can't be done with newer libglibs.

the following upstream fix handle this problem:

From 568e41db929a34106c8c2ff7c48716ab5c13ef49 Mon Sep 17 00:00:00 2001
From: Andrew Beekhof <email address hidden>
Date: Mon, 13 Oct 2014 13:30:58 +1100
Subject: [PATCH] Fix: lrmd: Prevent glib assert triggered by timers being removed from mainloop more than once

I'll be providing a PPA (soon) with this fix so I can get users/community feedback on the resolution.

Thank ...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Trusty is affected and Precise is NOT.

 libglib2.0-0 | 2.24.0-0ubuntu4 | lucid | amd64, armel, i386, ia64, powerpc, sparc
 libglib2.0-0 | 2.24.1-0ubuntu2 | lucid-updates | amd64, armel, i386, ia64, powerpc, sparc
 libglib2.0-0 | 2.32.1-0ubuntu2 | precise | amd64, armel, armhf, i386, powerpc
 libglib2.0-0 | 2.32.4-0ubuntu1 | precise-updates | amd64, armel, armhf, i386, powerpc
 libglib2.0-0 | 2.40.0-2 | trusty | amd64, arm64, armhf, i386, powerpc, ppc64el
 libglib2.0-0 | 2.40.2-0ubuntu1 | trusty-updates | amd64, arm64, armhf, i386, powerpc, ppc64el
 libglib2.0-0 | 2.41.5-1 | ubuntu-rtm/14.09 | amd64, armhf, i386
 libglib2.0-0 | 2.42.0-2 | utopic | amd64, arm64, armhf, i386, powerpc, ppc64el
 libglib2.0-0 | 2.42.0-2 | vivid | amd64, arm64, armhf, i386, powerpc, ppc64el

since libglib2 started using the hashtable lookup in version 2.39.91 and precise is still
using libglib2 2.32.4.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Running testcase for some time and couldn't get any core dump...

Services seem stable:

Every 1.0s: crm_mon -1 Fri Oct 31 00:52:57 2014

Last updated: Fri Oct 31 00:52:57 2014
Last change: Fri Oct 31 00:31:22 2014 via crm_attribute on clustertrusty04
Stack: corosync
Current DC: clustertrusty02 (12) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
6 Resources configured

Node clustertrusty02 (12): standby
Online: [ clustertrusty01 clustertrusty03 clustertrusty04 ]

 fenceclustertrusty01 (stonith:fence_virsh): Started clustertrusty04
 fenceclustertrusty02 (stonith:fence_virsh): Started clustertrusty03
 fenceclustertrusty03 (stonith:fence_virsh): Started clustertrusty01
 fenceclustertrusty04 (stonith:fence_virsh): Started clustertrusty01
 Resource Group: postfix
     vippostfix (ocf::heartbeat:IPaddr2): Started clustertrusty01
     initpostfix (lsb:postfix): Started clustertrusty01

At this time...

stonith_action_clear_tracking_data is calling g_source_remove and
there are no problems, even when trying to remove an already
removed timer.

Judging by the developer comments on that:

"""
The glib behaviour on unbuntu seems reasonable, removing a source multiple times IS a valid error.
I need the stack trace to know where/how this situation can occur in pacemaker.
"""

Those error messages from glib (not being able to remove the resource),
that are still there :

"""
Oct 31 00:30:20 [2054] clustertrusty03 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 2 197 to record non-fatal assert at logging.c:63 : Source ID 15 was not found when attempting to remove it
"""

Can be interpreted as normal and "non-fatal".

Changed in pacemaker (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I'm asking for sponsorship for this...

Meanwhile I have created one PPA to be used:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1368737

# add-apt-repository ppa:inaddy/lp1368737
# apt-get update
# apt-get install pacemaker

The right package version, for now, will be:

1.1.10+git20130802-1ubuntu2.2~lp1368737~1 (for Trusty)
1.1.10+git20130802-4ubuntu4~lp1368737~1 (for Utopic)

And they are going to be replaced by the appropriate version in case the
stable release update proposal is accepted into -updates repository.

Please provide me feedback regarding the fix (if it solved the problem
for you).

Thank you very much

Rafael Tinoco

description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Considering bug:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1382842

I'll have to fix dependencies together with this SRU.

Please hold while I fix this new debdiff (for this case), fixing lib dependencies for pacemaker to be upgraded.

Thank you

Rafael Tinoco

summary: - Pacemaker can seg fault on crm node online/standy
+ Pacemaker can seg fault on crm node online/standby
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote : Re: Pacemaker can seg fault on crm node online/standby

It looks like the format chosen for SRU for this package :

pacemaker (1.1.10+git20130802-1ubuntu2.1) trusty
pacemaker (1.1.10+git20130802-1ubuntu2) trusty
pacemaker (1.1.10+git20130802-1ubuntu1) saucy

makes dh helpers not to calculate shlibs version properly:

$ fakeroot dh_makeshlibs -a -V

$ find . | grep shlib
./debian/libcrmcommon3/DEBIAN/shlibs
./debian/libpengine4/DEBIAN/shlibs
./debian/libpe-status4/DEBIAN/shlibs
./debian/libcrmservice1/DEBIAN/shlibs
./debian/libtransitioner2/DEBIAN/shlibs
./debian/libcib3/DEBIAN/shlibs
./debian/libpe-rules2/DEBIAN/shlibs
./debian/libcrmcluster4/DEBIAN/shlibs
./debian/libstonithd2/DEBIAN/shlibs
./debian/liblrmd1/DEBIAN/shlibs

$ cat ./debian/libcib3/DEBIAN/shlibs
libcib 3 libcib3 (>= 1.1.10+git20130802)

When the requested version, considering the following line in debian/control:

Package: pacemaker
Architecture: any
Depends: ${shlibs:Depends}, ${misc:Depends}, ${python:Depends}, ...

Should be:
libcib 3 libcib3 (>= 1.1.10+git20130802-1ubuntu2.2).

I'm fixing this "problem" together with this BUG so next "pacemaker" upgrade can have its dependencies properly calculated.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

The way this package's versioning was made makes the tool "dh_makeshlibs" (debian helper) not to append proper suffix to dependencies (using (>= 1.1.10+git20130802) instead of (>= 1.1.10+git20130802-1ubuntu2.1) for example).

I changed debian/rules so the proper version is considered for dependencies:

        dh_makeshlibs -a -V
        dh_makeshlibs -plibcrmcommon3 -V "libcrmcommon3 (>= $(RELEASE))"
        dh_makeshlibs -plibpengine4 -V "libpengine4 (>= $(RELEASE))"
        dh_makeshlibs -plibpe-status4 -V "libpe-status4 (>= $(RELEASE))"
        dh_makeshlibs -plibcrmservice1 -V "libcrmservice1 (>= $(RELEASE))"
        dh_makeshlibs -plibtransitioner2 -V "libtransitioner2 (>= $(RELEASE))"
        dh_makeshlibs -plibcib3 -V "libcib3 (>= $(RELEASE))"
        dh_makeshlibs -plibpe-rules2 -V "libpe-rules2 (>= $(RELEASE))"
        dh_makeshlibs -plibcrmcluster4 -V "libcrmcluster4 (>= $(RELEASE))"
        dh_makeshlibs -plibstonithd2 -V "libstonithd2 (>= $(RELEASE))"
        dh_makeshlibs -pliblrmd1 -V "liblrmd1 (>= $(RELEASE))"

And I recommend further releases not to use this "versioning" scheme, to guarantee dh_makeshlibs to work properly.

Thank you

Rafael Tinoco

PS: attaching debdiffs in a few moments.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Trusty fix.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Utopic fix.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I recommend, if possible, Vivid to use 1.1.12 (from upstream) and to use a different versioning scheme.

Asking for sponsorship.

Thank you

Rafael Tinoco

James Page (james-page)
Changed in pacemaker (Ubuntu Vivid):
importance: Undecided → High
Changed in pacemaker (Ubuntu Utopic):
importance: Undecided → High
Changed in pacemaker (Ubuntu Trusty):
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.11-1ubuntu1

---------------
pacemaker (1.1.11-1ubuntu1) vivid; urgency=medium

  * Merge from Debian experimental, remaining changes:
    - d/control: Build-Depends on libcfg-dev.
    - Corosync's pacemaker plugin is disabled, hence not built:
      + d/libcrmcluster4-dev.install: Do not install plugin.h.
      + d/pacemaker.install,rules: Do not install pacemaker.lcrso.
    - Cherry picked fixes from upstream VCS:
      + Fix: services: Do not allow duplicate recurring op entries
      + High: lrmd: Merge duplicate recurring monitor operations
      + Fix: lrmd: Cancel recurring operations before stop action is executed
    - d/p/*: Refreshed.
  * Dropped changes, included in Debian:
    - d/control: Depend on libheartbeat2.
  * d/control: Ensure that pacemaker binary package uses matched binary
    versions of pacemaker libraries, avoiding upgrade problems (LP: #1382842).
  * d/p/Fix-lrmd-Prevent-glib-assert-triggered-by-timers-being.patch:
    Prevent glib assert triggered by timers being removed from mainloop more
    than once, cherry picked from upstream VCS (LP: #1368737).
  * d/control,compat,rules: Bump debhelper compat level to 9, drop use of
    hardening-wrapper.
  * d/control: Bumped Standards-Version, no changes.

pacemaker (1.1.11-1) experimental; urgency=medium

  * New upstream release
  * debian/patches/series: Disable 5 patches added for the last release
    as these are included in the upstream source now
  * debian/control: Bump Standards-Version
 -- James Page <email address hidden> Thu, 18 Dec 2014 10:26:17 +0000

Changed in pacemaker (Ubuntu Vivid):
status: In Progress → Fix Released
Revision history for this message
James Page (james-page) wrote :

utopic and trusty fixes uploaded for sru team review.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Rafael, or anyone else affected,

Accepted pacemaker into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in pacemaker (Ubuntu Trusty):
status: New → Fix Committed
tags: added: verification-needed
Changed in pacemaker (Ubuntu Utopic):
status: New → Fix Committed
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Rafael, or anyone else affected,

Accepted pacemaker into utopic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-4ubuntu3.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Peter Parzer (peter-parzer) wrote : Re: Pacemaker can seg fault on crm node online/standby

Hi Brian,

this fix did not solve the bug for me. I still get the following error message every 2 minutes:

Jan 7 08:28:25 kjp02 stonith-ng[1868]: error: crm_abort: crm_glib_handler: Forked child 4647 to record non-fatal assert at logging.c:63 : Source ID 28 was not found when attempting to remove it

Since it is an production system I can not play with crm standby/online, but I suspect that it will crash again as it did before, because this error was the cause for all previous crashes.

I use 14.04.1 amd64, 3.13.0-43-generic

# apt-cache policy pacemaker
pacemaker:
  Installiert: 1.1.10+git20130802-1ubuntu2.2
  Installationskandidat: 1.1.10+git20130802-1ubuntu2.2
  Versionstabelle:
 *** 1.1.10+git20130802-1ubuntu2.2 0
        400 http://archive.ubuntu.com/ubuntu/ trusty-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     1.1.10+git20130802-1ubuntu2.1 0
        500 http://archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
     1.1.10+git20130802-1ubuntu2 0
        500 http://archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

Peter

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Per comment #10:

"
Those error messages from glib (not being able to remove the resource), that are still there :

Oct 31 00:30:20 [2054] clustertrusty03 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 2 197 to record non-fatal assert at logging.c:63 : Source ID 15 was not found when attempting to remove it

Can be interpreted as normal and "non-fatal".
"

Upstream discussion (explaining behavior, fix and error messages) can be found here:

http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/thread.html#22690

If you, together with error messages (non fatal and expected on some conditions like the ones we can trigger here with this testcase) face a crash, please attach the crash report to this case.

Thank you very much

Best Regards

Rafael Tinoco

Revision history for this message
Peter Parzer (peter-parzer) wrote :
Download full text (3.8 KiB)

I also experience crashes of stonithd, alone 2 times yesterday, always on both nodes at the same time. Here is the stack trace:

root@kjp03:/var/crash# apport-retrace -Rs _usr_lib_pacemaker_stonithd.0.crash
E: Can not find version '1.1.10+git20130802-1ubuntu2.2' of package 'pacemaker'
E: Quellpaket für pacemaker kann nicht gefunden werden.
--- stack trace ---
#0 0x00007ffa6f17abb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 40008
        selftid = 40008
#1 0x00007ffa6f17dfc8 in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 17179869185, 140713634797360, 140713634496512, 0, 140734633026224, 140713582943175, 140713586093704, 140734633026160, 397168, 32, 140713586088608, 0, 140713586088608, 140713582942786, 140713579551566}}, sa_flags = 1876903824, sa_restorer = 0x3f}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007ffa6fdcf6c9 in crm_abort (file=0x7ffa6fdf34bb "logging.c", function=0x7ffa6fdf4790 <__PRETTY_FUNCTION__.22958> "crm_glib_handler", line=63, assert_condition=0x7ffa72376ce0 "Source ID 541 was not found when attempting to remove it", do_core=<optimized out>, do_fork=<optimized out>) at utils.c:1118
        rc = 0
        pid = <optimized out>
        status = 0
        __func__ = "crm_abort"
#3 0x00007ffa6ee8bae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#4 0x00007ffa6ee8bd72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#5 0x00007ffa6ee83c5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#6 0x00007ffa6f999ef5 in stonith_action_clear_tracking_data (action=action@entry=0x7ffa723350b0) at st_client.c:536
No locals.
#7 0x00007ffa6f999f2d in stonith_action_destroy (action=0x7ffa723350b0) at st_client.c:557
No locals.
#8 0x00007ffa6fde7cd9 in child_waitpid (child=child@entry=0x7ffa7236bb20, flags=flags@entry=1) at mainloop.c:948
        rc = <optimized out>
        core = <optimized out>
        signo = 0
        status = 0
        exitcode = 0
        __func__ = "child_waitpid"
#9 0x00007ffa6fde7fce in child_death_dispatch (signal=<optimized out>) at mainloop.c:962
        saved = 0x0
        child = 0x7ffa7236bb20
        iter = 0x7ffa7222d200
        exited = <optimized out>
        __func__ = "child_death_dispatch"
#10 0x00007ffa6fde6de7 in crm_signal_dispatch (source=0x7ffa7236ba50, callback=<optimized out>, userdata=<optimized out>) at mainloop.c:275
        __func__ = "crm_signal_dispatch"
#11 0x00007ffa6ee84e04 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#12 0x00007ffa6ee85048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#13 0x00007ffa6ee8530a in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
No symbol table info available.
#14 0x00007ffa702282a9 in main (argc=<optimized out>, argv=<optimized out>) at main.c:1136
        flag = <optimized out>
        lpc = 0
        argerr = <optimize...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Could you provide your cluster configuration (cib file with configured stonith resources and parameters) and all packages versions (dpkg -l) ? I'll try to reproduce what you are facing.. tks

Revision history for this message
Peter Parzer (peter-parzer) wrote :

The cluster consists of 2 HP ProLiant DL120 G7 Rack-Server as file server with DRBD and Samba. I used the same configuration with 12.04 for 2 years without any problems.

The cluster configuration:

node $id="167772161" kjp02 \
 attributes standby="off"
node $id="167772162" kjp03 \
 attributes standby="off"
primitive drbd ocf:linbit:drbd \
 params drbd_resource="srv" \
 op monitor interval="29" role="Master" \
 op monitor interval="31" role="Slave"
primitive ip ocf:heartbeat:IPaddr2 \
 params ip="161.42.184.40" \
 op monitor interval="30" \
 meta target-role="Started"
primitive mail ocf:heartbeat:MailTo \
 params email="root" \
 meta target-role="Started"
primitive nmb upstart:nmbd \
 op monitor interval="60" \
 meta target-role="Started"
primitive quota lsb:quota \
 op monitor interval="60" \
 op start timeout="300" interval="0" \
 meta target-role="Started"
primitive smb upstart:smbd \
 op monitor interval="60" \
 meta target-role="Started"
primitive srv ocf:heartbeat:Filesystem \
 op monitor interval="60" \
 params device="/dev/drbd0" directory="/srv" fstype="ext4" options="noatime,acl,usrquota,user_xattr" \
 meta target-role="Started"
primitive st_kjp02 stonith:external/ipmi \
 params hostname="kjp02" ipaddr="161.42.184.42" userid="Administrator" passwd="***" interface="lanplus" \
 op monitor interval="120"
primitive st_kjp03 stonith:external/ipmi \
 params hostname="kjp03" ipaddr="161.42.184.44" userid="Administrator" passwd="***" interface="lanplus" \
 op monitor interval="120"
primitive winbind upstart:winbind \
 op monitor interval="60" \
 meta target-role="Started"
ms drbd_ms drbd \
 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location st_kjp02_loc st_kjp02 -inf: kjp02
location st_kjp03_loc st_kjp03 -inf: kjp03
colocation ip_srv inf: ip srv
colocation mail_ip inf: mail ip
colocation nmb_ip inf: nmb ip
colocation quota_srv inf: quota srv
colocation smb_winbind inf: smb winbind
colocation srv_drbd inf: srv drbd_ms:Master
colocation winbind_ip inf: winbind ip
order drbd_srv inf: drbd_ms:promote srv:start
order ip_mail inf: ip mail
order ip_nmb inf: ip nmb
order ip_winbind inf: ip winbind
order srv_ip inf: srv:start ip
order srv_quota inf: srv:start quota
order winbind_smb inf: winbind smb
property $id="cib-bootstrap-options" \
 dc-version="1.1.10-42f2063" \
 cluster-infrastructure="corosync" \
 no-quorum-policy="ignore" \
 last-lrm-refresh="1416995137"

dpkg versions are attached.

Peter

Revision history for this message
Peter Parzer (peter-parzer) wrote :

I set up a testing environment with 2 VMs, trusty-proposed enabled, all updates installed and the following minimal cluster configuration:

node $id="168427521" kjpnode1 \
 attributes standby="off"
node $id="168427522" kjpnode2 \
 attributes standby="on"
primitive st_kjpnode1 stonith:external/ssh \
 params hostlist="kjpnode1"
primitive st_kjpnode2 stonith:external/ssh \
 params hostlist="kjpnode2"
location st_kjpnode1_loc st_kjpnode1 -inf: kjpnode1
location st_kjpnode2_loc st_kjpnode2 -inf: kjpnode2
property $id="cib-bootstrap-options" \
 dc-version="1.1.10-42f2063" \
 cluster-infrastructure="corosync" \
 no-quorum-policy="ignore"

I still get core dumps of stonithd.

Peter

Revision history for this message
Peter Parzer (peter-parzer) wrote :
Revision history for this message
Peter Parzer (peter-parzer) wrote :

If I configure the test environment like my production system (with the exception of the stonith agent, of course) I get additional core dumps of lrmd.

Peter

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Okay,

I'm revisiting this today. Tks for the crash, I'll try to reproduce what you are getting also.

Tinoco

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Peter,

(1) During the test execution, does using more then 2 nodes AND/OR changing "no-quorum-policy" to something else (freeze, stop, suicide) does help ?

(2) Your crash files do not contain the core file, could you please provide me the core file (probably changing ulimit inside /etc/security/limits.conf to allow core file generation).

I'm executing the testcase again on a 4-node lab machine and still no dump (only difference is that I'm running the test on a 4 node setup with quorum policy=stop and ssh fencing (ssh fencing just like your last attempt)).

I'm also curious about this:
ii pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1 amd64 Command line interface utilities for Pacemaker

On the package update, pacemaker-cli-utils did not get update together ?

Thank you

Rafael

PS: Meanwhile my testcase is running to see if I can trigger this problem again (before the patch I was getting core files in between each 10 minutes).

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (3.7 KiB)

For now,

I have tested the following scenarios:

- 4 nodes
- stonith-enabled=true
- no-quorum-policy=stop

AND

- 2 nodes only
- stonith-enabled=true
- no-quorum-policy=ignore

I ran the test case (bug description) for hours and could not get a crash, although I do get the following messages (expected) from time to time:

Jan 19 16:52:23 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 12418 to record non-fatal assert at logging.c:63 : Source ID 73 was not found when attempting to remove it
Jan 19 16:52:23 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 12439 to record non-fatal assert at logging.c:63 : Source ID 74 was not found when attempting to remove it
Jan 19 16:52:38 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 13413 to record non-fatal assert at logging.c:63 : Source ID 76 was not found when attempting to remove it
Jan 19 16:52:38 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 13429 to record non-fatal assert at logging.c:63 : Source ID 77 was not found when attempting to remove it
Jan 19 16:52:52 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 14411 to record non-fatal assert at logging.c:63 : Source ID 79 was not found when attempting to remove it
Jan 19 16:52:52 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 14423 to record non-fatal assert at logging.c:63 : Source ID 80 was not found when attempting to remove it
Jan 19 16:53:07 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15410 to record non-fatal assert at logging.c:63 : Source ID 82 was not found when attempting to remove it
Jan 19 16:53:07 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15427 to record non-fatal assert at logging.c:63 : Source ID 83 was not found when attempting to remove it
Jan 19 16:53:21 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 16409 to record non-fatal assert at logging.c:63 : Source ID 85 was not found when attempting to remove it
Jan 19 16:53:21 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 16425 to record non-fatal assert at logging.c:63 : Source ID 86 was not found when attempting to remove it
Jan 19 16:53:35 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 17408 to record non-fatal assert at logging.c:63 : Source ID 88 was not found when attempting to remove it
Jan 19 16:53:35 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 17420 to record non-fatal assert at logging.c:63 : Source ID 89 was not found when attempting to remove it
Jan 19 16:53:50 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 18409 to record non-fatal assert at logging.c:63 : Source ID 91 was not found when attempting to remove it
Jan 19 16:53:50 [941] trusty01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 18422 to record non-fatal assert at logg...

Read more...

Revision history for this message
Peter Parzer (peter-parzer) wrote :

Here is the crash file from my virtual testing environment, including core. I updated the nodes today with all current packages from trusty-proposed. The core dump happened just after restarting the nodes after the updates.

Revision history for this message
Peter Parzer (peter-parzer) wrote :

And this is the current cluster configuration of the testing environment.

Revision history for this message
Peter Parzer (peter-parzer) wrote :

And here is the last core dump of my production system from today in the morning, also core included.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (3.8 KiB)

Okay,

So the cherry-pick (for version trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on a upstream commit) seems ok since it makes lrmd (services, services_linux) to avoid repeating a timer when the source was already removed from glib main loop context:

example:

+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;

etc...

This actually solved lrmd crashes I was getting with the testcase (explained inside this bug summary).

===
Explanation:
g_source_remove -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022690.html
libglib2 changes -> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022699.html
===

Analyzing your crash file (from stonith and not lrm), it looks like we have the following scenario:

==============

exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
    |_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> call g_resource_remove 2 times
        |_> stonith_action_clear_tracking_data(action);
            |_> g_source_remove(action->timer_sigterm);
                |_> g_critical ("Source ID %u was not found when attempting to remove it", tag);

WHERE
==============

Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy
"Helper that presents a RHCS-style interface for Linux-HA stonith plugins"

This is the script responsible to monitor a stonith resource and it has returned (triggering monitor
callback) with the following data on it:

------ data (begin) ------
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_timeout=20
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------

OBS: This means that fence_legacy returned, after checking that
st_kjpnode2 was ok, and its cleanup operation (callback) caused
the problem we faced.

As soon as it dies, the callback for this process is called:

    if (child->callback) {
        child->callback(child, child->pid, core, signo, exitcode);

In our case, callback is:

0x7f1f6189cec0 <stonith_action_async_done> which calls
0x7f1f6189af10 <stonith_action_destroy> and then
0x7f1f6189ae60 <stonith_action_clear_tracking_data> generating the 2nd removal (g_source_remove)

with the 2nd call to g_source_remove, after glib2.0 change explained before this comment, we get a

g_critical ("Source ID %u was not found when attempting to remove it", tag);

and this generates the crash (since g_glob is called with a critical log_level causing crm_abort to be called).

POSSIBLE CAUSE:
==============

Under <stonith_action_async_done> we have:

stonith_action_t *action = 0x7f1f639f5b50.

    if (action->timer_sigterm > 0) {
        g_source_remove(action->timer_sigterm);
    }
    if (action->timer_sigkill > 0) {
        g_source_remove(action->timer_sigkill);
    }

Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50.
an...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I just found one upstream commit fixing this:

##############

commit 0326f05c9e26f39a394fa30830e31a76306f49c7
Author: Andrew Beekhof <email address hidden>
Date: Thu Aug 7 13:49:24 2014 +1000

    Fix: stonith-ng: Reset mainloop source IDs after removing them

diff --git a/lib/fencing/st_client.c b/lib/fencing/st_client.c
index 64bd8f3..2837682 100644
--- a/lib/fencing/st_client.c
+++ b/lib/fencing/st_client.c
@@ -663,9 +663,11 @@ stonith_action_async_done(mainloop_child_t * p, pid_t pid, int core, int signo,

     if (action->timer_sigterm > 0) {
         g_source_remove(action->timer_sigterm);
+ action->timer_sigterm = 0;
     }
     if (action->timer_sigkill > 0) {
         g_source_remove(action->timer_sigkill);
+ action->timer_sigkill = 0;
     }

     if (action->last_timeout_signo) {

##############

under <stonith_action_async_done>.

Will provide you a hotfix with this fix and ask for feedback.

Thank you

Rafael

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Peter,

Since the bug you are reporting is related to "stonith", I'm separating two cases:

LP: #1368737 -- Pacemaker (lrmd) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove

And

LP: #1412962 -- Pacemaker (stonith) can seg fault in Trusty and Utopic after following message: Source ID XX was not found when attempting to remove it

And I have subscribed you to this second one (where I'll be giving you a PPA for you to test).

Please let's continue the discussion regarding your problem in that particular bug.

Thank you

Rafael Tinoco

summary: - Pacemaker can seg fault on crm node online/standby
+ Pacemaker (lrmd) can seg fault in Trusty and Utopic after following
+ message: Source ID XX was not found when attempting to remove it
tags: added: verification-done
removed: verification-needed
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Brian Murray or James Pages,

I verified this fix for the test case in the description and it worked fine. Meanwhile I had some complains from Peter regarding crashes he was getting into his installation. I opened the following bug:

https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1412962

And provided 1st a fix to stonith cores and then to lrmd cores. I'm attaching in that bug the debdiffs that fixed all the issues Peter was seeing. I'll ask for sponsorship also.

Thank you

Rafael Tinoco

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.10+git20130802-4ubuntu3.1

---------------
pacemaker (1.1.10+git20130802-4ubuntu3.1) utopic; urgency=medium

  [ James Page ]
  * d/control: Ensure that pacemaker binary package uses matched binary
    versions of pacemaker libraries, avoiding upgrade problems (LP: #1382842).

  [ Rafael David Tinoco ]
  * d/p/Fix-lrmd-Prevent-glib-timers-removed-more-than-once.patch:
    Prevent glib assert triggered by timers being removed from mainloop more
    than once, cherry picked from upstream VCS (LP: #1368737).
 -- Rafael David Tinoco <email address hidden> Thu, 18 Dec 2014 10:14:38 +0000

Changed in pacemaker (Ubuntu Utopic):
status: Fix Committed → Fix Released
Revision history for this message
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for pacemaker has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.10+git20130802-1ubuntu2.2

---------------
pacemaker (1.1.10+git20130802-1ubuntu2.2) trusty; urgency=medium

  [ James Page ]
  * d/control: Ensure that pacemaker binary package uses matched binary
    versions of pacemaker libraries, avoiding upgrade problems (LP: #1382842).

  [ Rafael David Tinoco ]
  * d/p/Fix-lrmd-Prevent-glib-timers-removed-more-than-once.patch:
    Prevent glib assert triggered by timers being removed from mainloop more
    than once, cherry picked from upstream VCS (LP: #1368737).
 -- Rafael David Tinoco <email address hidden> Thu, 18 Dec 2014 10:09:39 +0000

Changed in pacemaker (Ubuntu Trusty):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.