Drop raexecupstart.patch and fix_lrmd_leak.patch to not cause socket leak in lrmd.

Bug #821732 reported by Wolfgang Scherer
Affects                        Status        Importance  Assigned to
cluster-glue (Ubuntu)          Fix Released  High        Andres Rodriguez
cluster-glue (Ubuntu Oneiric)  Won't Fix     High        Andres Rodriguez
cluster-glue (Ubuntu Precise)  Fix Released  High        Andres Rodriguez

Bug Description

ii cluster-glue 1.0.7-3ubuntu2 The reusable cluster components for Linux HA

The commands `crm ra classes` and `crm ra list` cause a socket leak in the lrmd daemon.

When approx. 1024 sockets are allocated, the lrmd becomes unresponsive and must be killed.
The syslog then shows repeated entries:

  Aug 3 10:25:08 server lrmd: [1941]: ERROR: socket_accept_connection: accept(sock=6): Too many open files

While I only use these commands during development, it is still a nuisance.

The leak does not appear for other commands, e.g. `crm resource
list`, but I have not tested exhaustively.
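A minimal way to watch the leak grow is to poll lrmd's open file descriptors while driving the leaking command in a loop. This is a sketch, not part of the original report: it assumes a running lrmd, a Linux /proc filesystem, and that `pidof` and the crm shell are available; the iteration count is arbitrary.

```shell
# Count open file descriptors for a PID via /proc (Linux only).
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Drive the leaking command repeatedly and watch lrmd's fd count climb.
lrmd_pid=$(pidof lrmd)
for i in $(seq 1 20); do
    crm ra classes > /dev/null
    echo "iteration $i: $(count_fds "$lrmd_pid") open fds"
done
```

Each iteration should leave the fd count unchanged on a fixed lrmd; a steadily climbing count confirms the leak.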

I originally reported this bug at http://developerbugs.linux-foundation.org/show_bug.cgi?id=2626.

There I was informed that the behavior most likely stems from an
unsupported patch (raexecupstart.patch) in the Ubuntu package.
When I remove that patch, the socket leak does indeed go away.

Although I did not have any "deadlock" situations with the original
code, I replaced it with the attached patch which should prevent any
possible recursive calls of the `on_remove_client' function.

*******************************
After further investigation it was determined that the problem was in glib itself; on the latest releases of Ubuntu the patch was not needed, and these patches were in fact creating the socket leak.

Tags: patch
Revision history for this message
Wolfgang Scherer (wolfgang-scherer) wrote :
Ante Karamatić (ivoks)
Changed in cluster-glue (Ubuntu):
status: New → Confirmed
assignee: nobody → Ante Karamatić (ivoks)
Revision history for this message
Ante Karamatić (ivoks) wrote :

Hi Wolfgang

I've tested your patch and I didn't have luck with it (a while loop of `crm ra classes` still brings lrmd to its knees; the socket count hits the limit). Does that patch work for you?
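A stress loop along the lines described above can be sketched as follows (an illustrative reproduction, not the exact loop used; it assumes a running lrmd, and the `lrm_callback_sock` grep pattern follows the test steps given later in this report):

```shell
# Hammer lrmd with the leaking command until the fd limit is reached,
# printing the callback-socket count after each round.
while :; do
    crm ra classes > /dev/null 2>&1
    lsof -p "$(pidof lrmd)" | grep -c lrm_callback_sock
done
```

On an affected system the printed count climbs toward the per-process fd limit (around 1024), after which lrmd logs "Too many open files" and stops responding.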

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "avoid recursive invocation of on_remove_client" of this bug report has been identified as being a patch. The ubuntu-reviewers team has been subscribed to the bug report so that they can review the patch. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-sponsors please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch
Revision history for this message
Wolfgang Scherer (wolfgang-scherer) wrote : Re: socket leak in lrmd

Hi Ante,

the patch does indeed work for me.

Here is exactly what I did:

dpkg-source -x cluster-glue_1.0.7-3ubuntu2.dsc

> dpkg-source: info: extracting cluster-glue in cluster-glue-1.0.7
> dpkg-source: info: unpacking cluster-glue_1.0.7.orig.tar.bz2
> dpkg-source: info: unpacking cluster-glue_1.0.7-3ubuntu2.debian.tar.gz
> dpkg-source: info: applying raexecupstart.patch

patch -R -p 0 <cluster-glue-1.0.7/debian/patches/raexecupstart.patch
patch -p 0 <bug-check-lrmd.dif
cd cluster-glue-1.0.7
dpkg-buildpackage
dpkg -i cluster-glue_1.0.7-3ubuntu2_amd64.deb

Revision history for this message
Ante Karamatić (ivoks) wrote :

Which Ubuntu version is that?

I've noticed that with the same source built on Lucid and Oneiric I get different results. On Oneiric it works; on Lucid it doesn't.

Changed in cluster-glue (Ubuntu):
assignee: Ante Karamatić (ivoks) → Andres Rodriguez (andreserl)
Changed in cluster-glue (Ubuntu Oneiric):
assignee: nobody → Andres Rodriguez (andreserl)
Changed in cluster-glue (Ubuntu Precise):
importance: Undecided → High
Changed in cluster-glue (Ubuntu Oneiric):
importance: Undecided → High
status: New → Incomplete
status: Incomplete → Confirmed
Revision history for this message
Wolfgang Scherer (wolfgang-scherer) wrote :

Ubuntu natty.
I have not checked lucid.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cluster-glue - 1.0.8-2ubuntu1

---------------
cluster-glue (1.0.8-2ubuntu1) precise; urgency=low

  * debian/patches (LP: #821732):
    - raexecupstart.patch: Drop as this does not fix the leak issue.
    - fix_lrmd_leak.patch: Add new patch that correctly fixes the issue.
 -- Andres Rodriguez <email address hidden> Mon, 07 Nov 2011 14:49:50 -0500

Changed in cluster-glue (Ubuntu Precise):
status: Confirmed → Fix Released
Revision history for this message
Ante Karamatić (ivoks) wrote :

Maverick gives the same results as lucid. I believe a change in glib between maverick and natty solved this problem.

Revision history for this message
Ante Karamatić (ivoks) wrote :

For Lucid and Maverick, Wolfgang's patch for cluster-glue isn't enough. The patch is good, but glib has an issue. There is an upstream fix for it:

https://mail.gnome.org/archives/commits-list/2010-November/msg01816.html

The attached patch was tested on Lucid. With Wolfgang's patch for cluster-glue, both the deadlock and the socket leaks are eliminated.

Revision history for this message
Ante Karamatić (ivoks) wrote :

How to test the bug and the fix

Install Lucid.
Add the ubuntu-ha-maintainers PPA and update the repo:
        apt-add-repository ppa:ubuntu-ha-maintainers/ppa ; apt-get update
Install pacemaker:
        apt-get -y install pacemaker
Enable corosync (/etc/default/corosync) and start it:
        sed -i -e 's/START=no/START=yes/' /etc/default/corosync ; \
        service corosync start
Open a few client->server connections:
        lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
Check number of open sockets:
        lsof -f | grep lrm_callback_sock | wc -l
The correct value is 2, but it will be 6 or 8. There's a socket leak.

Stop corosync:
        service corosync stop
Add ppa:ivoks/ha:
        apt-add-repository ppa:ivoks/ha ; apt-get update ; apt-get -y upgrade
Start corosync:
        service corosync start
Repeat the test with client->server connections:
        lrmadmin -C ; lrmadmin -C
It deadlocks on the second run.

Kill lrmd and stop corosync:
        killall -9 lrmd ; service corosync stop
Add ppa:ivoks/glib:
        apt-add-repository ppa:ivoks/glib ; apt-get update ; apt-get -y upgrade
Start corosync:
        service corosync start
Run the test again:
        lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
It doesn't deadlock.
Check the socket count:
        lsof -f | grep lrm_callback_sock | wc -l
It's 2. Sockets do not leak and glib doesn't deadlock.
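The pass/fail check from the steps above can be wrapped in a small script (a sketch only; it assumes lrmadmin and lsof behave as in the walkthrough, and the expected count of 2 is taken from the steps above):

```shell
# Open a handful of client->server connections, then verify that only
# the expected two lrm_callback_sock descriptors remain open.
lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C

count=$(lsof -f | grep -c lrm_callback_sock)
if [ "$count" -eq 2 ]; then
    echo "PASS: no socket leak ($count sockets)"
else
    echo "FAIL: $count sockets open, expected 2"
fi
```

Run it once after each upgrade step in the walkthrough; a FAIL with 6 or 8 sockets reproduces the leak, and a PASS confirms the glib fix.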

Revision history for this message
Ante Karamatić (ivoks) wrote :

Actually, if both Wolfgang's patch and the raexecupstart patch are dropped, lrmd/lrmclient will work as expected in 11.04, 11.10 and Precise.

Once glib is fixed in 10.04 and 10.10, we can drop the raexecupstart patch in cluster-glue for those versions.

So, Andres, please drop all cluster-glue patches in Precise. Please remove the raexecupstart patch in cluster-glue for 11.04 and 11.10 and ask for SRUs.

For 10.04 and 10.10, we need to get the glib fix SRU'd first and then the cluster-glue SRU, removing the raexecupstart patch.

summary: - socket leak in lrmd
+ Drop raexecupstart.patch and fix_lrmd_leak.patch to not cause socket
+ leak in lrmd.
description: updated
Changed in cluster-glue (Ubuntu Precise):
status: Fix Released → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cluster-glue - 1.0.8-2ubuntu2

---------------
cluster-glue (1.0.8-2ubuntu2) precise; urgency=low

  * debian/patches/fix_lrmd_leak.patch: Drop as the issue was in glib and it
    is now fixed. (LP: #821732)
 -- Andres Rodriguez <email address hidden> Thu, 10 Nov 2011 13:22:47 -0500

Changed in cluster-glue (Ubuntu Precise):
status: In Progress → Fix Released
Revision history for this message
Jacob Smith (jsmith-argotecinc) wrote :

Not sure these symptoms would be the same on later versions, but on Lucid I also encountered the following due to this bug:

service corosync stop - never completes; hangs waiting for crmd to shut down (over 6 hours without a change)
crm ra info ocf:xx:xx - hangs the crm shell
crm configure primitive p_test ocf: - hung when trying to use tab completion of ocf:<tab>

Revision history for this message
Rolf Leggewie (r0lf) wrote :

oneiric has seen the end of its life and is no longer receiving any updates. Marking the oneiric task for this ticket as "Won't Fix".

Changed in cluster-glue (Ubuntu Oneiric):
status: Confirmed → Won't Fix