Drop raexecupstart.patch and fix_lrmd_leak.patch to not cause socket leak in lrmd.

Bug #821732 reported by Wolfgang Scherer
Affects                        Status        Importance  Assigned to
cluster-glue (Ubuntu)          Fix Released  High        Andres Rodriguez
cluster-glue (Ubuntu Oneiric)  Won't Fix     High        Andres Rodriguez
cluster-glue (Ubuntu Precise)  Fix Released  High        Andres Rodriguez

Bug Description

ii cluster-glue 1.0.7-3ubuntu2 The reusable cluster components for Linux HA

The commands `crm ra classes` and `crm ra list` cause a socket leak in the lrmd daemon.

When approx. 1024 sockets are allocated, the lrmd becomes unresponsive and must be killed.
The syslog then shows repeated entries:

  Aug 3 10:25:08 server lrmd: [1941]: ERROR: socket_accept_connection: accept(sock=6): Too many open files

While I only use these commands during development, it is still a nuisance.

The leak does not appear for other commands, e.g. `crm resource
list`, but I have not tested exhaustively.
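A minimal way to watch the leak grow is to poll lrmd's open file descriptors while driving the leaking command in a loop. This is a sketch, not part of the original report: it assumes a running lrmd, a Linux /proc filesystem, and that `pidof` and the crm shell are available; the iteration count is arbitrary.

```shell
# Count open file descriptors for a PID via /proc (Linux only).
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Drive the leaking command repeatedly and watch lrmd's fd count climb.
lrmd_pid=$(pidof lrmd)
for i in $(seq 1 20); do
    crm ra classes > /dev/null
    echo "iteration $i: $(count_fds "$lrmd_pid") open fds"
done
```

Each iteration should leave the fd count unchanged on a fixed lrmd; a steadily climbing count confirms the leak.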

I originally reported this bug at http://developerbugs.linux-foundation.org/show_bug.cgi?id=2626.

There I was informed that the behavior most likely stems from an
unsupported patch (raexecupstart.patch) in the Ubuntu package.
When I remove that patch, the socket leak does indeed go away.

Although I did not have any "deadlock" situations with the original
code, I replaced it with the attached patch which should prevent any
possible recursive calls of the `on_remove_client' function.

*******************************
After further investigation it was determined that the problem was in glib itself; on the latest releases of Ubuntu the patch was not needed, and these patches were in fact creating the socket leak.

Tags: patch
Revision history for this message
Wolfgang Scherer (wolfgang-scherer) wrote :
Ante Karamatić (ivoks)
Changed in cluster-glue (Ubuntu):
status: New → Confirmed
assignee: nobody → Ante Karamatić (ivoks)
Revision history for this message
Ante Karamatić (ivoks) wrote :

Hi Wolfgang

I've tested your patch and I didn't have luck with it (a while loop of `crm ra classes` still brings lrmd to its knees; the socket count hits the limit). Does that patch work for you?
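A stress loop along the lines described above can be sketched as follows (an illustrative reproduction, not the exact loop used; it assumes a running lrmd, and the `lrm_callback_sock` grep pattern follows the test steps given later in this report):

```shell
# Hammer lrmd with the leaking command until the fd limit is reached,
# printing the callback-socket count after each round.
while :; do
    crm ra classes > /dev/null 2>&1
    lsof -p "$(pidof lrmd)" | grep -c lrm_callback_sock
done
```

On an affected system the printed count climbs toward the per-process fd limit (around 1024), after which lrmd logs "Too many open files" and stops responding.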

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "avoid recursive invocation of on_remove_client" of this bug report has been identified as being a patch. The ubuntu-reviewers team has been subscribed to the bug report so that they can review the patch. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-sponsors please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch
Revision history for this message
Wolfgang Scherer (wolfgang-scherer) wrote : Re: socket leak in lrmd

Hi Ante,

the patch does indeed work for me.

Here is exactly what I did:

dpkg-source -x cluster-glue_1.0.7-3ubuntu2.dsc

> dpkg-source: info: extracting cluster-glue in cluster-glue-1.0.7
> dpkg-source: info: unpacking cluster-glue_1.0.7.orig.tar.bz2
> dpkg-source: info: unpacking cluster-glue_1.0.7-3ubuntu2.debian.tar.gz
> dpkg-source: info: applying raexecupstart.patch

patch -R -p 0 <cluster-glue-1.0.7/debian/patches/raexecupstart.patch
patch -p 0 <bug-check-lrmd.dif
cd cluster-glue-1.0.7
dpkg-buildpackage
dpkg -i cluster-glue_1.0.7-3ubuntu2_amd64.deb

Revision history for this message
Ante Karamatić (ivoks) wrote :

Which Ubuntu version is that?

I've noticed that with the same source built on Lucid and Oneiric I get different results. On Oneiric it works; on Lucid it doesn't.

Changed in cluster-glue (Ubuntu):
assignee: Ante Karamatić (ivoks) → Andres Rodriguez (andreserl)
Changed in cluster-glue (Ubuntu Oneiric):
assignee: nobody → Andres Rodriguez (andreserl)
Changed in cluster-glue (Ubuntu Precise):
importance: Undecided → High
Changed in cluster-glue (Ubuntu Oneiric):
importance: Undecided → High
status: New → Incomplete
status: Incomplete → Confirmed
Revision history for this message
Wolfgang Scherer (wolfgang-scherer) wrote :

Ubuntu natty.
I have not checked lucid.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cluster-glue - 1.0.8-2ubuntu1

---------------
cluster-glue (1.0.8-2ubuntu1) precise; urgency=low

  * debian/patches (LP: #821732):
    - raexecupstart.patch: Drop as this does not fix the leak issue.
    - fix_lrmd_leak.patch: Add new patch that correctly fixes the issue.
 -- Andres Rodriguez <email address hidden> Mon, 07 Nov 2011 14:49:50 -0500

Changed in cluster-glue (Ubuntu Precise):
status: Confirmed → Fix Released
Revision history for this message
Ante Karamatić (ivoks) wrote :

Maverick gives the same results as lucid. I believe a change in glib between maverick and natty solved this problem.

Revision history for this message
Ante Karamatić (ivoks) wrote :

For Lucid and Maverick, Wolfgang's patch for cluster-glue isn't enough. The patch is good, but glib has an issue. There is an upstream fix for it:

https://mail.gnome.org/archives/commits-list/2010-November/msg01816.html

The attached patch was tested on Lucid. With Wolfgang's patch for cluster-glue, both the deadlock and the socket leaks are eliminated.

Revision history for this message
Ante Karamatić (ivoks) wrote :

How to test the bug and the fix

Install Lucid.
Add the ubuntu-ha-maintainers PPA and update the repo:
        apt-add-repository ppa:ubuntu-ha-maintainers/ppa ; apt-get update
Install pacemaker:
        apt-get -y install pacemaker
Enable corosync (/etc/default/corosync) and start it:
        sed -i -e 's/START=no/START=yes/' /etc/default/corosync ; \
        service corosync start
Open a few client->server connections:
        lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
Check number of open sockets:
        lsof -f | grep lrm_callback_sock | wc -l
The correct value is 2, but it will be 6 or 8. There's a socket leak.

Stop corosync:
        service corosync stop
Add ppa:ivoks/ha:
        apt-add-repository ppa:ivoks/ha ; apt-get update ; apt-get -y upgrade
Start corosync:
        service corosync start
Repeat the test with client->server connections:
        lrmadmin -C ; lrmadmin -C
It deadlocks on the second run.

Kill lrmd and stop corosync:
        killall -9 lrmd ; service corosync stop
Add ppa:ivoks/glib:
        apt-add-repository ppa:ivoks/glib ; apt-get update ; apt-get -y upgrade
Start corosync:
        service corosync start
Run the test again:
        lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
It doesn't deadlock.
Check the socket count:
        lsof -f | grep lrm_callback_sock | wc -l
It's 2. Sockets do not leak and glib doesn't deadlock.
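The pass/fail check from the steps above can be wrapped in a small script (a sketch only; it assumes lrmadmin and lsof behave as in the walkthrough, and the expected count of 2 is taken from the steps above):

```shell
# Open a handful of client->server connections, then verify that only
# the expected two lrm_callback_sock descriptors remain open.
lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C

count=$(lsof -f | grep -c lrm_callback_sock)
if [ "$count" -eq 2 ]; then
    echo "PASS: no socket leak ($count sockets)"
else
    echo "FAIL: $count sockets open, expected 2"
fi
```

Run it once after each upgrade step in the walkthrough; a FAIL with 6 or 8 sockets reproduces the leak, and a PASS confirms the glib fix.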

Revision history for this message
Ante Karamatić (ivoks) wrote :

Actually, if both Wolfgang's patch and the raexecupstart patch are dropped, lrmd/lrmclient will work as expected in 11.04, 11.10 and Precise.

Once glib is fixed in 10.04 and 10.10, we can drop the raexecupstart patch in cluster-glue for those versions.

So, Andres, please drop all cluster-glue patches in Precise. Please remove the raexecupstart patch in cluster-glue for 11.04 and 11.10 and ask for SRUs.

For 10.04 and 10.10, we need to get the glib fix SRU'd first and then the cluster-glue SRU, removing the raexecupstart patch.

summary: - socket leak in lrmd
+ Drop raexecupstart.patch and fix_lrmd_leak.patch to not cause socket
+ leak in lrmd.
description: updated
Changed in cluster-glue (Ubuntu Precise):
status: Fix Released → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cluster-glue - 1.0.8-2ubuntu2

---------------
cluster-glue (1.0.8-2ubuntu2) precise; urgency=low

  * debian/patches/fix_lrmd_leak.patch: Drop as the issue was in glib and it
    is now fixed. (LP: #821732)
 -- Andres Rodriguez <email address hidden> Thu, 10 Nov 2011 13:22:47 -0500

Changed in cluster-glue (Ubuntu Precise):
status: In Progress → Fix Released
Revision history for this message
Jacob Smith (jsmith-argotecinc) wrote :

Not sure these symptoms would be the same on later versions, but on Lucid I also encountered the following due to this bug:

service corosync stop - never completes; hangs waiting for crmd to shut down (over 6 hours without a change)
crm ra info ocf:xx:xx - hangs the crm shell
crm configure primitive p_test ocf: - hung when trying to use tab completion of ocf:<tab>

Revision history for this message
Rolf Leggewie (r0lf) wrote :

oneiric has seen the end of its life and is no longer receiving any updates. Marking the oneiric task for this ticket as "Won't Fix".

Changed in cluster-glue (Ubuntu Oneiric):
status: Confirmed → Won't Fix