gdm hangs altogether after timeout on the gdm socket

Bug #62139 reported by Václav Šmilauer
Affects / Status / Importance / Assigned to:
  gdm: Fix Released, Critical
  gdm (Ubuntu): Fix Released, High, assigned to Ubuntu Desktop Bugs
  Dapper: Invalid, High, Unassigned

Bug Description

Binary package hint: gdm

Since upgrading to dapper final, we have experienced frequent breakage of gdm on an amd64 (64-bit dapper) XDMCP server that regularly serves around 40 clients.

Reproducibility: it happens when, e.g., many people log out at the same time (once every few days); gdm must then be killed and restarted manually, which kills all existing sessions as well.

Symptoms: slave gdm processes continue to work, but the main gdm process does not spawn new slaves, no longer pings existing ones every 15 s as it normally does (per the debug syslog), and does not respond to TERM (it must be KILLed), as if it were waiting for something (a race?).

Logs reveal that the only difference between the normal situation and the bug is the timeout on the gdm socket (I will attach the full log):

Sep 22 14:16:40 [gdm] Sending LOGGED_IN == 0 for slave 317
Sep 22 14:16:40 [gdm] Timeout occurred for sending message LOGGED_IN 317 0

What might be the reason? In slave.c:gdm_slave_send, up to 10 attempts are made to deliver the message (select on &rfds), but select apparently returns an error, since the timeout never expires (otherwise, there would have to be 10 s between sending the message and the timeout).

P.S. I compiled gdm with an added line for tracing the message sending (in daemon/slave.c) and will post results if they are relevant:

@@ -2767,6 +2766,7 @@
                if (in_usr2_signal > 0) {
                        fd_set rfds;
                        struct timeval tv;
+                       int select_retval;

                        FD_ZERO (&rfds);
                        FD_SET (d->slave_notify_fd, &rfds);
@@ -2775,9 +2775,10 @@
                        tv.tv_sec = 1;
                        tv.tv_usec = 0;

-                       if (select (d->slave_notify_fd+1, &rfds, NULL, NULL, &tv) > 0) {
+                       if ((select_retval = select (d->slave_notify_fd+1, &rfds, NULL, NULL, &tv)) > 0) {
                                gdm_slave_handle_usr2_message ();
                        }
+                       if (select_retval < 0) gdm_debug ("TRACE (%s,%d): select returned %d, errno %d (%s)", __FILE__, __LINE__, select_retval, errno, strerror (errno));
                } else {
                        struct timeval tv;
                        /* Wait 1 second. */
@@ -2787,6 +2788,7 @@
                        /* don't want to use sleep since we're using alarm
                           for pinging */
                }
+               gdm_debug ("TRACE (%s,%d): Passed gdm_slave_send cycle, i=%d, in_usr2_signal=%d, wait_for_ack=%d, gdm_got_ack=%d.", __FILE__, __LINE__, i, in_usr2_signal, wait_for_ack, gdm_got_ack);
        }

        if G_UNLIKELY (wait_for_ack &&

Revision history for this message
Václav Šmilauer (eudoxos) wrote :

This is the log from when the error shows up, up to the restart (it happened Friday afternoon; gdm was restarted Monday). Note that the pinging from the main gdm process is no longer present. The "Fatal X error detected." message is present even under normal conditions and hence is not the cause of the bug.

Revision history for this message
Sebastien Bacher (seb128) wrote :

Thanks for your bug. I've forwarded the bug upstream, they know the code better: http://bugzilla.gnome.org/show_bug.cgi?id=357632
What version of gdm do you use? 2.14.6 from dapper or 2.4.10 from dapper-updates?

Changed in gdm:
assignee: nobody → desktop-bugs
importance: Untriaged → High
status: Unconfirmed → Confirmed
Revision history for this message
Sebastien Bacher (seb128) wrote :

Comment from upstream:

"Typically, the problem when the GDM daemon becomes this unresponsive is that it has crashed. Note that GDM works like this:

1. The main GDM daemon starts. This is the process that has the pid listed in
   /var/run/gdm.pid. Note that most programs try to connect to this process
   when they want to communicate. For example gdmflexiserver looks up the
   daemon process Pid in this file.

2. For each display, a forked daemon (aka the slave daemon) is started.

3. The forked daemon starts the GUI program (gdmlogin/gdmgreeter).

Note it can look like the daemons are still running even when the main daemon has crashed, since the forked slave daemons (#2 above) may still be running.

If the main daemon is crashing, has it left a core file somewhere? Look for
GDM related core files in the /, /var/lib/gdm, /var/lib/log/gdm, or the GDM
user's $HOME directory (if any).

I suspect all of the above won't help you since it sounds like the main daemon
isn't crashing. I just mention this so you can double check, since this is
typically the problem for this sort of symptom.

GDM uses a couple of different mechanisms to communicate between the main
daemon and other processes. There are the GDM_SUP* messages that are used by
various programs to query the daemon about different things. What happens when
you run this command? It tests the GDM_SUP* messages. Does it work?

   gdmflexiserver --command=VERSION

The other kind of messages are the GDM_SOP_* messages. For these messages, the
forked slave process sends the main daemon a SIGUSR1 signal and then uses a
fifo protocol for passing the messages. Note the function create_connections
in daemon/gdm.c and note gdm_slave_send in daemon/slave.c. In your syslog, it
seems like messages such as GDM_SOP_LOGGED_IN (aka LOGGED_IN) are failing. So
this is probably where it is breaking. I'm wondering if this is the *only*
kind of message that is breaking, or if GDM_SOP messages in general are broken
for you.

If I were to guess what the problem might be, it could be that GDM may be
executing some code in the SIGUSR1 signal handler that is not signal-handler
safe. Since it works most of the time, but fails in weird situations where
many users log out at once, this might be triggering a normally rare event that
makes the signal handler mad.

Note in slave.c, the gdm_slave_usr2_handler function. Might be worth looking
into this function more closely.

At this point, I'm just giving you some pointers to hopefully narrow down what
the problem may be. Perhaps you could investigate, and if possible add some
more gdm_debug messages so we can have a better understanding of where the
failure may be happening?"

Changed in gdm:
status: Unknown → Unconfirmed
Revision history for this message
Václav Šmilauer (eudoxos) wrote :

I am sorry to say that after trying to make gdm dump core, I did not succeed. Any hints on that? I have "ulimit -c unlimited" (and it is reported by "ulimit -a" afterwards). I put "root core soft unlimited" into /etc/security/limits. After sending SIGSEGV to gdm, it silently disappears, and there is no core in /var/lib/gdm, which is /proc/PID/cwd. Bizarre.

I have already recompiled it with -ggdb3 to get a useful dump, but the absence of a core file rather defeats the purpose. Any hints? After that, I hope things will move on quickly.

Revision history for this message
Václav Šmilauer (eudoxos) wrote :

Hello, the bug was very likely identified as a duplicate of GNOME#352838. (http://bugzilla.gnome.org/show_bug.cgi?id=352838). This patch fixed the problem for the reporter of that bug: http://cvs.gnome.org/viewcvs/gdm2/daemon/gdm.c?r1=1.260&r2=1.260.2.1&makepatch=1&diff_format=u

Please apply it to dapper-updates or dapper-security (since it may cause data loss) as soon as it is feasible.

Revision history for this message
Sebastien Bacher (seb128) wrote :

Thank you for your work on that. That's not a security issue, so it'll not be applied to dapper-security. Did you build a package with the patch, and can you confirm it fixes the issue? The new policy about uploads to stable-updates is rather strict, and having it tested would be better than a "very likely" supposition.

Revision history for this message
Sebastien Bacher (seb128) wrote :

marking as fixed since it's fixed in edgy, and opening a backport task for dapper

Changed in gdm:
status: Confirmed → Fix Released
importance: Undecided → High
status: Unconfirmed → Confirmed
Revision history for this message
Václav Šmilauer (eudoxos) wrote :

"Very likely" means that it segfaults on the same line as the other bug report says, and the cause is a rather obvious possibly-NULL pointer that was not handled in the previous version. I will confirm the fix if it doesn't crash within a few days (production system).

Where can I track the backport task? I was not able to find it at Launchpad. Thanks.

Revision history for this message
Sebastien Bacher (seb128) wrote : Re: [Bug 62139] Re: gdm hangs altogether after timeout on the gdm socket

On Tuesday, 17 October 2006 at 12:41 +0000, Václav Šmilauer wrote:

> Where can I track the backport task? It was not able to find it at
> launchpad. Thanks.

on the same bug page, tasks are listed in the table at the top of the
page

Revision history for this message
Václav Šmilauer (eudoxos) wrote :

Back once again. I just checked whether the patch applies against the exact dapper version and found out that I had accidentally been working with the edgy source, as the patch did not make any sense there. Of course, the gdm that was installed was (accidentally) from early edgy as well (despite /etc/apt/preferences; maybe it was pulled in by some dependency?). I am so ashamed.

Hence no backport will be necessary, and this whole bug may just be marked as invalid. Nice try, though.

Revision history for this message
Sebastien Bacher (seb128) wrote :

thank you for the update, marking the backport task as rejected

Changed in gdm:
status: Confirmed → Rejected
Changed in gdm:
status: Unconfirmed → Rejected
Changed in gdm:
status: Rejected → Fix Released
Changed in gdm:
importance: Unknown → Critical