Race condition in lightdm greeter setup

Bug #1172752 reported by Veli-Matti Lintu
This bug affects 4 people
Affects: Light Display Manager
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Tested lightdm versions: 1.4.0-0ubuntu2 and 1.6.0-0ubuntu2 on Ubuntu 12.10

I came across a bug in lightdm that causes it to die randomly when starting or after logout. It happens only on some machines; on others it never happens. My best guess is a race condition.

During greeter setup, "lightdm --session-child" is spawned twice. The first call to session_start(), which does the fork()+execlp(), is made from greeter_start(); right after that, handle_login() calls session_stop() + session_start(). session_stop() sends SIGTERM to the child process, but if the child has not yet managed to call execlp(), it still has a handler installed for SIGTERM, because fork() copies the parent's signal handlers to the child. So when session_stop() sends SIGTERM to the child, the parent's handler runs in the child: the signal goes to signal_cb(), which then tells the main lightdm process to die.
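
To make the mechanism concrete, here is a minimal standalone C program (illustrative only, not lightdm code) showing that a child signalled between fork() and exec runs the handler it inherited from the parent:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static void
term_handler (int signum)
{
    /* Only async-signal-safe calls are allowed in a handler */
    const char msg[] = "child ran the handler inherited from the parent\n";
    if (write (STDERR_FILENO, msg, sizeof (msg) - 1) < 0)
        _exit (1);
    _exit (0);
}

int
main (void)
{
    pid_t pid;

    signal (SIGTERM, term_handler); /* parent installs a handler, like signal_cb */

    pid = fork ();
    if (pid == 0)
    {
        sleep (1); /* widen the fork-to-exec window on purpose */
        execlp ("true", "true", NULL);
        _exit (EXIT_FAILURE);
    }

    kill (pid, SIGTERM); /* arrives before the exec, so the inherited handler runs */
    waitpid (pid, NULL, 0);
    return 0;
}

In lightdm the inherited handler does not just print a message - it writes to the signal pipe shared with the daemon, which is what makes the daemon think it was signalled itself.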

I managed to get rid of the problem by switching fork() -> vfork() in session_start(), which suspends the parent until the child has called execlp() and the inherited signal handlers have been cleared. This ensures that the signal sent from session_stop() always ends up only in the child.
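
For comparison, the more conventional alternative would be to reset the dispositions to SIG_DFL in the child right after fork(). This is a sketch only, not the vfork() change I actually made, and a tiny window between fork() and the reset still remains, which vfork() avoids:

    if (session->priv->pid == 0)
    {
        /* Hypothetical variant: restore default dispositions as early as
           possible so an early SIGTERM simply terminates this child instead
           of running the parent's signal_cb() */
        signal (SIGTERM, SIG_DFL);
        signal (SIGINT, SIG_DFL);

        /* Run us again in session child mode */
        execlp ("lightdm",
                "lightdm",
                "--session-child",
                g_strdup_printf ("%d", to_child_output),
                g_strdup_printf ("%d", from_child_input),
                NULL);
        _exit (EXIT_FAILURE);
    }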

The signal handling in lightdm looks quite tricky, so I'm not really sure this is the best way to handle it. Maybe it would be possible to get rid of the first "lightdm --session-child" spawn entirely, making all of this unnecessary?

Relevant pieces of code:

The fork() is called here and the calling process has signal handlers set.

-----------> src/session.c <--------------
gboolean
session_start (Session *session, const gchar *service, const gchar *username, gboolean do_authenticate, gboolean is_interactive, gboolean is_guest)
{
...
    /* Run the child */
    session->priv->pid = fork ();
    if (session->priv->pid < 0)
    {
        g_debug ("Failed to fork session child process: %s", strerror (errno));
        return FALSE;
    }

    if (session->priv->pid == 0)
    {
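        /* NOTE: until execlp() below replaces the process image, this child
           still shares the parent's signal handlers (signal_cb in process.c) */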
        /* Run us again in session child mode */
        execlp ("lightdm",
                "lightdm",
                "--session-child",
                g_strdup_printf ("%d", to_child_output),
                g_strdup_printf ("%d", from_child_input),
                NULL);
        _exit (EXIT_FAILURE);
    }
...
}

void
session_stop (Session *session)
{
    g_return_if_fail (session != NULL);

    if (session->priv->pid > 0)
    {
        g_debug ("Session %d: Sending SIGTERM", session->priv->pid);
        kill (session->priv->pid, SIGTERM);
        // FIXME: Handle timeout
    }
}
-----------> src/session.c <--------------
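
As an aside, the FIXME in session_stop() could be handled with a one-shot GLib timeout that escalates to SIGKILL. A hedged sketch - session_kill_cb and the 5-second delay are names/values I made up, and cancelling the timeout when the child exits normally is glossed over:

static gboolean
session_kill_cb (gpointer data)
{
    Session *session = data;

    if (session->priv->pid > 0)
    {
        g_debug ("Session %d: Did not terminate after SIGTERM, sending SIGKILL", session->priv->pid);
        kill (session->priv->pid, SIGKILL);
    }

    return FALSE; /* one-shot: remove the timeout source */
}

void
session_stop (Session *session)
{
    g_return_if_fail (session != NULL);

    if (session->priv->pid > 0)
    {
        g_debug ("Session %d: Sending SIGTERM", session->priv->pid);
        kill (session->priv->pid, SIGTERM);
        g_timeout_add_seconds (5, session_kill_cb, session);
    }
}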

session_start() is called the first time from greeter_start():

-----------> src/greeter.c <--------------
gboolean
greeter_start (Greeter *greeter, const gchar *service, const gchar *username)
{
...
    result = session_start (greeter->priv->session, service, username, FALSE, FALSE, FALSE);
...
}
-----------> src/greeter.c <--------------

Almost immediately after session_start() is called, reset_session() is called, which then calls session_stop(). After that, session_start() is called again, which spawns a new "lightdm --session-child" process.

-----------> src/greeter.c <--------------
static void
handle_login (Greeter *greeter, guint32 sequence_number, const gchar *username)
{
...
    reset_session (greeter);
...
    /* Run the session process */
    session_start (greeter->priv->authentication_session, service, username, TRUE, is_interactive, FALSE);
...
}
-----------> src/greeter.c <--------------

Because the child sometimes has not reached execlp() before the SIGTERM is sent, the signal ends up here instead of in the callback in session-child.c:

-----------> src/process.c <--------------
static void
signal_cb (int signum, siginfo_t *info, void *data)
{
    /* NOTE: Using g_printerr as can't call g_warning from a signal callback */
    if (write (signal_pipe[1], &info->si_signo, sizeof (int)) < 0 ||
        write (signal_pipe[1], &info->si_pid, sizeof (pid_t)) < 0)
        g_printerr ("Failed to write to signal pipe: %s", strerror (errno));
}

static gboolean
handle_signal (GIOChannel *source, GIOCondition condition, gpointer data)
{
    int signo;
    pid_t pid;
    Process *process;

    if (read (signal_pipe[0], &signo, sizeof (int)) < 0 ||
        read (signal_pipe[0], &pid, sizeof (pid_t)) < 0)
    {
        g_warning ("Error reading from signal pipe: %s", strerror (errno));
        return TRUE;
    }

    g_debug ("Got signal %d from process %d", signo, pid);

    process = g_hash_table_lookup (processes, GINT_TO_POINTER (pid));
    if (process == NULL)
        process = process_get_current ();
    if (process)
        g_signal_emit (process, signals[GOT_SIGNAL], 0, signo);

    return TRUE;
}
-----------> src/process.c <--------------
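
For context, signal_cb() and handle_signal() together implement the classic self-pipe trick: the handler only makes the async-signal-safe write() call, and the main loop picks the signal up from the read end. A minimal standalone sketch of the pattern (not lightdm code):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int signal_pipe[2];

static void
handler (int signum)
{
    /* only async-signal-safe calls allowed in a handler */
    if (write (signal_pipe[1], &signum, sizeof (signum)) < 0)
        _exit (1);
}

int
main (void)
{
    int signo;

    if (pipe (signal_pipe) < 0)
        return 1;
    signal (SIGTERM, handler);

    kill (getpid (), SIGTERM);

    /* lightdm watches the read end with a GIOChannel in the main loop;
       here we just read it synchronously */
    if (read (signal_pipe[0], &signo, sizeof (signo)) == sizeof (signo))
        printf ("got signal %d outside the handler context\n", signo);

    return 0;
}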

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

Not to comment on lightdm itself, which I'm not familiar with, but to rule out a connection it'd be useful to know whether this still happens after the just-released fix for a different race condition in lightdm 1.6.0-0ubuntu2.1 and plymouth 0.8.8-0ubuntu6.1, bug #982889. That one happens on Intel graphics.

The lightdm patch is at http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/raring/lightdm/raring-proposed/revision/91
and the plymouth patch seems to be this one: http://bazaar.launchpad.net/~tjaalton/plymouth/race-fix/revision/1447

Revision history for this message
Veli-Matti Lintu (vmlintu) wrote :

All the testing was done by running lightdm manually over ssh while plymouth was not running, so the two should not be connected.

The behaviour described in this bug has also been seen when running lightdm through Upstart and through LTSP's screen.d system. When running lightdm through Upstart, adding "respawn" to /etc/init/lightdm.conf mitigates the problem: Upstart restarts lightdm when it crashes and users see only some screen flashing.

Running lightdm with the -d option usually prevents the bug from occurring, as it changes the timings.

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

Ok, thanks for the clarification.

Changed in lightdm:
status: New → Confirmed
Changed in lightdm:
importance: Undecided → High
Revision history for this message
Heavy Rail (heavyrail) wrote :

I use LightDM in a multiuser VNC login configuration similar to the one described at http://www.ibm.com/developerworks/linux/library/os-multiuserloginsvnc/index.html

In such a setup the double spawning is a major issue. The reason is that two children are spawned when a user logs in, while only one child is killed when the user logs out. This eventually leads to the "beautiful" moment when a new user cannot log in and there are more than 100 orphaned child lightdm processes in the `ps` output.

I tried the patch mentioned above, but had no luck, because it cures the symptoms rather than the true disease - the double spawning.

Changed in lightdm:
status: Confirmed → Incomplete
Revision history for this message
Robert Ancell (robert-ancell) wrote :

Do you have a lightdm.log after the problem occurs?

It is correct that LightDM spawns two sessions - one is for the greeter and one is for the user session being authenticated.

I think you are correct that there is a race between the fork and the exec - if a signal is received before the exec, it will be caught and sent over the pipe to the main process. (The pipe is closed on exec, so this can't normally happen.)
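
For readers unfamiliar with that mechanism: the pipe descriptors carry the close-on-exec flag, so the kernel closes them atomically when exec succeeds - but until then the forked child can still write to them. A generic sketch of setting the flag (not the actual lightdm code):

#include <fcntl.h>

/* Mark fd close-on-exec: it disappears once exec succeeds, but stays
   usable in the child during the fork-to-exec window */
static int
set_cloexec (int fd)
{
    int flags = fcntl (fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl (fd, F_SETFD, flags | FD_CLOEXEC);
}

On Linux, pipe2 (fds, O_CLOEXEC) sets the flag atomically at creation time, which also avoids a race in multi-threaded programs.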

Revision history for this message
Robert Ancell (robert-ancell) wrote :

Veli-Matti, can you try the fix in lp:~robert-ancell/lightdm/signal-fix and see if that works for you?

I can reproduce the problem by getting a greeter to start authentication twice in a row - in this case it seems likely that the first session child is signalled before the exec occurs, and this causes the daemon to quit. In real life this might be triggered by rapidly switching between users in a greeter (e.g. holding the up/down keys in the Unity greeter).

Changed in lightdm:
status: Incomplete → Triaged
Revision history for this message
Veli-Matti Lintu (vmlintu) wrote :

The fix in comment #7 fixes the problem of the daemon exiting, but it seems to cause another problem.

I just did some testing with the patches applied: when lightdm starts normally, these two child processes are alive when the greeter is shown:

root 3049 3021 0 11:14 pts/1 00:00:00 lightdm --session-child 16 19
root 3115 3021 0 11:14 pts/1 00:00:00 lightdm --session-child 12 21

When the signal-fix patch catches the problem and prevents the parent from killing itself, the child stays alive, so there are now three session-child processes running when the greeter is shown. Here pid 2939 is the first greeter child, the one signalled before exec() that did not get cleaned up by the signal-fix patch. This happens without me touching the keyboard or the mouse.

root 2874 2846 0 11:13 pts/1 00:00:00 lightdm --session-child 16 19
root 2939 2846 0 11:13 pts/1 00:00:00 lightdm --session-child 12 19
root 2940 2846 0 11:13 pts/1 00:00:00 lightdm --session-child 12 21

So the patch fixes the worst symptom, but because the first spawned greeter child does not die when the race occurs, it stays alive and the number of lightdm --session-child processes just keeps growing.
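
For reference, leftover children like pid 2939 would normally be reaped by looping over waitpid() with WNOHANG when SIGCHLD arrives - a standalone sketch of the pattern, not lightdm's actual cleanup path:

#include <sys/types.h>
#include <sys/wait.h>

/* Reap every exited child without blocking; a daemon would call this
   from its SIGCHLD handling and drop each pid from its bookkeeping
   (e.g. the processes hash table in process.c) */
static void
reap_children (void)
{
    pid_t pid;
    int status;

    while ((pid = waitpid (-1, &status, WNOHANG)) > 0)
        ;
}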

Revision history for this message
Robert Ancell (robert-ancell) wrote :

Can anyone confirm this with lightdm 1.9.11 or greater?

Revision history for this message
cbrnr (cbrnr) wrote :

I have similar symptoms on my Arch Linux machine. I always need to log in twice (with the correct credentials), because after the first attempt the screen goes blank for a short moment and returns to the login screen. The second attempt always works. In addition, I noticed that if I wait some time (around a minute) before logging in, the first attempt works as well.

I'm not sure if this is the same problem; let me know if you need more information or if I should open a new bug report.
