upstart loses track of ssh daemon after reload ssh

Bug #687535 reported by AlainKnaff
42
This bug affects 7 people
Affects Status Importance Assigned to Milestone
upstart
Invalid
Undecided
Unassigned
openssh (Ubuntu)
Fix Released
High
Clint Byrum
Lucid
Fix Released
High
Unassigned
Maverick
Fix Released
High
Unassigned

Bug Description

When sshd gets a signal 1 for reload, it forks a new process and ditches the old. This causes upstart to believe that ssh has crashed, and loses track of it. A second reload (or any other initctl operation on ssh) will thus say:

reload: Unknown instance:

There would be 2 ways to fix this:
1. Don't have ssh fork on relod, but keep the same pid
2. Use a different mechanism in upstart to keep track of ssh. Maybe a pid file? Just tracking children of the exited ssh won't work, or it might accidentally track a particular session rather than the master, if somebody just happens to log in close to reload time.

openssh-server 1:5.3p1-3ubuntu4
upstart 0.6.5-7

==== Info for Maverick, Lucid SRU ====
IMPACT: if sshd gets a HUP signal, it forks a new process and upstart thinks the process died and loses track of it, so the user/admin uses the ability to stop/start/reload the daemon through upstart.
The problem is fixed in Natty 5.6p1-2ubuntu3. See attached patches for Maverick and Lucid.

TEST CASE:

- install openssh-server
- send a HUP signal to sshd
- the daemon is restarted, but upstart thinks that it crashed (/var/log/daemon.log):

Dec 28 20:59:57 utest-lls32 init: ssh main process ended, respawning
Dec 28 20:59:57 utest-lls32 init: ssh main process (1451) terminated with status 255
Dec 28 20:59:57 utest-lls32 init: ssh main process ended, respawning
Dec 28 20:59:57 utest-lls32 init: ssh main process (1455) terminated with status 255
Dec 28 20:59:57 utest-lls32 init: ssh respawning too fast, stopped

- after this, upstart won't know about sshd, despite the daemon running just fine:

root@utest-lls32:~# reload ssh
reload: Unknown instance:

With the fix applied, the correct behavior is:

- send a HUP signal to sshd
  ps ax |grep sshd
  kill -HUP sshd
- the daemon reloads (/var/log/auth.log):

Dec 28 21:37:01 utest-lls32 sshd[742]: Received SIGHUP; restarting.
Dec 28 21:37:01 utest-lls32 sshd[742]: Server listening on 0.0.0.0 port 22.
Dec 28 21:37:01 utest-lls32 sshd[742]: Server listening on :: port 22.

- reloading with upstart gives the same result, and NOT an error message.

REGRESSION POTENTIAL:

There is a small race condition in sshd between when it forks, and when it listens for incoming connections. The length of the race is lengthened by a very tiny amount by considering sshd started as soon as it has been executed, rather than when it forks. This will only affect jobs that use 'start on started ssh' and immediately connect to it. This is unlikely to cause problems in any real world scenario, given that most of these programs would also have to fork, exec, and open a socket, which is more work than what sshd will be doing in that time.

Related branches

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Calling this low priority since
  kill -9 `pidof sshd`
  start ssh

is a workaround.

Changed in openssh (Ubuntu):
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
AlainKnaff (kubuntu-misc) wrote :

So why bother having an upstart or init at all, and not start and stop everything "manually"?

Similar issue exists with squid (bug 573853)

Upstart is a nice concept, and really improved boot times. It would even be better if it was more reliable. These glitches break web administration tools which call service xxx start/stop/reload to do their thing.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Alain, I understand your frustration. This is actually I think a little more serious than Low, as existence of a workaround only barely mitigates the impact of this.

The problem, I think, is that we're using expect fork, and I'm not sure why, when sshd has -D, and its clearly not reliable for this instance, since the daemon regularly forks and exits.

Raising importance to High, as the reload command *must* be reliable for sshd, as failure of sshd could result in the system being totally unmanageable.

Changed in openssh (Ubuntu):
importance: Low → High
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Also this is not really a new bug in upstart. It is related to bug #530779 which causes issues when the parent dies, but its not the same issue, and in fact, the proc connector probably wouldn't help because the proc connector still doesn't solve the problem of knowing when this fork is actually a good one that needs to be followed as the main process versus a fork of a child that just needs to be reaped.

I'm not going to mark it Invalid yet though. I'd like to get some Upstart experts' opinions on whether or not it makes sense to solve this, and really, bug #530779, by simply saying if the process we're tracking forks, then quits, that we should always track the children as the main process.

Revision history for this message
AlainKnaff (kubuntu-misc) wrote :

> Alain, I understand your frustration. think a little more serious than Low,

Thanks for your insight.

> as existence of a workaround only barely mitigates the impact of this.

Just a note about this workaround: kill -9 `pidof sshd` will "saw off the branch on which you're sitting" if you happen to manage the system remotely. A safer workaround would be kill -9 `cat /var/run/sshd.pid`

> The problem, I think, is that we're using expect fork

I think what's definitely needed is a way to make upstart track processes by pidfile.

> and I'm not sure why, when sshd has -D

Indeed, that seems to work (just tested it). So, why does ssh fork on reload in the case without -D, if it works without this fork with -D?

> I'm not going to mark it Invalid yet though.

Why do you think this bug should be marked invalid (/etc/init/ssh.conf is part of the openssh-server .deb)? Especially after basically agreeing with me... In general, in Ubuntu, triagers are far too quick to mark bugs as invalid, leading to situations where real problems still aren't fixed in the next release, or problems in LTS releases go unfixed for over a year.

> that we should always track the children as the main process.

This is tricky for processes such as ssh, as upstart might end up accidentally tracking one particular session, rather than the master. Just imagine if a user logs in at exactly the moment where the admin issues a reload ssh. Init would then not know which child to track (new master, or the session of the user who just logged in).
And in the case of squid, upstart would not even know that something is amiss as here the issue is about some intermediate process _not_ exiting when it should, so upstart would not know that something is amiss until after the first reload (... causing that first reload to be ignored...).

A safer alternative would be to optionally track by pidfile (which could be monitored by inotify, and re-read after a child exits). Of course, existence of pid listed in pidfile should be verified by sending it a -0 signal, so that upstart wouldn't get confused by stale pid files.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Alain,

Clint just meant marking the upstart task invalid, not the sshd one.

We've discussed on irc, and we'll follow the -D path. Thanks again for raising this bug.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Alain, I meant "Invalid in Upstart", and I'm sorry for the confusion. There is no doubt in my mind that the ssh.conf needs some changes, and should be fixed going back to Lucid.

The notion of pidfiles is generally rejected by the upstart community because "daemons don't get this right". I think the emphasis here is that upstart should be able to properly track processes without trusting their behavior.

Changed in openssh (Ubuntu Lucid):
status: New → Confirmed
Changed in openssh (Ubuntu Maverick):
status: New → Confirmed
Changed in openssh (Ubuntu Lucid):
importance: Undecided → High
Changed in openssh (Ubuntu Maverick):
importance: Undecided → High
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

In looking at the code in OpenSSH, the -D flag only affects this portion of sshd.c, around line 1740:

    /*
     * If not in debugging mode, and not started from inetd, disconnect
     * from the controlling terminal, and fork. The original process
     * exits.
     */
    if (!(debug_flag || inetd_flag || no_daemon_flag)) {
#ifdef TIOCNOTTY
        int fd;
#endif /* TIOCNOTTY */
        if (daemon(0, 0) < 0)
            fatal("daemon() failed: %.200s", strerror(errno));

        /* Disconnect from the controlling tty. */
#ifdef TIOCNOTTY
        fd = open(_PATH_TTY, O_RDWR | O_NOCTTY);
        if (fd >= 0) {
            (void) ioctl(fd, TIOCNOTTY, NULL);
            close(fd);
        }
#endif /* TIOCNOTTY */
    }

"no_daemon_flag" isn't used anywhere else in the code.

Consequently, this code is why the daemon forks on reload, because it re-execs itself, causing this code to be hit. When run with -D, there's no call to daemon.

The call to daemon should only chdir to / and close stdin/stdout/stderr, redirecting them to /dev/null. Given that upstart will already have done that, this is a non issue. The code then ioctl's on stdin to get rid of the controlling TTY, which also has no effect when run via upstart since it is connected to /dev/null. I believe this would change the behavior if we were using console owner, and sshd would exit on control-C in console if one was using that, but we're not going to do that, nor would be a reason to do that for sshd.

I tested this on natty and maverick, killing the daemon in various ways, making sure that user sessions stay alive, and that reloads keep track of the daemon. I'm confident that this change is safe and will not cause any operational issues with openssh-server.

I've submitted a merge proposal adding -D and dropping expect fork, which would fix this bug.

Changed in openssh (Ubuntu):
assignee: nobody → Clint Byrum (clint-fewbar)
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 687535] Re: upstart loses track of ssh daemon after reload ssh
Download full text (4.4 KiB)

Not using "expect fork" of course would mean that ssh would be reported as
running before it was actually listening for connections.

On Mon, Dec 13, 2010 at 10:13 PM, Clint Byrum <email address hidden> wrote:

> In looking at the code in OpenSSH, the -D flag only affects this portion
> of sshd.c, around line 1740:
>
>
> /*
> * If not in debugging mode, and not started from inetd, disconnect
> * from the controlling terminal, and fork. The original process
> * exits.
> */
> if (!(debug_flag || inetd_flag || no_daemon_flag)) {
> #ifdef TIOCNOTTY
> int fd;
> #endif /* TIOCNOTTY */
> if (daemon(0, 0) < 0)
> fatal("daemon() failed: %.200s", strerror(errno));
>
> /* Disconnect from the controlling tty. */
> #ifdef TIOCNOTTY
> fd = open(_PATH_TTY, O_RDWR | O_NOCTTY);
> if (fd >= 0) {
> (void) ioctl(fd, TIOCNOTTY, NULL);
> close(fd);
> }
> #endif /* TIOCNOTTY */
> }
>
> "no_daemon_flag" isn't used anywhere else in the code.
>
> Consequently, this code is why the daemon forks on reload, because it
> re-execs itself, causing this code to be hit. When run with -D, there's
> no call to daemon.
>
> The call to daemon should only chdir to / and close stdin/stdout/stderr,
> redirecting them to /dev/null. Given that upstart will already have done
> that, this is a non issue. The code then ioctl's on stdin to get rid of
> the controlling TTY, which also has no effect when run via upstart since
> it is connected to /dev/null. I believe this would change the behavior
> if we were using console owner, and sshd would exit on control-C in
> console if one was using that, but we're not going to do that, nor would
> be a reason to do that for sshd.
>
> I tested this on natty and maverick, killing the daemon in various ways,
> making sure that user sessions stay alive, and that reloads keep track
> of the daemon. I'm confident that this change is safe and will not cause
> any operational issues with openssh-server.
>
> I've submitted a merge proposal adding -D and dropping expect fork,
> which would fix this bug.
>
> ** Changed in: openssh (Ubuntu)
> Assignee: (unassigned) => Clint Byrum (clint-fewbar)
>
> --
> You received this bug notification because you are a member of Upstart
> Developers, which is subscribed to upstart .
> https://bugs.launchpad.net/bugs/687535
>
> Title:
> upstart loses track of ssh daemon after reload ssh
>
> Status in Upstart:
> New
> Status in “openssh” package in Ubuntu:
> Confirmed
> Status in “openssh” source package in Lucid:
> Confirmed
> Status in “openssh” source package in Maverick:
> Confirmed
>
> Bug description:
> When sshd gets a signal 1 for reload, it forks a new process and ditches
> the old. This causes upstart to believe that ssh has crashed, and loses
> track of it. A second reload (or any other initctl operation on ssh) will
> thus say:
>
> reload: Unknown instance:
>
> There would be 2 ways to fix this:
> 1. Don't have ssh fork on relod, but keep the same pid
> 2. Use a different mechanism in upstart to keep track of ssh. Maybe a pid
> file? Just tracking children of the exited ssh won't work, or it might
> accide...

Read more...

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

The code looks something like this:

[ option parsing, config check, etc ]
sshd.c, line 1744: daemon()
[ reinit logs, start random number generator, chdir to /, ignore SIGPIPE ]
sshd.c, line 1774: server_listen() -- socket(), bind(), listen(), etc.

So, there's already a race between considering it started (at the fork caused by daemon()), and sshd listening, right? Its just that now we're considering it started before the config check and option parsing, so that does give any dependent upstart jobs a bigger head start than they already had.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Ah, then ssh already had that race.

In which case -D seems like a good choice

On Mon, Dec 13, 2010 at 11:05 PM, Clint Byrum <email address hidden> wrote:

> The code looks something like this:
>
> [ option parsing, config check, etc ]
> sshd.c, line 1744: daemon()
> [ reinit logs, start random number generator, chdir to /, ignore SIGPIPE ]
> sshd.c, line 1774: server_listen() -- socket(), bind(), listen(), etc.
>
> So, there's already a race between considering it started (at the fork
> caused by daemon()), and sshd listening, right? Its just that now we're
> considering it started before the config check and option parsing, so
> that does give any dependent upstart jobs a bigger head start than they
> already had.
>
> --
> You received this bug notification because you are a member of Upstart
> Developers, which is subscribed to upstart .
> https://bugs.launchpad.net/bugs/687535
>
> Title:
> upstart loses track of ssh daemon after reload ssh
>
> Status in Upstart:
> New
> Status in “openssh” package in Ubuntu:
> Confirmed
> Status in “openssh” source package in Lucid:
> Confirmed
> Status in “openssh” source package in Maverick:
> Confirmed
>
> Bug description:
> When sshd gets a signal 1 for reload, it forks a new process and ditches
> the old. This causes upstart to believe that ssh has crashed, and loses
> track of it. A second reload (or any other initctl operation on ssh) will
> thus say:
>
> reload: Unknown instance:
>
> There would be 2 ways to fix this:
> 1. Don't have ssh fork on relod, but keep the same pid
> 2. Use a different mechanism in upstart to keep track of ssh. Maybe a pid
> file? Just tracking children of the exited ssh won't work, or it might
> accidentally track a particular session rather than the master, if somebody
> just happens to log in close to reload time.
>
>
> # lsb_release -rd
> Description: Ubuntu 10.04.1 LTS
> Release: 10.04
>
> # dpkg -l openssh-server | cat
> Desired=Unknown/Install/Remove/Purge/Hold
> |
> Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name Version
> Description
>
> +++-===========================================-==========================================================-=================================================================
> ii openssh-server 1:5.3p1-3ubuntu4
> secure shell (SSH) server, for secure access
> from remote machines
>
> # dpkg -l upstart
> Desired=Unknown/Install/Remove/Purge/Hold
> |
> Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name Version Description
>
> +++-==============-==============-============================================
> ii upstart 0.6.5-7 event-based init daemon
>
>
>

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openssh - 1:5.6p1-2ubuntu3

---------------
openssh (1:5.6p1-2ubuntu3) natty; urgency=low

  [ Clint Byrum ]
  * debian/openssh-server.ssh.upstart: drop 'expect fork' and run sshd
    with -D to avoid losing track on reload (LP: #687535)

  [ Colin Watson ]
  * Fix up mention-ssh-keygen-on-keychange.patch to match what was actually
    applied to the source tree.
 -- Colin Watson <email address hidden> Tue, 14 Dec 2010 11:52:43 +0000

Changed in openssh (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Imre Gergely (cemc) wrote :

I'm guessing this is waiting for a SRU. Here's the patch for Lucid and Maverick. Description updated for SRU. Please somebody add some notes to 'REGRESSION POTENTIAL' because I have no idea about that...

description: updated
Revision history for this message
Imre Gergely (cemc) wrote :
Revision history for this message
Imre Gergely (cemc) wrote :
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Thanks Imre!

I've updated the 'regression-potential' and subscribed ubuntu-sru and ubuntu-sponsors.

tags: added: patch
description: updated
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Imre, the debdiffs need different changelog lines for SRU's

openssh (1:5.3p1-3ubuntu4.1) lucid-proposed; urgency=low

and

openssh (1:5.5p1-4ubuntu4.1) maverick-proposed; urgency=low

Otherwise they look good to me.

Revision history for this message
Colin Watson (cjwatson) wrote :

Thanks, Imre! I've sponsored these two patches; they're waiting in the SRU queue for approval now.

Changed in openssh (Ubuntu Lucid):
status: Confirmed → In Progress
Changed in openssh (Ubuntu Maverick):
status: Confirmed → In Progress
Revision history for this message
Benjamin Drung (bdrung) wrote :

Colin, you forgot to unsubscribe ubuntu-sponsors.

Revision history for this message
Martin Pitt (pitti) wrote : Please test proposed package

Accepted openssh into maverick-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in openssh (Ubuntu Maverick):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
Martin Pitt (pitti) wrote :

Accepted openssh into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in openssh (Ubuntu Lucid):
status: In Progress → Fix Committed
Revision history for this message
Simon Déziel (sdeziel) wrote :

I confirm that this bug is fixed on Lucid.

$ lsb_release -rd
Description: Ubuntu 10.04.1 LTS
Release: 10.04

$ apt-cache policy openssh-server
openssh-server:
  Installed: 1:5.3p1-3ubuntu5
  Candidate: 1:5.3p1-3ubuntu5
  Version table:
 *** 1:5.3p1-3ubuntu5 0
        500 http://archive.ubuntu.com/ubuntu/ lucid-proposed/main Packages
        100 /var/lib/dpkg/status
     1:5.3p1-3ubuntu4 0
        500 http://ca.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     1:5.3p1-3ubuntu3 0
        500 http://ca.archive.ubuntu.com/ubuntu/ lucid/main Packages

Martin Pitt (pitti)
tags: added: verification-done
Revision history for this message
Colin Watson (cjwatson) wrote :

Benjamin, no, I didn't - I sponsored it myself and it wasn't necessary to add it to the general sponsoring queue.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openssh - 1:5.3p1-3ubuntu5

---------------
openssh (1:5.3p1-3ubuntu5) lucid-proposed; urgency=low

  * debian/openssh-server.ssh.upstart: drop 'expect fork' and run sshd
    with -D to avoid losing track on reload (LP: #687535)
 -- Imre Gergely <email address hidden> Fri, 07 Jan 2011 10:19:13 +0000

Changed in openssh (Ubuntu Lucid):
status: Fix Committed → Fix Released
Revision history for this message
Colin Watson (cjwatson) wrote :

Released for lucid (thanks, Simon!); still needs verification for maverick.

tags: removed: verification-done
Revision history for this message
Imre Gergely (cemc) wrote :

Confirmed fixed on Maverick.

root@utest-mms32:~# lsb_release -rd
Description: Ubuntu 10.10
Release: 10.10

root@utest-mms32:~# apt-cache policy openssh-server
openssh-server:
  Installed: 1:5.5p1-4ubuntu5
  Candidate: 1:5.5p1-4ubuntu5
  Version table:
 *** 1:5.5p1-4ubuntu5 0
        500 http://ro.archive.ubuntu.com/ubuntu/ maverick-proposed/main i386 Packages
        100 /var/lib/dpkg/status
     1:5.5p1-4ubuntu4 0
        500 http://ro.archive.ubuntu.com/ubuntu/ maverick/main i386 Packages

Martin Pitt (pitti)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openssh - 1:5.5p1-4ubuntu5

---------------
openssh (1:5.5p1-4ubuntu5) maverick-proposed; urgency=low

  * debian/openssh-server.ssh.upstart: drop 'expect fork' and run sshd
    with -D to avoid losing track on reload (LP: #687535)
 -- Imre Gergely <email address hidden> Fri, 07 Jan 2011 10:23:21 +0000

Changed in openssh (Ubuntu Maverick):
status: Fix Committed → Fix Released
Revision history for this message
Jan de Haan (jdehaan) wrote :
Download full text (4.9 KiB)

I'm sorry to spoil the fun but I see the same problem on Lucid with the new package.

Only thing that might be of interest is that this is a xen 4 guest.

More info and tests available on request.

root@somehost:/var/log >(zcat daemon.log.3.gz ; zcat daemon.log.2.gz ; cat daemon.log.1 ; cat daemon.log)|grep 'somehost init: ssh'|more
Jan 11 09:24:36 somehost init: ssh main process (3466) terminated with status 255
Jan 11 09:30:12 somehost init: ssh main process (3799) terminated with status 255
Jan 11 11:07:20 somehost init: ssh main process (3971) terminated with status 255
Jan 11 20:44:30 somehost init: ssh main process (369) terminated with status 255
Jan 11 21:54:04 somehost init: ssh main process (423) terminated with status 255
Jan 11 22:08:05 somehost init: ssh main process (1573) terminated with status 255
Jan 15 22:23:03 somehost init: ssh main process (1859) terminated with status 255
Jan 16 18:05:26 somehost init: ssh main process (400) terminated with status 255
Jan 23 20:04:04 somehost init: ssh main process (458) terminated with status 255
Jan 26 14:34:56 somehost init: ssh main process (4856) terminated with status 255
Jan 26 14:35:59 somehost init: ssh main process (21968) terminated with status 255
Jan 30 01:36:19 somehost init: ssh main process (396) terminated with status 255
Jan 30 13:48:46 somehost init: ssh main process (502) terminated with status 255
Jan 31 15:44:56 somehost init: ssh main process (419) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process (379) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (389) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (393) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (406) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (417) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (421) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (424) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (427) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (430) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (433) terminated with status 255
Feb 2 12:01:33 somehost init: ssh main process ended, respawning
Feb 2 12:01:33 somehost init: ssh main process (436) terminated with status 255
Feb 2 12:01:33 somehost init: ssh respawning too fast, stopped
Feb 2 16:39:02 somehost init: ssh main process (357) terminated with status 255
Feb 2 16:39:02 somehost i...

Read more...

Revision history for this message
Jan de Haan (jdehaan) wrote :
Download full text (3.3 KiB)

Hrmmm....

Feb 2 12:01:33 somehost sshd[379]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[379]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[379]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[389]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[389]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[389]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[393]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[393]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[393]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[406]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[406]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[406]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[417]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[417]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[417]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[421]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[421]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[421]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[424]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[424]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[424]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[427]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[427]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[427]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[430]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[430]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[430]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[433]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[433]: error: Bind to port 22 on 5555:666:77:8888::1 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[433]: fatal: Cannot bind any address.
Feb 2 12:01:33 somehost sshd[436]: error: Bind to port 22 on 1.2.3.4 failed: Cannot assign requested address.
Feb 2 12:01:33 somehost sshd[436]: error: Bind to port 22...

Read more...

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Jan, do you have a pidfile for sshd in /var/run ?

If so, you may have hit bug #531912 by running the init.d script.

Try

/etc/init.d/ssh stop

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I'm closing the upstart bug task as Invalid.

Upstart is doing what it was designed to do, following fork()'d pids. This method just can't handle things that re-fork arbitrarily.

Changed in upstart:
status: New → Invalid
Revision history for this message
Steve Langasek (vorlon) wrote :

I'm not convinced this is true, Clint. Should upstart jobs be allowed to spawn processes that then get orphaned from the POV of the process supervisor? Maybe upstart shouldn't be tracking the new PID, but maybe it should instead be reaping any children left behind?

Revision history for this message
Scott James Remnant (scott) wrote : Re: [Bug 687535] Re: upstart loses track of ssh daemon after reload ssh
Download full text (4.0 KiB)

They definitely should be allowed, ssh is actually a canon example of why.

Upstart should supervise the sshd daemon, not the login sub-process
associated with a particular connection, and certainly not any
processes being run inside the login session.

Otherwise "stop ssh" would kill all user logins; and ssh would not
respawn on crash if there was still a used logged in (or running
screen!)

Scott

On Mon, Feb 7, 2011 at 5:51 PM, Steve Langasek
<email address hidden> wrote:
> I'm not convinced this is true, Clint.  Should upstart jobs be allowed
> to spawn processes that then get orphaned from the POV of the process
> supervisor?  Maybe upstart shouldn't be tracking the new PID, but maybe
> it should instead be reaping any children left behind?
>
> --
> You received this bug notification because you are a member of Upstart
> Developers, which is subscribed to upstart .
> https://bugs.launchpad.net/bugs/687535
>
> Title:
>  upstart loses track of ssh daemon after reload ssh
>
> Status in Upstart:
>  Invalid
> Status in “openssh” package in Ubuntu:
>  Fix Released
> Status in “openssh” source package in Lucid:
>  Fix Released
> Status in “openssh” source package in Maverick:
>  Fix Released
>
> Bug description:
>  When sshd gets a signal 1 for reload, it forks a new process and
>  ditches the old. This causes upstart to believe that ssh has crashed,
>  and loses track of it. A second reload (or any other initctl operation
>  on ssh) will thus say:
>
>  reload: Unknown instance:
>
>  There would be 2 ways to fix this:
>  1.  Don't have ssh fork on relod, but keep the same pid
>  2. Use a different mechanism in upstart to keep track of ssh. Maybe a pid file? Just tracking children of the exited ssh won't work, or it might accidentally track a particular session rather than the master, if somebody just happens to log in close to reload time.
>
>  openssh-server  1:5.3p1-3ubuntu4
>  upstart         0.6.5-7
>
>  ==== Info for Maverick, Lucid SRU ====
>  IMPACT: if sshd gets a HUP signal, it forks a new process and upstart thinks the process died and loses track of it, so the user/admin uses the ability to stop/start/reload the daemon through upstart.
>  The problem is fixed in Natty 5.6p1-2ubuntu3. See attached patches for Maverick and Lucid.
>
>  TEST CASE:
>
>  - install openssh-server
>  - send a HUP signal to sshd
>  - the daemon is restarted, but upstart thinks that it crashed (/var/log/daemon.log):
>
>  Dec 28 20:59:57 utest-lls32 init: ssh main process ended, respawning
>  Dec 28 20:59:57 utest-lls32 init: ssh main process (1451) terminated with status 255
>  Dec 28 20:59:57 utest-lls32 init: ssh main process ended, respawning
>  Dec 28 20:59:57 utest-lls32 init: ssh main process (1455) terminated with status 255
>  Dec 28 20:59:57 utest-lls32 init: ssh respawning too fast, stopped
>
>  - after this, upstart won't know about sshd, despite the daemon
>  running just fine:
>
>  root@utest-lls32:~# reload ssh
>  reload: Unknown instance:
>
>  With the fix applied, the correct behavior is:
>
>  - send a HUP signal to sshd
>    ps ax |grep sshd
>    kill -HUP sshd
>  - the daemon reloads (/var/log/auth.log):
>
>  Dec 28 21:37:01 utest-lls...

Read more...

Revision history for this message
Steve Langasek (vorlon) wrote :

On Tue, Feb 08, 2011 at 04:14:59AM -0000, Scott James Remnant wrote:
> They definitely should be allowed, ssh is actually a canon example of
> why.

> Upstart should supervise the sshd daemon, not the login sub-process
> associated with a particular connection, and certainly not any
> processes being run inside the login session.

> Otherwise "stop ssh" would kill all user logins; and ssh would not
> respawn on crash if there was still a used logged in (or running
> screen!)

Good point, comment withdrawn. :)

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message
Scott James Remnant (scott) wrote :
Download full text (4.4 KiB)

That being said, the current process tracker is very much not ideal.

I'm working on a new one, and will probably post in this week in the
form of a test program to be run and played with.

Scott

On Mon, Feb 7, 2011 at 8:48 PM, Steve Langasek
<email address hidden> wrote:
> On Tue, Feb 08, 2011 at 04:14:59AM -0000, Scott James Remnant wrote:
>> They definitely should be allowed, ssh is actually a canon example of
>> why.
>
>> Upstart should supervise the sshd daemon, not the login sub-process
>> associated with a particular connection, and certainly not any
>> processes being run inside the login session.
>
>> Otherwise "stop ssh" would kill all user logins; and ssh would not
>> respawn on crash if there was still a used logged in (or running
>> screen!)
>
> Good point, comment withdrawn. :)
>
> --
> Steve Langasek                   Give me a lever long enough and a Free OS
> Debian Developer                   to set it on, and I can move the world.
> Ubuntu Developer                                    http://www.debian.org/
> <email address hidden>                                     <email address hidden>
>
> --
> You received this bug notification because you are a member of Upstart
> Developers, which is subscribed to upstart .
> https://bugs.launchpad.net/bugs/687535
>
> Title:
>  upstart loses track of ssh daemon after reload ssh
>
> Status in Upstart:
>  Invalid
> Status in “openssh” package in Ubuntu:
>  Fix Released
> Status in “openssh” source package in Lucid:
>  Fix Released
> Status in “openssh” source package in Maverick:
>  Fix Released
>
> Bug description:
>  When sshd gets a signal 1 for reload, it forks a new process and
>  ditches the old. This causes upstart to believe that ssh has crashed,
>  and loses track of it. A second reload (or any other initctl operation
>  on ssh) will thus say:
>
>  reload: Unknown instance:
>
>  There would be 2 ways to fix this:
>  1.  Don't have ssh fork on relod, but keep the same pid
>  2. Use a different mechanism in upstart to keep track of ssh. Maybe a pid file? Just tracking children of the exited ssh won't work, or it might accidentally track a particular session rather than the master, if somebody just happens to log in close to reload time.
>
>  openssh-server  1:5.3p1-3ubuntu4
>  upstart         0.6.5-7
>
>  ==== Info for Maverick, Lucid SRU ====
>  IMPACT: if sshd gets a HUP signal, it forks a new process and upstart thinks the process died and loses track of it, so the user/admin uses the ability to stop/start/reload the daemon through upstart.
>  The problem is fixed in Natty 5.6p1-2ubuntu3. See attached patches for Maverick and Lucid.
>
>  TEST CASE:
>
>  - install openssh-server
>  - send a HUP signal to sshd
>  - the daemon is restarted, but upstart thinks that it crashed (/var/log/daemon.log):
>
>  Dec 28 20:59:57 utest-lls32 init: ssh main process ended, respawning
>  Dec 28 20:59:57 utest-lls32 init: ssh main process (1451) terminated with status 255
>  Dec 28 20:59:57 utest-lls32 init: ssh main process ended, respawning
>  Dec 28 20:59:57 utest-lls32 init: ssh main process (1455) terminated with status 255
>  Dec 28 20:59:57 utest-lls32 init: ssh respawning too fast, sto...

Read more...

Revision history for this message
Mark Goris (mgoris) wrote :

I just updated to the openssh-server_5.3p1-3ubuntu5 package for a Lucid instance, and now I can't get SSH to start. Or, rather, I'm running into what Jan mentioned above, with the respawning/terminating loop that eventually results in init deciding that SSH isn't going to start, and giving up. Something like:

init: ssh main process (430) terminated with status 255
init: ssh main process ended, respawning
init: ssh main process (449) terminated with status 255
init: ssh main process ended, respawning
init: ssh main process (456) terminated with status 255
init: ssh main process ended, respawning
...
init: ssh main process (468) terminated with status 255
init: ssh respawning too fast, stopped

The real trouble here is, this is an EC2 instance that was just restarted, so now I have no means by which I can get on that machine to do any additional troubleshooting or ultimately fix this. But this was not caused by me executing an /etc/init.d/ssh stop, so I don't think this is bug #531912.

So my question would be what now? Downgrade upstart (I just upgraded to 0.6.5-8)? Downgrade openssh-server (I just upgraded to openssh-server_5.3.01-3ubuntu5? Is there something I can do to workaround this? My existing instance my be lost, but what will keep me from running into this again if I build a new EC2 instance with the same package versions?

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

On Wed, 2011-02-23 at 05:19 +0000, gorism wrote:
> I just updated to the openssh-server_5.3p1-3ubuntu5 package for a Lucid
> instance, and now I can't get SSH to start. Or, rather, I'm running
> into what Jan mentioned above, with the respawning/terminating loop that
> eventually results in init deciding that SSH isn't going to start, and
> giving up. Something like:
>
> init: ssh main process (430) terminated with status 255
> init: ssh main process ended, respawning
> init: ssh main process (449) terminated with status 255
> init: ssh main process ended, respawning
> init: ssh main process (456) terminated with status 255
> init: ssh main process ended, respawning
> ...
> init: ssh main process (468) terminated with status 255
> init: ssh respawning too fast, stopped
>
> The real trouble here is, this is an EC2 instance that was just
> restarted, so now I have no means by which I can get on that machine to
> do any additional troubleshooting or ultimately fix this. But this was
> not caused by me executing an /etc/init.d/ssh stop, so I don't think
> this is bug #531912.
>

If you restarted the instance and this is happening, its definitely not
bug #531912, as that one still results in an sshd that works on bootup.

If this is an EBS root instance, can you snapshot it and mount it from
another instance? If so, you should be able to view the contents
of /var/log and paste any errors from sshd. If you can't find anything
there, you can mount the volume r/w and add 'console output'
to /etc/init/ssh.conf and then check the console output when it fails to
start up.

If its fully instance store, I'm not sure there's much you can do though
EC2 experts may have some tricks.

Revision history for this message
Mark Goris (mgoris) wrote :

It is an EBS root instance, and I'll be spinning up a replacement instance as a next step. I'll mount a snapshot of root, check out /var/log, and let you know what I find.

Revision history for this message
Mark Goris (mgoris) wrote :

We did as described above, and the log files did not yield any helpful information. My next step is to revert back to openssh_5.3p1-3ubuntu4, since that seemed to be reliable.

As for the fact that this didn't show up in any testing prior to committing the patch, I would point out that, just like Jan above, our instances are running as Xen guests (since we're running in EC2). So perhaps that has something to do with it.

I can't tell from the bug comments, but I'm curious as to whether or not these sort of changes get tested in a Xen based environment. My guess is that their not, at least not consistently.

tags: added: testcase
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.