upstart fails to start system into multiuser mode

Bug #506727 reported by Anand Kumria
This bug affects 3 people
Affects: mountall (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Unfortunately I am not able to get upstart to start my system.

Initially this was because of #505530, but with the supplied patch applied, mountall succeeds and I am no longer dropped into 'mountall-shell.conf'

Instead the system boots, Upstart runs but it never gets to the point where it goes to runlevel 2 (i.e. start rc RUNLEVEL=2).

Unfortunately I do not know of any way I can capture the output of '/sbin/init -v --debug' (which might have some clues); even booting the system with 'init=/bin/bash' and then trying to redirect things does not work.

The only way I have been able to bring my system up is by:
 - booting with 'init=/bin/bash'
 - modifying /etc/init/mountall.conf and placing in its 'post-script script' the line 'start tty2'
 - logging in and running X (manually in failsafe mode, normal mode does not work).

The kernel version is 2.6.32-9-generic (2.6.32-10-generic fails to boot) and upstart is 0.6.3-11

What would be really useful is a command line option to tell you what upstart (init) would run, in what order, but not actually do it. What would also be useful is something to indicate which jobs had already been run.

Any debugging assistance would be greatly appreciated.

bmarsh (bmarsh-bmarsh) wrote :

The most recent update to upstart (sorry, I don't have a version number because I had to back it out of my system) may be related to the above problem. I am running 9.10 with Kubuntu installed, with XFCE placed on top of that.

When the system boots with ALL the latest updates as of 1-13-2010 applied, many services fail to start, but I can get into XFCE, meaning that KDM did get started. Networking also started, but sshd, postfix, and many other services fail to start. Under these conditions I also cannot get out of XFCE and into a real console session: all I get is a cursor, no login prompt.

I have narrowed the problem down to just the recent UPSTART update. All other updates including kernel 2.6.31-17 run just fine.

Johan Kiviniemi (ion) wrote :

Please create /etc/init/sulogin.conf from http://upstart.ubuntu.com/wiki/OMGBroken and boot the system. When the startup hangs, go to the sulogin virtual console and run ‘ifup lo’.

If this triggers a successful system startup, there’s a problem getting the lo interface automatically up that requires investigation.

Johan Kiviniemi (ion) wrote :

(The latest upstart version makes the rc-sysinit job wait for the lo interface.)

Anand Kumria (wildfire) wrote :

Hi Johan,

I've done the whole sulogin thing -- the 'lo' device might not be brought up correctly.

But that is not the problem here, as 'ifup lo' changes nothing.

Here is what I end up with for 'lo':

eve:[~]% ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

This is what should have occurred.

eve:[~]% ifup --no-act --verbose --force lo
Configuring interface lo=lo (inet)
run-parts --verbose /etc/network/if-pre-up.d
ip link set dev lo up
ip addr add 127.0.0.2 dev lo
run-parts --verbose /etc/network/if-up.d

So, when I do: 'ifdown lo' and 'ifup lo', I get:

eve:[~]% ifdown lo && ifup lo
RTNETLINK answers: No such process
run-parts: /etc/network/if-down.d/multicast-down exited with return code 2

eve:[~]% ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet 127.0.0.2/32 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

Any other suggestions?

Anand
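
The `ip addr show lo` outputs quoted above can be compared mechanically. The following is a minimal sketch, assuming only the output format shown in this comment; `list_inet_addrs` and `has_loopback_addr` are hypothetical helper names, not part of iproute2 or ifupdown.

```shell
# Print every IPv4 address found in `ip addr show` output read from stdin.
# (Hypothetical helper for illustration; it only parses text.)
list_inet_addrs() {
    # Lines of interest look like: "    inet 127.0.0.1/8 scope host lo"
    awk '$1 == "inet" { print $2 }'
}

# Succeed only if the text on stdin lists the usual loopback address.
has_loopback_addr() {
    list_inet_addrs | grep -qxF '127.0.0.1/8'
}
```

On a live system this might be used as `ip addr show lo | list_inet_addrs`. Run against the second output above, it would list both 127.0.0.1/8 and the extra 127.0.0.2/32 address.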

Changed in upstart:
status: New → Confirmed
Scott James Remnant (Canonical) (canonical-scott) wrote :

This does not appear to be an upstream upstart bug, so removing from its bug tracker and reassigning to Ubuntu

affects: upstart → null
Changed in null:
status: Confirmed → Invalid
Anand Kumria (wildfire)
Changed in ubuntu:
assignee: nobody → Upstart Developers (upstart-devel)
Johan Kiviniemi (ion)
Changed in ubuntu:
assignee: Upstart Developers (upstart-devel) → nobody
Anand Kumria (wildfire) wrote :

Confirming. A bit tired of upstream not wanting to assist.

Changed in ubuntu:
status: New → Confirmed
Anand Kumria (wildfire) wrote :

Note - it actually does fail in the direct upstream project (yes, I did try after Scott re-assigned it to Null without asking me for any further information) but I do not have the energy to re-assign it back to the upstart upstream project.

Johan Kiviniemi (ion) wrote :

Nothing so far indicates you’re encountering a bug in Upstart.

The problem seems to lie with something in Ubuntu that’s supposed to get the lo interface up properly. Upstart’s upstream code has nothing to do with this.

Anand Kumria (wildfire) wrote : Re: [Bug 506727] Re: upstart fails to start system into multiuser mode

Hmm, well, as well as not being able to start up, I also find I cannot shut down!

I can switch to a console and press ctrl-alt-del and (I exec'd init
with -v --debug) see a lot of messages fly by; and then something
stops upstart from proceeding to shut down the system.

Likewise, I can press the power button: a lot of messages fly by, and the system
is still operational.

It seems that upstart is waiting for some kind of lock / event to
occur before allowing the shutdown to finish. My guess is that the
same lock is preventing successful startup.

I am stumped as to how to debug this at all; I'm not even able to log
the messages that upstart sends out. Redirection does not seem to be
working from the point of sulogin.

Is there a logging option, or anything further I can use?

Thanks,
Anand

Anand Kumria (wildfire)
affects: ubuntu → mountall (Ubuntu)
Anand Kumria (wildfire) wrote :

Some progress on this bug. With the (recent) update moving the PNG libraries into /lib (for Plymouth), I've been getting things going further.

Unfortunately tonight it all broke again -- so I decided to figure things out as much as possible.

I stripped /etc/init/ to a bare minimum: that provided by upstart and just mountall.conf

i.e.

eve:[~]% ls /etc/init/
control-alt-delete.conf rc-sysinit.conf rcS.conf tty2.conf tty4.conf tty6.conf
mountall.conf rc.conf tty1.conf tty3.conf tty5.conf upstart-udev-bridge.conf

And I still could not get booting to work.

I modified tty2.conf to start on starting (like mountall.conf), and modified the mountall.conf job to prefix the command with strace. It appears that upstart does not like ptracing through strace, and it hangs the job.

That was lucky, as I then manually ran the mountall program via strace (attached).

From what I see, mountall, even though it does mount the 'local' filesystems, does not actually regard them as being mounted. So it never emits any signal saying 'filesystem' and nothing else runs.

For some strange reason, even doing a manual "initctl emit filesystem" hangs too. Also, Plymouth has actually failed (don't know why but not my focus for this) to start.

Anand Kumria (wildfire) wrote :

Oh, forgot to mention. This is with mountall 2.4.

Plus I would occasionally see 'boredom_timeout' occur -- but not when run under strace.

Anand Kumria (wildfire) wrote :

With the recent upload of plymouth (0.8.0~8), I now get a display during bootup.

The last message plymouth displays before things cease is "Waiting for / [SM]"

Again, as far as I can tell mountall is failing to mount the local filesystems.

Johan Kiviniemi (ion) wrote :

Could you please get a new strace and a new output log now that plymouth is working better?

Anand Kumria (wildfire) wrote :

After some discussion with Johan, via IRC, he suggested another strace round since plymouth may have affected things.

I modified /etc/init/sulogin.conf and changed its 'start' stanza to:

eve:[~]% cat /etc/init/sulogin.conf
start on starting mountall
exec openvt -c 7 -w sulogin

Unfortunately this did not work; init would report:

"init: sulogin main process (335) terminated with status 7"

(335 is, I believe, the process id)

The intention was to have sulogin prevent mountall from running. And then perform an strace from start to finish.

As that did not happen, I would see messages like:
"mount: mounting none on /dev failed: No such device"

I have attached a 'ps aufxww', an strace, an lsof and the mountall.log from one of the boot sequences.

The same thing occurs under both a vanilla kernel and an Ubuntu one.

Anand Kumria (wildfire) wrote :

Interestingly, it looks as if fsck wants to send its output to file descriptor 13, which is not open (as you can see from the lsof output).

Anand Kumria (wildfire) wrote :

After another IRC conversation with Johan, I disabled sulogin.conf and ran mountall under strace (after removing the daemon line).

Attached is the resulting strace.

Anand Kumria (wildfire) wrote :

 /dev/mountall.log

Johan Kiviniemi (ion) wrote :

Scott,

Am I reading this right – does boredom_timeout end up asking the user to either skip the mount or to exit to maintenance shell, without providing the choice to keep waiting?

IMO mountall should give the prompt to skip or exit asynchronously while continuing to wait for the device in question. If the device appears before the user does anything, the prompt should just be removed.

Anand Kumria (wildfire) wrote :

As a data point, I noticed that e2fsck was upgraded last night, so I touched '/forcefsck'. This resulted in mountall outputting:

trigger_events: local 5/5 remote 0/0 virtual 11/11 swap 1/1

(eventually)

So, it really does seem like things are timing related. I have a 155Mb strace output (since it followed the e2fsck processes) in case that is useful -- but I will note that it appears that the filesystem event was still not triggered.
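
A trigger_events line like the one quoted above can be checked mechanically. This is a rough sketch, assuming each pair reads as "mounted/total" (an interpretation of the log line, not confirmed anywhere in this report); `all_mounts_done` is a hypothetical helper name.

```shell
# Succeed only if every "N/M" pair on a trigger_events line has N equal to M.
# $1 is a line such as:
#   trigger_events: local 5/5 remote 0/0 virtual 11/11 swap 1/1
all_mounts_done() {
    printf '%s\n' "$1" | awk '
        {
            for (i = 1; i <= NF; i++) {
                # Only fields of the form N/M split into exactly two parts.
                if (split($i, pair, "/") == 2 && pair[1] != pair[2]) exit 1
            }
        }'
}
```

For the line above every pair is complete, so the function succeeds. That matches the observation that the mounts finished while the filesystem event was still not emitted.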

Anand Kumria (wildfire) wrote :

Further information; I can cause things to advance if I manually do:

'initctl emit filesystem'

followed by

'initctl emit net-device-up IFACE=lo'

My issue looks similar to #497299, but that is related to karmic and not lucid, as Kees Cook indicates.

(and like that bug, I do not have a file called /etc/init/network-interface.conf - nothing seems to contain that file according to packages.ubuntu.com).
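
The two manual steps above can be wrapped in a small script. This is only a sketch of the workaround as described in this comment, not a fix; `emit_boot_events` and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```shell
# Emit the two events from the workaround, in the order given above.
# With DRY_RUN=1 the commands are printed instead of executed.
emit_boot_events() {
    for args in "filesystem" "net-device-up IFACE=lo"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "initctl emit $args"
        else
            # $args is deliberately unquoted so that the event name and
            # its IFACE=lo variable are passed as separate words.
            initctl emit $args || return 1
        fi
    done
}
```

Running `DRY_RUN=1 emit_boot_events` shows what would be run without touching init. Without `DRY_RUN`, it needs root and a running Upstart.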

Anand Kumria (wildfire) wrote :

Anything else I need to do to progress this?

Scott James Remnant (Canonical) (canonical-scott) wrote :

Thanks for all your debugging work, Anand. From the information you collected, I believe that Johan's hunch was right: your bug was that mountall was "locking up" waiting for your root filesystem and never continuing once it appeared.

This bug was fixed recently, so hopefully your problems should not be happening on current Lucid.

Changed in mountall (Ubuntu):
status: Confirmed → Fix Released
James Blackwell (jblack) wrote :

Has the fix made it to lucid? I'm still seeing this behavior as of 5-31.

Curtis Hovey (sinzui)
no longer affects: null