Ext4 corruption associated with shutdown of Ubuntu 12.10

Bug #1073433 reported by Ernie 07
This bug affects 59 people
Affects                    Status         Importance   Assigned to   Milestone
upstart                    Confirmed      Undecided    Unassigned
linux (Ubuntu)             Incomplete     High         Unassigned
network-manager (Ubuntu)   Fix Released   High         Unassigned
upstart (Ubuntu)           Confirmed      High         Unassigned

Bug Description

1. Format and label a target Ext4 partition using Ubuntu 12.04
2. Install the 64-bit 12.10 OS using that target without reformatting it
3. Shut down
4. Boot an alternate copy of Ubuntu
5. Restart, selecting the newly installed OS
6. Log in, then shut down
7. Boot an alternate copy of Ubuntu
8. Fsck the newly installed OS, allowing corrections to be made

Each time the newly installed OS is run and then shut down, even if the session consists only of logging on, a subsequent fsck will FAIL.
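
Whether the check in the last step passed or failed can be read from the exit status of e2fsck, which the e2fsck(8) manual page documents as a bit mask. The following is a minimal sketch and not part of the original report: the function name describe_fsck_status, the labels it prints and the device placeholder /dev/sdXN are invented for illustration.

```shell
#!/bin/sh
# Translate an e2fsck exit status into words. The bit values are the ones
# listed in the e2fsck(8) manual page; the labels are this sketch's own.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no-errors"
        return 0
    fi
    out=""
    for pair in 1:errors-corrected 2:reboot-recommended 4:errors-left-uncorrected \
                8:operational-error 16:usage-error 32:cancelled 128:shared-library-error
    do
        bit=${pair%%:*}
        label=${pair#*:}
        # Append the label for every bit that is set in the status.
        if [ $((status & bit)) -ne 0 ]; then
            out="$out $label"
        fi
    done
    echo "${out# }"
}

# Usage (hypothetical device name; run from the alternate installation while
# the 12.10 partition is not mounted; -n makes no changes to the file system):
#   e2fsck -fn /dev/sdXN; describe_fsck_status $?
```

Running the check with -n first keeps the partition untouched, so the same state can be inspected again before any repair is allowed.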

I used Acronis True Image Home 2013 to create an image of the newly installed 64-bit Ubuntu 12.10, so I can recreate the symptoms of Ext4 filesystem corruption 100% of the time by restoring from the image, booting, logging on and shutting down.

ProblemType: Bug
DistroRelease: Ubuntu 12.10
Package: linux-image-3.5.0-17-generic 3.5.0-17.28
ProcVersionSignature: Ubuntu 3.5.0-17.28-generic 3.5.5
Uname: Linux 3.5.0-17-generic x86_64
ApportVersion: 2.6.1-0ubuntu3
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: aguru 1871 F.... pulseaudio
 /dev/snd/controlC0: aguru 1871 F.... pulseaudio
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Tue Oct 30 22:24:54 2012
HibernationDevice: RESUME=UUID=f22e3fa5-c5c5-41f1-ae5a-49390547cb67
InstallationMedia: Ubuntu 12.10 "Quantal Quetzal" - Release amd64 (20121017.5)
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.
MachineType: System manufacturer P5Q-E
ProcEnviron:
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.5.0-17-generic root=UUID=ef2c78d5-783a-422a-88f7-27ec09dda0d1 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.5.0-17-generic N/A
 linux-backports-modules-3.5.0-17-generic N/A
 linux-firmware 1.95
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/06/2009
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2101
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: P5Q-E
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2101:bd04/06/2009:svnSystemmanufacturer:pnP5Q-E:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP5Q-E:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: P5Q-E
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Ernie 07 (ernestboyd) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.7 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. Please only remove that one tag and leave the other tags. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.7-rc2-raring/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
importance: High → Critical
tags: added: kernel-da-key needs-upstream-testing
Joseph Salisbury (jsalisbury) wrote :

Set importance to critical due to possible corruption.

Joseph Salisbury (jsalisbury) wrote :

Are you using any "Non-default" mount options?

Ernie 07 (ernestboyd) wrote :

In an effort to KISS and minimize regression testing, I reported a 100% repeatable bug. In my haste, I failed to indicate that the source ISO used to burn the LiveCD was the 64-bit version of Ubuntu 12.10, which was recently released to the public. After running the installation under the Try Ubuntu path, I performed a shutdown followed by a boot of an alternate version (12.04) of Ubuntu. An fsck -vf of the recently installed (12.10) partition indicated problems and I followed the prompts to repair the Ext4 file system.

Acronis True Image Home 2013 was used to create an image which could be restored quickly.

To reproduce the problem, I booted (12.10), logged in, waited a while (sometimes a few minutes) and then performed a shutdown followed by a boot of an alternate version (12.04) of Ubuntu. An fsck -vf of the recently installed (12.10) partition indicated problems and I followed the prompts to repair the Ext4 file system.

It would seem to me that critical data can be obtained from a 100% repeatable problem in a "known" environment. The symptoms might be masked in a different version of the kernel although the problem still exists.

Changed in linux (Ubuntu):
importance: Critical → High
Bernd Schubert (aakef) wrote :

Ernie, I see a lot of log files here, but somehow e2fsck logs seem to be missing. Any chance you have captured e2fsck messages or could recreate those?
And I entirely agree with you: in my opinion, just updating a recent stable kernel to a development version is not a real solution.

Thanks,
Bernd

Ernie 07 (ernestboyd) wrote :

When I checked /var/log/fsck, the two files appear unchanged from the original distribution on both the 12.10 and 12.04 OSes. I have attached a screenshot of the fsck output in case that would be helpful.

Christian Niemeyer (christian-niemeyer) wrote :

Filesystem corruption after shutdown with a clean standard installation. 100% confirmation. 100% reproducible.

But I guess it's not ext4 related. It's a dbus/networking problem with the shutdown scripts. However, nobody fixed it, though it was reported. Busy filesystem, busy scripts, unclean shutdown. Every time.

My system: AMD64, wired networking (forcedeth). Again, I think it's a dbus/networking/shutdown/upstart/initscripts problem. It gets triggered for some people and not for others. That's the strange part about it.

References:
https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1061639
https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1058987 (it says 'Fix released', but honestly I doubt it.)

Christian Niemeyer (christian-niemeyer) wrote :

Shutdown filesystem corruption in 12.10 stops for me after doing: sudo apt-get remove --purge dnsmasq-base resolvconf wpasupplicant isc-dhcp-client isc-dhcp-common libnm-glib-vpn1 libnm-glib4 libnm-gtk-common libnm-gtk0 libnm-util2 network-manager network-manager-gnome ubuntu-minimal ntp plymouth-label plymouth-theme-lubuntu-logo plymouth-theme-lubuntu-text plymouth-theme-ubuntu-text mobile-broadband-provider-info blueman bluez lubuntu-core lubuntu-desktop modemmanager obex-data-server ppp pppconfig pppoeconf rfkill wvdial mlocate

(I guess the mandatory ones are dnsmasq-base, resolvconf, isc-dhcp-*, network-manager-*)

jim warner (warnerjc) wrote :

I too have had this problem since upgrading to (not fresh installing) 12.10.

Under my wireless connection, when I uncheck "available to all users" for each of several users, I am able to shut down cleanly.

Of course, upon reboot the "available", "not connected" and then "connected" messages are a bit annoying.

I hope my experience may provide additional clues to this bug's ultimate demise.

Daniel J Blueman (watchmaker) wrote :

This looks to be the same issue as I was experiencing during 12.10 development:

http://old.nabble.com/ext4-recovery-deleted-orphans-on-reboot...-td34475175.html

Journal recovery occurs 100% of the time; the list of orphan inodes presumably depends on the amount of unlinking in the last 5 seconds before shutdown. Oddly enough, I don't observe this on my work desktop running Ubuntu 12.10, but I do see this on three laptops - also with Ubuntu 12.04. I'll double-check this.

Theodore Ts'o (tytso) wrote :

Those specific fsck corrections --- fixing the number of free blocks and the number of free inodes --- are completely normal and purely a cosmetic issue. There is nothing to worry about here.

What is going on is that ext4 no longer updates the superblock after every block and inode allocation; doing so caused a wasteful write cycle to the superblock at every single journal commit, and it was also an SMP scalability bottleneck for larger servers (i.e., with 32 or 64 CPUs). To fix this, we no longer update these values in the superblock at every commit. Instead, we only update these values when we unmount the file system, mainly for cosmetic purposes so that dumpe2fs shows the correct number of free inodes and blocks, and at mount time we calculate the total number of free blocks and inodes in the file system by summing the free blocks/inodes statistics for each block group. So in fact, ext4 does not depend on the correctness of the values in the superblock, but it does try to update them on a clean unmount.

In e2fsprogs commit id 2788cc879bbe6, which is in e2fsprogs 1.42.3 and newer, we changed things so that e2fsck -n would not display this as something "wrong". However, we still do show this as something that we "fix" when running e2fsck -y or -p, since in fact it is a change to the file system. See: http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=commit;h=2788cc879bbe667d28277e1d660b7e56514e5b30
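
The difference described above can be looked at directly by comparing the free-block count stored in the superblock with the sum of the per-group counts. What follows is a rough sketch only. It assumes the usual dumpe2fs text layout (an unindented header line beginning "Free blocks:" and, in each group section, an indented line of the form "N free blocks, M free inodes, ..."), which should be checked against the installed e2fsprogs version. The function name compare_free_blocks and the device placeholder are invented here.

```shell
#!/bin/sh
# Read dumpe2fs output on stdin and print the superblock's free-block count
# next to the sum of the per-group free-block counts.
# Assumed layout: one unindented "Free blocks: N" header line, and one indented
# "N free blocks, M free inodes, ..." line per block group.
compare_free_blocks() {
    awk '
        /^Free blocks:/             { sb = $3 }     # value cached in the superblock
        /^ *[0-9]+ free blocks,/    { sum += $1 }   # per-group statistics
        END { printf "superblock=%d groups=%d\n", sb, sum }
    '
}

# Usage (hypothetical device name, file system not mounted):
#   dumpe2fs /dev/sdXN 2>/dev/null | compare_free_blocks
```

If the explanation above holds, the two numbers can differ after an unclean shutdown and should agree again after a clean unmount or an e2fsck run that "fixes" the counts.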

No one else has complained or noticed up until now, because other distros apparently are capable of doing a clean shutdown, allowing the file system to be unmounted cleanly. Ubuntu, unfortunately, is incapable of reliably doing a clean shutdown even when users request it, which is why Ubuntu users are seeing this behavior much more frequently, and apparently some people have panicked as a result. Sigh....

Ernie 07 (ernestboyd) wrote :

For my environment (Ethernet DSL to Internet) a temporary workaround follows:

1. UNCHECK Enable Networking
2. Wait until after the disconnected message goes away
3. Restart and Shutdown
4. Fsck from an alternate installation will NOT throw any errors.

Apparently Unchecking Enable Networking does something that a simple restart or shutdown does NOT in terms of preparing for a SAFE shutdown.

Daniel J Blueman (watchmaker) wrote :

OK, I found this also on older desktops with rotational disks (all four machines mentioned earlier have SSDs) running Ubuntu 12.04.1.

As Ted points out, it looks like Ubuntu (Upstart?) has issues with shutdown, but could there be a race exposed by the superb speed that Upstart is executing the umount/remount-ro, disk-cache-flush and kernel-reboot vector sequence?

Luis Alvarado (luisalvarado) wrote :

Might this have anything to do with NetworkManager not connecting automatically or not detecting any connections until I disable the "Enable Networking" option, wait a couple of seconds and enable it again? The same goes for Wireless.

Tested, in case it is related, with Intel wired LAN connections (motherboards Intel DP35DP and Intel DZ68DB) and with Linksys WMP300N, Linksys WMP600 and Realtek Gigalan (forgot the model). With all of them I need to "reset" the network as mentioned above.

misiu_mp (misiu-mp) wrote :

To clarify, as it is not completely apparent from the above discussion:
The repairs reported by fsck are not caused by corruption, but are harmless and purely cosmetic fixes. The reason is that, to avoid performance bottlenecks, ext4 does not update the superblock after each inode or block (de)allocation. This is done on (clean) unmount instead, and only to make it look good. The filesystem does not rely on this information.
The real bug is of course Ubuntu not shutting down cleanly, and thus not performing the umount.

Then again if this is not an error in the fs, then maybe fsck shouldn't prevent the system from cleanly booting.

Theodore Ts'o's take on it:
https://plus.google.com/117091380454742934025/posts/JmpczpdwgrQ

misiu_mp (misiu-mp) wrote :

Hmm, somehow I missed that the actual Theodore Ts'o already commented on this here. Oops.

Still, though, if this is not an error in the fs, then fsck shouldn't prevent the system from cleanly booting.

Ernie 07 (ernestboyd) wrote :

I have attempted to focus on a repeatable error condition. Essentially, a fsck via an alternate copy of Ubuntu would ALWAYS produce errors following simple behavior. Boot, logon, shutdown.

If I keep the Enable Networking option unchecked, the error NEVER occurs. Therefore it seems reasonable to conclude that networking functionality is broken.

Francisco Reverbel (reverbel) wrote :

I ran into this bug and confirm that the suggested workaround is effective. The problem does not show up if I uncheck "Enable Networking" before shutting the system down.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in network-manager (Ubuntu):
status: New → Confirmed
xavier vilajosana (xvilajosana) wrote :

I'm facing the same problem. I read that disabling networking before turning off the computer makes the problem disappear; however, after two days of turning off the networking service with everything working fine, I've run into the same issue, even with networking disabled.

I am using gnome-shell and I don't have a button to disable networking but I switch the service off instead
sudo service networking stop
right before shutting down the system.

Several errors appear when I stop the networking service:

1. gnome-shell crashes and I lose the window borders, including the buttons to close them.
2. The top bar also disappears, so I need to stop the machine with
sudo shutdown now

This process fails when trying to stop services and drops me into a root terminal.
To stop the machine I have to run

reboot now

and turn off the machine manually when it starts.

It is becoming urgent to solve this issue, as every 2 days I need to boot from a USB drive and force a superblock correction using this tutorial:

http://www.cyberciti.biz/faq/recover-bad-superblock-from-corrupted-partition/

I've been using Ubuntu since 7.04 and I am one step away from switching completely to another OS. I cannot work with so many problems.

Ernie 07 (ernestboyd) wrote :

The workaround of disabling networking becomes unavailable EACH time POORLY maintained Nvidia drivers randomly cause 12.10 to crash requiring a power cycle to recover.

Will this BUG be fixed before 13.04 or should I AVOID 12.10 and continue to use 12.04?

Daniel J Blueman (watchmaker) wrote :

@Joseph

I've tested with various v3.6 and v3.7 mainline kernels, along with Ubuntu kernels, all with default mount options; I still observe unclean filesystem messages:

$ dmesg
...
EXT4-fs (sda2): INFO: recovery required on readonly filesystem
EXT4-fs (sda2): write access will be enabled during recovery
...
EXT4-fs (sda2): recovery complete
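
Messages like these can be picked out of a kernel log mechanically. A small sketch, assuming the message wording shown in the excerpt above; the function name ext4_recovered_devices is invented here.

```shell
#!/bin/sh
# Print each device for which the kernel log (read from stdin) reports a
# completed ext4 journal recovery, i.e. the previous unmount was not clean.
ext4_recovered_devices() {
    sed -n 's/.*EXT4-fs (\([^)]*\)): recovery complete.*/\1/p' | sort -u
}

# Usage:
#   dmesg | ext4_recovered_devices
```

An empty result after a boot would suggest that no journal replay was needed for any ext4 file system mounted so far.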

Users likely mis-correlate NetworkManager with the issue, since it changes the upstart race condition timing; most likely, this is an upstart issue, as I believe the kernel has the correct behaviour, thus it would be inappropriate to add the "kernel-bug-exists-upstream" tag.

What's next?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: removed: needs-upstream-testing
Daniel J Blueman (watchmaker) wrote :

We need to remove the network-manager project association, as it is just circumstantial.

Changed in network-manager (Ubuntu):
status: Confirmed → Invalid
Changed in upstart:
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Add upstart task for Ubuntu.

Changed in upstart (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Has anyone affected by this bug had a chance to test 13.04(Raring)? It would be good to know if this issue exists there as well, or if it is limited to 12.10.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in upstart (Ubuntu):
status: New → Confirmed
Richard Samson (richard) wrote :

The issue has been gone for a month on a new installation of Raring 13.04.

Ernie 07 (ernestboyd) wrote :

This bug STILL occurs in 64-bit 13.04 (Raring) with kernel 3.8.0-0-generic.
Downloaded and tested yesterday, 2013-01-14.

Richard Samson (richard) wrote :

This issue has been occurring again for the past week.

Ernie 07 (ernestboyd) wrote :

Richard,

This bug exists in 12.10 and 13.04.

In order to avoid file system corruption while using 12.10 or 13.04, you MUST disable networking before:

1. A system crash (if you have Nvidia hardware, this will be IMPOSSIBLE due to extremely poor drivers).
2. An orderly shutdown/restart.

Another option is to boycott 12.10 and 13.04 until the problem is resolved.

Russell Faull (rfaull) wrote :

This bug should be generalised to other file systems. It occurs using xfs and jfs, as well as ext4. In my experience the fs is not relevant, except some recover from an unclean shutdown better than others. (It's easy to try different file systems using fsarchiver, don't forget to change fstab to the new fs before reboot.)

Ernie 07 (ernestboyd) wrote :

This bug STILL occurs in 64-bit 13.04 (Raring) with kernel 3.8.0-2-generic.
Downloaded and tested today, 2013-01-27.

Francisco Reverbel (reverbel) wrote :

The thing actually got worse in Quantal, as the workaround became ineffective after a recent update. Now fsck runs on each and every boot, even if "Enable Networking" is unchecked before shutdown.

Is anybody else seeing this behavior?

This is the Quantal kernel I am currently running:

$ uname -a
Linux skinny 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Russell Faull (rfaull) wrote :

Does anyone else see two lines about 'hub_port_status failed error (-110)' just before shutdown and immediately following 'mount: / is busy'? These errors always occur on several computers using different filesystems. (None of the workarounds mentioned above or in bug #1061639 resolve the problem of fsck/log replay on next boot.)

Is there a way to kill all usb processes before shutdown to try and determine if the usb system is interfering with the clean unmount of the filesystem?

I'm using kernel 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:05:29 UTC 2013 i686 i686 i686 GNU/Linux

This may need a separate bug report, if the usb system is a possible cause and other filesystems also run fsck/log replay.

jim warner (warnerjc) wrote :

I agree that this has gotten worse recently and occurs on every boot.

That, along with the apparent lack of 'interest in/progress on' a solution has really affected my opinion of Ubuntu and confidence in quantal.

My kernel:
3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Ernie 07 (ernestboyd) wrote :

Considering the published intention to have some form of 13.04 running on a Nexus7, I would say that it is FANTASY to expect much to be fixed in 12.10 with the exception of fixes to 13.04 that can be directly backported to 12.10.

Russell Faull (rfaull) wrote :

@Jim Warner, does your suggestion at #10 still work for you? It works for me if I uncheck all connections: wired, wireless and mobile.

jim warner (warnerjc) wrote :

@Russell Faull, sadly no. That hasn't worked for me since the problem recurred (a few weeks ago, as I recall).

Fedora (spherical cow) is looking better and better...

Clint Byrum (clint-fewbar) wrote :

We saw issues like this in Ubuntu 11.10 as well, and it was resolved by figuring out what is left running just before shutdown.

If you can edit /etc/init.d/umountroot and add this, just before the line starting with ' mount', which on my 12.10 system is line 86:

/usr/sbin/lsof -n > /last-shutdown-lsof

(You may need to sudo apt-get install lsof)

This will record all open files just before root is remounted. Then after verifying that the FS was detected as dirty (please, stop calling it corrupt, it is not corrupt, just dirty) and fsck was run, upload the file /last-shutdown-lsof to this bug and we can take a look at it.

(please check the content of that file. I don't think it will have any sensitive data in it, but please check before uploading as this bug is public).
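
Once /last-shutdown-lsof exists, the entries most likely to matter are regular files held open for writing, since an open-for-write file is one thing that keeps a file system from being remounted read-only. A minimal filter sketch follows. It assumes lsof's default column order (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME) and paths without spaces; the function name writable_open_files is invented here.

```shell
#!/bin/sh
# Read `lsof -n` output on stdin and list regular files that are open for
# writing, as "COMMAND PID NAME".
# Assumes the default column order and file names that contain no spaces.
writable_open_files() {
    # FD values such as "15w" or "3u" mean open for write or read/write;
    # entries like "cwd", "txt" and "mem" are skipped by the digit test.
    awk '$5 == "REG" && $4 ~ /^[0-9]+[uw]/ { print $1, $2, $NF }'
}

# Usage, on the file produced by the line added to /etc/init.d/umountroot:
#   writable_open_files < /last-shutdown-lsof
```

The short list this prints is also easier to check for sensitive data than the full lsof dump.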

Judging from the reports, I doubt very much that this has anything to do with the kernel other than Ted Ts'o's suggestion that the kernel is simply exposing the dirty filesystem.

Steve Langasek (vorlon)
Changed in network-manager (Ubuntu):
status: Invalid → Triaged
importance: Undecided → High
tags: removed: kernel-da-key
Max (m-gorodok) wrote :

The cause of the ureadahead issue is upstart. The pure case,
unrelated to ureadahead, is bug LP: #1181789.
In some cases upstart might hold log files
open for writing on behalf of other daemons.

John Clark (clarkjc) wrote :

I added the following line to /etc/init/ureadahead-other.conf to disable logging for this particular Upstart job:

console none

This prevents the log file from keeping the file system busy. The only downside is that anything logged by ureadahead-other goes to /dev/null instead.
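
The edit described above can be applied idempotently, so that running it twice does not add the stanza twice. This is a sketch under assumptions: the helper name ensure_console_none is invented, it only recognises an existing stanza written exactly as "console none" on its own line, and it assumes the job file ends with a newline.

```shell
#!/bin/sh
# Add a "console none" stanza to an Upstart job file unless the file already
# contains that exact line.
ensure_console_none() {
    job_file=$1
    if ! grep -qx 'console none' "$job_file"; then
        printf 'console none\n' >> "$job_file"
    fi
}

# Usage (as root, after making a backup copy of the job file):
#   ensure_console_none /etc/init/ureadahead-other.conf
```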

John Clark (clarkjc) wrote :

P.S. You will have to reboot 2 times after adding "console none" for it to work.

Clint Byrum (clint-fewbar) wrote :

This does seem like a bug in upstart. It seems to me that there needs to be a command to say "upstart, close all of your log files and do not reopen them" so that one can remount / readonly. Systems may have things that want to keep running right up until poweroff/reboot, but that make use of 'console log'.

Ivan Larionov (xeron-oskom) wrote :

Finally did a workaround of this bug with:
1) killing dhclient on umountfs step
2) /etc/init/ureadahead-other.override with "manual" start

AFAIK this bug has existed since 12.10 and I have no idea why it still isn't fixed.

Vedran Rodic (vrodic) wrote :

In my case the bug with unclean shutdown happens only when my machine (Thinkpad X230) is docked to the Thinkpad ultrabase when shutting it down.

When I shutdown outside of a dock, everything is fine. I don't use ureadahead (have SSD), doesn't matter if there are mounted network filesystems or not, if NetworkManager is running or not.

Steve Dodd (anarchetic) wrote :

My problems (on current saucy) were caused by bugs in upstart (affecting ureadahead) and network-manager. The patches in bug #1181789 and bug #1169614 give me a clean unmount and shutdown.

Dmitry Kasatkin (dmitry-kasatkin) wrote :

I have the same problem on 13.04, which is solved by 2 steps (as mentioned above):
1) uninstalling ureadahead or adding "console none" to /etc/init/ureadahead-other.conf
2) killing dhclient in umountfs.

Indeed, why has it not been fixed for "years"?

Vedran Rodic (vrodic) wrote :

Dmitry, I confirm your solution. I already uninstalled ureadahead (no need for it with an SSD).

I added killall dhclient to my /etc/init.d/umountfs (at the beginning of the do_stop function).
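
The dhclient step could be made a little gentler than a bare killall by asking the process to exit first and forcing it only if it is still there. A sketch under assumptions: the helper name stop_lingering and the one-second grace period are arbitrary choices, and it relies on pkill (from procps) being installed.

```shell
#!/bin/sh
# Ask every process with exactly the given name to exit, then force any
# survivor. Returns 0 whether or not anything was running.
stop_lingering() {
    name=$1
    # pkill exits non-zero when no process matched; nothing to do in that case.
    pkill -TERM -x "$name" 2>/dev/null || return 0
    sleep 1
    pkill -KILL -x "$name" 2>/dev/null || true
}

# Usage, e.g. at the beginning of do_stop in /etc/init.d/umountfs:
#   stop_lingering dhclient
```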

This problem happens for me only when I use the regular wired ethernet on my ThinkPad X230 (not just when the laptop is docked to the UltraBase as I've reported earlier) . It doesn't happen when I'm using wireless.

Vedran Rodic (vrodic) wrote :

I've tried the patches mentioned in comment #137, but they didn't help.

Christian Niemeyer (christian-niemeyer) wrote :

This still happens in 13.10 (saucy). This time I installed the Beta-2 of Lubuntu (thus using an lxsession). Reboot or shutdown fails every time. It hangs for around ten seconds, then it reboots and fsck shows "deleted orphaned inode".

The only thing that helps is uninstalling network-manager-*, nm-*, modemmanager and ureadahead.

How come this hasn't been fixed yet? It is reproducible all the time for me. (My router is fine and so is the network. I had problems with my old router and dnsmasq in 12.04, but no unclean shutdowns.)

Ivan Larionov (xeron-oskom) wrote :

Yeah, still exists in 13.10.

Robstarusa (rob-naseca) wrote :

I'm seeing this in 13.10 as well.

Christian Niemeyer (christian-niemeyer) wrote :

It turns out that the problem did *not* exist after a recent clean install of 13.10 (64-bit Desktop CD) on a friend's notebook, while it still happens on my desktop PC.

Differences:

On the notebook we used wireless (b43 out of the box) internet during installation. Reboot into new system, login, shutdown is clean.

On my desktop I have no wireless card at all. I use wired connection during installation. Reboot into new system, login, using system, shutdown is unclean.

I haven't double-checked this, though. But it may be a hint that this problem occurs on machines with no wireless capability.

Steve Dodd (anarchetic) wrote :

That sounds plausible - I would guess wireless connections are usually torn
down at the end of the user session (i.e. logout) whereas I assume wired
connections persist right to system shutdown??
On Oct 21, 2013 3:01 PM, "Christian Niemeyer" <email address hidden>
wrote:

> It occurs that the problem did *not* exist after a recent clean install
> of 13.10 (64bit Desktop CD) on a friend's notebook. While it still
> happens on my desktop PC.
>
> Differences:
>
> On the notebook we used wireless (b43 out of the box) internet during
> installation. Reboot into new system, login, shutdown is clean.
>
> On my desktop I have no wireless card at all. I use wired connection
> during installation. Reboot into new system, login, using system,
> shutdown is unclean.
>
> I haven't double-checked this though. But it maybe is a hint, that this
> problem occurs on machines with no wireless possibilities.

gweg (gweg) wrote :

I did a bit of hacking on init.d/umountroot, adding lsof and ps -ef after the remount fails.
I could see that dhclient was still running, so I added before the remount: pkill -9 dhclient

After this change, dhclient was gone, but the remount still failed. In lsof output I can see:

init 1 root 15w REG 8,24 1134 438383 /var/log/upstart/mountall.log

It seems like there is a problem in the upstart init where it is not closing files, besides the problem with dhclient.

gweg (gweg) wrote :

Sorry, forgot to add version info to #146:
Ubuntu Saucy 32-bit, package version: upstart 1.10-0ubuntu7 i386

Steve Langasek (vorlon) wrote :

On Wed, Oct 23, 2013 at 10:15:46PM -0000, Gregor Larson wrote:
> init 1 root 15w REG 8,24 1134 438383
> /var/log/upstart/mountall.log

mountall is a service that's supposed to run once at boot and then exit. If
mountall is still running when you shut the system down, then you probably
have a problem in your /etc/fstab (non-existent devices).

We could safeguard against this by making the mountall job exit when we
switch to runlevel 0 or 6. Could you please file a bug against the mountall
package for this issue?

> It seems like there is a problem in the upstart init where it is not
> closing files, besides the problem with dhclient.

There are many possible causes for the filesystem being held writable at
shutdown; it's best to identify each of these and address them individually,
rather than trying to track them all on a single metabug.

Clint Byrum (clint-fewbar) wrote :

Excerpts from Steve Dodd's message of 2013-10-21 16:16:29 UTC:
> That sounds plausible - I would guess wireless connections are usually torn
> down at the end of the user session (i.e. logout) whereas I assume wired
> connections persist right to system shutdown??

In theory they're brought down when network-manager is stopped. In
practice they may leave lingering bits briefly after that.

Alexander (lxandr) wrote :

Guys, come on!
What the heck network-manager and network connections are you talking about?!
This really pissed me off already!
As I've said earlier, the problem is NOT in network-manager!
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1073433/comments/105
because it appears even when network-manager is not installed.
I can assume (or even be sure) that network-manager HAS some bug(s) associated with this problem, but the main cause is not network-manager. And I'm sure it's in upstart. Believe me, I've spent a lot of time trying to debug this problem...
You can see my comments (and debug logs) about this problem above.

That was the boiling point.
I've moved to Debian.

Steve Langasek (vorlon) wrote :

So when I wrote 6 months ago that:

> If you can reproduce this issue, please file a new bug report against
> the sysvinit-utils package with details. It is certainly unrelated to the
> common issue being described here.

Rather than doing this to help yourself, you switch distros, decide that upstart is to blame for a part of the system that is clearly managed by another package, and stay subscribed to the bug so that you can yell at people who are experiencing the bug that was originally reported - a bug that is unrelated to the issue that you were experiencing?

This bug tracker is for helping users resolve bugs in Ubuntu. If you're not using Ubuntu, and you're not helping fix the bugs, your comments are not needed here.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Excerpts from Alexander's message of 2013-10-25 05:39:16 UTC:
> Guys, come on!
> What network-manager and network connections are you even talking about?!
> This has really pissed me off.
> As I said earlier, the problem is NOT in network-manager:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1073433/comments/105
> It appears even when network-manager is not installed.
> I can assume (or even be sure) that network-manager has some bug(s) associated with this problem, but network-manager is not the main cause. I'm sure the main cause is in upstart. Believe me, I've spent a lot of time trying to debug this problem...
> You can see my comments (and debug logs) about this problem above.
>
> That was the boiling point.
> I've moved to Debian.
>

Alexander, I was mistaken, that is all. Good luck in the future, and please
come back when you have helpful, constructive comments.

(BTW the log files that are open are being held open not by upstart
directly, but by processes that are refusing to die and thus refusing to
close their stdout. Upstart would be in error if it were to just close
these log files while the process is still wanting to write to them.)

Revision history for this message
Christian Niemeyer (christian-niemeyer) wrote :

Regardless of that one comment above, it is really frustrating if the users of Ubuntu now get blamed for not filing the correct bug. Dear folks at Canonical, you have so much information in this one thread about how to reproduce this error. I'm not that experienced. So please, could someone reproduce this error (take a machine without any wireless capabilities) and then file the correct bug, if this bug report here is not "good enough" for you?

It's pretty amazing that a bug that causes **filesystem errors** (minor or not) does not get fixed at all, and now the users get partly blamed for not filing the bug against the correct package? Hello? A default installation causes filesystem errors, reproducibly. Would someone please take the hints in this thread seriously? For example: install on a machine with a wired connection and no wireless adapters. I can reproduce this 100% of the time.

My fix is (taken from this thread): "console none" in /etc/init/ureadahead-other.conf OR uninstalling ureadahead altogether.

And, more importantly, adding the following lines at the beginning of /etc/init.d/umountfs:

service networking stop
sleep 1
service networking start
sleep 1
service networking stop
sleep 1
killall dhclient
sleep 1

That worked for me. I don't know what package to file this bug against, because it seems to arise from certain hardware and the interaction between different packages. I'm not blaming upstart at all. I just want this CRITICAL bug to be fixed. It has been around since 12.04/12.10. That's "amazing".

Revision history for this message
Andrej Mernik (r33d3m33r-deactivatedaccount) wrote :

I have tried some workarounds from the comments and nothing seems to work. Fsck still runs at every boot. Bootchart included.

Revision history for this message
Bernd (midox) wrote :

Because some processes respawn, I think they are (re)started even during shutdown,
so they are still running when / is remounted read-only,
and that is why it fails.

I think upstart should ensure that all processes are killed (including the respawning ones) at the moment we mount / read-only on halt or reboot.

My workaround (a dirty hack) for the moment
is adding a killall5 -9 just before line 86 of /etc/rc6.d/S60umountroot.
That works for me and gives no fscks on my next (re)boots.

Conclusion: for me it's not an NM or kernel failure;
it's just that the shutdown procedure is handled the wrong way by mixing upstart and sysv initscripts.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Excerpts from Bernd's message of 2014-01-05 21:37:21 UTC:
> Because some processes respawn, I think they are (re)started even during shutdown,
> so they are still running when / is remounted read-only,
> and that is why it fails.
>
> I think upstart should ensure that all processes are killed (including the
> respawning ones) at the moment we mount / read-only on halt or reboot.
>
> My workaround (a dirty hack) for the moment
> is adding a killall5 -9 just before line 86 of /etc/rc6.d/S60umountroot.
> That works for me and gives no fscks on my next (re)boots.
>
> Conclusion: for me it's not an NM or kernel failure;
> it's just that the shutdown procedure is handled the wrong way by mixing upstart and sysv initscripts.
>

If you kill everything then the plymouth screen will go away, NFS root may
fail, etc. There are other reasons it works the way it does. What is
needed is a better mechanism to notify the user of what is going on and
help them deal with it, and to also report those situations as bugs so
we can deal with them.

Revision history for this message
Bernd (midox) wrote :

You are definitely right in that case.
A bug is already open (Bug #1073433), as I am writing a comment on it.
The normal user wants a clean shutdown or reboot, and as I wrote, it's a dirty hack until this problem is resolved by the maintainers.

We know the shutdown process isn't working correctly, as I can see 154 comments on this issue,
and the S60umountroot step definitely comes just before the reboot/shutdown step.

As I understand it:
S20sendsigs should stop all running (upstart) processes; after that comes
S31umountnfs.sh, which runs before
S40umountfs, which runs before
S60umountroot, which expects that there are no more running processes (or open files) left (otherwise we can't remount or unmount), and last of all (on a normal system nothing is between them) runs
S90reboot

If there are processes still running at the moment root is remounted or unmounted at shutdown or halt, then fsck will complain on the next boot.

You can check this by adding /bin/ps aux >> /ps_schutdown.log just before line 86 in the /etc/rc6.d/S60umountroot script; you will then see that there are running processes just before the umount happens. Sometimes it is NM (with dhclient, dnsmasq, etc.), but not always.
I even tried adding a sleep 60 there to give the running processes some time to end, but that didn't work either because of the respawning ones.

A killall5 -9 at that point shouldn't hurt, because NFS and the other filesystems (tmpfs, /run and so on) should already be unmounted.
Plymouth may be the culprit, because it stays alive until the bitter end, so we have to kill it at this point.
I don't mind having a black screen for a second or two just before the computer powers off or reboots.

An option could be to load plymouth into memory on shutdown so that there is no disk access on shutdown.

Just my 2 cents on this.
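
The ordering described above can be checked on a given machine rather than assumed. A minimal sketch, assuming the runlevel links live in a directory such as /etc/rc6.d (the usual sysv-rc layout); the helper name rc_order is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the K* and S* links of a runlevel directory
# in the order sysv-rc runs them (all K* first, then S*, each by number).
# A byte-wise sort of the link names yields that sequence, since "K"
# sorts before "S" and the two-digit prefixes sort numerically.
rc_order() {
    for f in "$1"/[KS][0-9][0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done | LC_ALL=C sort
}
```

Running `rc_order /etc/rc6.d` shows whether anything is scheduled between sendsigs and umountroot on that system.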

Revision history for this message
Øyvind Stegard (oyvindstegard) wrote :

In my case it's a dhclient process that likely respawns and prevents remount to read-only of root fs, due to a lease file opened for writing under /var/lib/NetworkManager/.. Result is unclean shutdown and recovery of dirty root file system on next boot.

Attaching lsof output obtained just before call to umount in /etc/init.d/umountroot.
As far as I can see, the only process having a regular file opened for writing on the root file system is dhclient:
dhclient 1317 root 4w REG 8,5 578 530637 /var/lib/NetworkManager/dhclient-90ac51b1-a118-416f-a126-0ad83a2c7b9c-eth0.lease

Clean Ubuntu 13.10 installation. This has certainly become an annoying and long lasting bug now.
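
The check above (looking for regular files held open for writing right before the root remount) can be reduced to a small filter over lsof output. This is only a sketch: the function name writers_only is hypothetical, it assumes the default lsof column layout (COMMAND PID USER FD TYPE ...), and it will misreport file names that contain spaces.

```shell
#!/bin/sh
# Hypothetical helper: read default-format lsof output on stdin and print
# "COMMAND PID FD NAME" for regular files whose FD column carries access
# mode "w" (write) or "u" (read and write), e.g. "4w" or "7u".
writers_only() {
    awk '$5 == "REG" && $4 ~ /^[0-9]+[wu]/ { print $1, $2, $4, $NF }'
}

# Possible use just before the remount (the log path is an assumption):
#   lsof / 2>/dev/null | writers_only >> /shutdown-writers.log
```

Entries such as cwd, txt and mem in the FD column are skipped, since they do not represent descriptors opened for writing.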

Revision history for this message
Øyvind Stegard (oyvindstegard) wrote :

I'd guess the patch in bug 1169614 would help in my case (dhclient process). Any progress on evaluating and possibly including the patch provided in that bug?

Revision history for this message
Benny (benny-malengier) wrote :

Lennart of systemd refers to this bug on Google+. He outlines a fix for the simple case: https://plus.google.com/115547683951727699051/posts/LjkLwkeDiLc

Revision history for this message
Steve Dodd (anarchetic) wrote :

This does seem to be getting kind of embarrassing. With modern journalled
filesystems on relatively straightforward hardware configs an unclean
shutdown shouldn't be the end of the world (after all, power failures
can happen), but it's not "nice" either.

Unfortunately we also seem to have a hell of a lot of noise on Launchpad
about this, people conflating different issues and causes, not reading
previous posts properly, etc., etc.

I'll be upgrading to 14.04 from 12.04 for main machines when it comes
out, if the problem's still present then I will have another look. Last
time I did, my own problems were caused by dhclient and ureadahead.. I
can't remember which if any of those have now been fixed. If nothing
else, pushing those fixes out will show if there are other outstanding
shutdown problems affecting a lot of users.

On Wed, Jan 22, 2014 at 09:19:13AM -0000, Benny wrote:
> Lennart of systemd refers to this bug on Google+. He outlines a fix for
> the simple case:
> https://plus.google.com/115547683951727699051/posts/LjkLwkeDiLc

Revision history for this message
Max (m-gorodok) wrote :

Steve Dodd:
> Last time I did, my own problems were caused by dhclient and ureadahead..

It is not a ureadahead issue; it is an extra fork in upstart to launch a shell for ureadahead if more than one partition is mounted.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Wed, Jan 22, 2014 at 09:19:13AM -0000, Benny wrote:
> Lennart of systemd refers to this bug on Google+. He outlines a fix for
> the simple case:

The fix he outlines is not for this bug. It's not for a bug we have in
upstart in Ubuntu at all; we already reliably ensure telinit u on upgrade of
all of upstart's library dependencies, which are finite and accounted for.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Excerpts from Steve Langasek's message of 2014-01-22 16:51:06 UTC:
> On Wed, Jan 22, 2014 at 09:19:13AM -0000, Benny wrote:
> > Lennart of systemd refers to this bug on Google+. He outlines a fix for
> > the simple case:
>
> The fix he outlines is not for this bug. It's not for a bug we have in
> upstart in Ubuntu at all; we already reliably ensure telinit u on upgrade of
> all of upstart's library dependencies, which are finite and accounted for.
>

I feel like he outlined two bugs. That one, I agree, is handled and
"meh".

The other one is the one that would sweep up the mess we occasionally
see when something misbehaves.

I'd like to see Ubuntu's shutdown do more to protect against that
failure mode.

Revision history for this message
Bernd Schubert (aakef) wrote :

On 01/22/2014 05:51 PM, Steve Langasek wrote:
> On Wed, Jan 22, 2014 at 09:19:13AM -0000, Benny wrote:
>> Lennart of systemd refers to this bug on Google+. He outlines a fix for
>> the simple case:
>
> The fix he outlines is not for this bug. It's not for a bug we have in
> upstart in Ubuntu at all; we already reliably ensure telinit u on upgrade of
> all of upstart's library dependencies, which are finite and accounted for.

Why wouldn't switching to an independent file system (tmpfs/initramfs)
and a shutdown init process help? That way you can kill all processes
without exceptions. You can even entirely unmount the old root, no need
for remounting it read-only anymore.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Wed, Jan 22, 2014 at 05:11:03PM -0000, Clint Byrum wrote:
> The other one is the one that would sweep up the mess we occasionally
> see when something misbehaves.

> I'd like to see Ubuntu's shutdown do more to protect against that
> failure mode.

I would, too, but I don't agree that the method he proposes actually does
this. Killing processes and unmounting devices in a loop is basically what
we do already; the key difference is that some filesystems - potentially
even including the root filesystem - may require additional daemon processes
for their operation. This is the case for example if you have network
filesystems mounted and are using NetworkManager, or if you use
gss-encrypted NFS, or iscsi. So "kill all processes and unmount all
filesystems in a loop" is not a reliable shutdown mechanism, it just moves
the problem cases somewhere that Lennart apparently isn't seeing them.

One of the problems we've seen repeatedly with trying to get clean shutdown
involves NetworkManager's child processes *being* killed while they're still
needed as part of managing the network. This is not a bug that's fixed by
killing more processes.

There may be other failure scenarios that need to be addressed. Part of the
problem has been a lack of information about what's actually holding the
root filesystem open in these cases. There's a pending merge proposal on
sysvinit that should help us gather this information.
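
In the meantime, the same kind of information can be collected by hand without extra tools by walking /proc. A rough sketch, assuming a Linux /proc that exposes fdinfo entries; the function names are hypothetical, and this is not the sysvinit merge proposal mentioned above:

```shell
#!/bin/sh
# Succeed if an octal open(2) flags value, as printed on the "flags:" line
# of /proc/PID/fdinfo/N, has access mode O_WRONLY (1) or O_RDWR (2).
opened_for_write() {
    mode=$(( $1 & 3 ))   # the leading 0 makes the shell parse it as octal
    [ "$mode" -eq 1 ] || [ "$mode" -eq 2 ]
}

# Print "PID FD PATH" for every regular file currently open for writing.
# Needs root to see other users' processes; entries that vanish while
# scanning are skipped.
list_writers() {
    for fdinfo in /proc/[0-9]*/fdinfo/[0-9]*; do
        flags=$(awk '$1 == "flags:" { print $2 }' "$fdinfo" 2>/dev/null) || continue
        [ -n "$flags" ] || continue
        opened_for_write "$flags" || continue
        fd=${fdinfo##*/}
        pid=${fdinfo#/proc/}
        pid=${pid%%/*}
        target=$(readlink "/proc/$pid/fd/$fd" 2>/dev/null) || continue
        if [ -f "$target" ]; then
            echo "$pid $fd $target"
        fi
    done
}
```

The output is not limited to the root filesystem; filtering the paths by mount point is left out to keep the sketch short.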

Revision history for this message
Steve Dodd (anarchetic) wrote :

On Wed, Jan 22, 2014 at 04:13:19PM -0000, Max wrote:
> Steve Dodd:

> > Last time I did, my own problems were caused by dhclient and ureadahead..
>
> It is not a ureadahead issue; it is an extra fork in upstart to launch
> a shell for ureadahead if more than one partition is mounted.

Yes, I know - I was summarizing!

On Wed, Jan 22, 2014 at 04:51:06PM -0000, Steve Langasek wrote:
> On Wed, Jan 22, 2014 at 09:19:13AM -0000, Benny wrote:

> > Lennart of systemd refers to this bug on Google+. He outlines a fix for
> > the simple case:
>
> The fix he outlines is not for this bug. It's not for a bug we have in
> upstart in Ubuntu at all; we already reliably ensure telinit u on upgrade of
> all of upstart's library dependencies, which are finite and accounted for.

Good to know, thank you.

On Wed, Jan 22, 2014 at 06:21:52PM -0000, Steve Langasek wrote:

[..]
> There may be other failure scenarios that need to be addressed. Part of the
> problem has been a lack of information about what's actually holding the
> root filesystem open in these cases. There's a pending merge proposal on
> sysvinit that should help us gather this information.

I had been going to suggest this a while back - automated apport
reporting of unclean shutdowns, with as much cause information as
possible?

I will try to do something constructive like boot into trusty and make
sure my personal issues have been resolved (I've not looked at this for
a few months.) I'm updating my images as we speak.

Steve

Revision history for this message
Ivan Larionov (xeron-oskom) wrote :

14.04 — it's still a problem (dhclient issue).

dino99 (9d9)
tags: added: trusty
Revision history for this message
Ivan Larionov (xeron-oskom) wrote :

Looks like bug #1169614 was finally fixed.

But there's one more bug which causes the same problem: bug #1307008

Changed in network-manager (Ubuntu):
status: Triaged → Fix Released
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in upstart:
status: Confirmed → Invalid
Changed in upstart (Ubuntu):
status: Confirmed → Invalid
Steve Langasek (vorlon)
Changed in linux (Ubuntu):
status: Invalid → Incomplete
Changed in upstart:
status: Invalid → Confirmed
Changed in upstart (Ubuntu):
status: Invalid → Confirmed
Displaying the first 40 and last 40 of 170 comments.