fsck freezes on laptop

Bug #124773 reported by Reuben Firmin
Affects                       Status     Importance  Assigned to            Milestone
e2fsprogs (Ubuntu)            Invalid    High        Canonical Kernel Team
linux (Ubuntu)                Expired    Undecided   Unassigned
linux-source-2.6.22 (Ubuntu)  Won't Fix  Undecided   Unassigned

Bug Description

Binary package hint: e2fsck-static

fsck freezes during the auto-check of my laptop's hard drive. Perhaps there is a hardware problem, but the laptop has been running without problems for weeks; I would expect fsck to mark the problem and move on, not freeze.

Tags: kj-expired
Revision history for this message
Theodore Ts'o (tytso) wrote :

What version of e2fsck are you using?

Revision history for this message
Reuben Firmin (reubenf) wrote :

1.39+1.40-WIP-2006.11.14+dfsg-2ubuntu1

BTW, I had to turn it off for the disks in question, as I couldn't boot the machine otherwise.

Revision history for this message
Theodore Ts'o (tytso) wrote :

What happens when you manually run the command "e2fsck -n /dev/hdXX", where /dev/hdXX is your laptop filesystem? Does anything get printed before it "freezes"? Is there any disk activity when it is "frozen"?

Revision history for this message
Reuben Firmin (reubenf) wrote :

To clarify, by "frozen", I mean that the percentage progress bar freezes at an arbitrary number (63.5%, 70%,...), that the machine does nothing for 25 minutes, and that there is no disk activity. Is it possible there is still work going on under the surface?

Here is fsck on my disks:

reuben@travel:~$ sudo e2fsck -n /dev/sda1
Password:
e2fsck 1.40-WIP (14-Nov-2006)
Warning! /dev/sda1 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sda1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 293838 has zero dtime. Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(600760--600767) -(600776--600813) -600861 -(600864--600968) -(612750--612751) -(621870--621876) -(621878--621980)
Fix? no

Free blocks count wrong (1647772, counted=1647470).
Fix? no

Inode bitmap differences: -293838
Fix? no

Free inodes count wrong (1100881, counted=1100861).
Fix? no

/dev/sda1: ********** WARNING: Filesystem still has errors **********

/dev/sda1: 120719/1221600 files (0.5% non-contiguous), 794100/2441872 blocks
reuben@travel:~$ sudo e2fsck -n /dev/sda5
e2fsck 1.40-WIP (14-Nov-2006)
Warning! /dev/sda5 is mounted.
Couldn't find ext2 superblock, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open /dev/sda5

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Revision history for this message
Theodore Ts'o (tytso) wrote :

That sounds suspiciously like Debian bug #411838, which was fixed in Debian version 1.39+1.40-WIP-2007.04.07+dfsg-1 or later.

Revision history for this message
Theodore Ts'o (tytso) wrote :

Let's see if I can confirm whether or not this is the same bug. First of all, can you reproduce it if you re-enable fsck checking?

Secondly, do you remember what percentage number it froze at, and was it always the same? If so, what was the number?

Revision history for this message
fineghal (rrr4th) wrote :

I can confirm the same problem. Same version of fsck; it runs the automatic check after 30 boots.

The check then freezes at some random percentage.

The last time this happened I had to boot a live cd and check from that. I also later tried an fsck from the cli when I'd booted, and that worked fine.

This freeze appears to only occur when running automatically at boot.

This is however a severe bug for those it affects, as you cannot boot the system.

Revision history for this message
markph (markph) wrote :

Same problem on my end. Here is sample output:

* Checking root file system...
fsck 1.40-WIP (14-Nov-2006)
/dev/sda1 has been mounted 30 times without being checked, check forced.
/dev/sda1: |========= / 15.2%

I ran manually (Live CD booted) and had no problems or errors with the disk.

Revision history for this message
fineghal (rrr4th) wrote :

The last time this happened, holding Ctrl + C/tapping repeatedly seemed to help a little bit. And by a little bit I mean it'd freeze consistently between 75-80 percent as compared to a more typical 20-30 percent.

Revision history for this message
Reuben Firmin (reubenf) wrote :

I accidentally ran shutdown -f, and almost got locked out of my laptop. I had to hard-boot 6 times before fsck would complete; the other times it would freeze at varying percentages between 30% and 70%. I gave it 10 minutes to wake up on some of the freezes, and it didn't do anything.

Since this is such a serious bug (potentially locking users out of their machines) I think you should allow the fsck step to be skipped, perhaps with a grub-like timeout. The only guaranteed way to bring back a system that has this problem is to boot with a live cd and edit fstab to remove the fsck from boot; this is not an option for many users.

Revision history for this message
Theodore Ts'o (tytso) wrote :

I'd much rather fix the bug for good. So first of all, if it always freezes at the same percentage, that is a bug that has already been fixed (see Debian bug #411838, which was fixed in Debian version 1.39+1.40-WIP-2007.04.07+dfsg-1 or later). It looks like at least some of the people who are reporting this bug are saying that it happens at a different percentage completed which changes from run to run of e2fsck. That sounds like a device driver problem, especially when people further report it's fine after the system boots, or if they are using a live CD (I assume this is a Ubuntu live CD which is the same version as your system, so it's the same version of e2fsprogs on the live CD as your system)?

I'll note that I'm not hearing this complaint from users of any other distribution, which again makes me suspicious that it's something Ubuntu specific.... like the Ubuntu kernel.

OK, so let's take this from first principles. #1, can you create a compressed e2image of your disk (see the REPORTING BUGS section of the e2fsck man page). #2, can you re-enable fsck at boot, force an fsck, by using the command tune2fs -C 99 /dev/hdXXX, and rebooting, and confirm that it is still locking up for you. #3, if it is, can you place the compressed e2image file at a URL where I can download it.

Optionally, you can unbzip the image file in a scratch filesystem where you have enough room, and try running e2fsck over that image file. If it doesn't hang when you try e2fsck'ing the e2image file, and yet it does hang at boot, then it's definitely either a hardware problem or a kernel problem, and we can redirect this bug report to the Ubuntu kernel developers.
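The offline check described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the archive name hda1.e2i.bz2, the DRY_RUN switch, and the run and check_image helpers are made up for the example, and by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: check a compressed raw e2image offline instead of the live disk.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: unpack ARCHIVE (name ending in .bz2) and check the image.
check_image() {
    archive="$1"
    image="${archive%.bz2}"
    # -k keeps the compressed copy next to the unpacked image.
    run bunzip2 -k "$archive" || return 1
    # -f forces a full check, -n opens the image read-only and answers "no".
    run e2fsck -fn "$image"
}

# Example: check_image hda1.e2i.bz2
```

If e2fsck finishes on the image but still hangs on the real disk at boot, that points away from e2fsprogs and towards the hardware or the kernel, as argued above. Note that the unpacked image can be as large as the original filesystem, so the scratch area needs that much room.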

Revision history for this message
Reuben Firmin (reubenf) wrote :

If I reboot after the tune2fs command, is my only recovery (assuming it locks) the livecd option? In which case, it'll be a week or so before I can get this, as I don't have one available.

> I'd much rather fix the bug for good.

Regardless of the fix for this bug, I think giving the user more control is always good. What if they are powering on for something mission critical and then get hit with a 10-15 minute fsck session (which can happen, if there are multiple large drives being checked)?

Revision history for this message
Theodore Ts'o (tytso) wrote :

Part of the problem that I'm concerned about is that the vast majority of Ubuntu users are less experienced than, say, Debian users. That's not a slight against Ubuntu users, but merely a statement of fact; Ubuntu has done a lot of good work to allow less experienced users to be able to install and use Linux. That is a good thing; a very good thing. But it also means that sometimes protecting users against themselves is in fact a good thing. There is a reason why there are safety interlocks on lawn mowers; giving more control to inexperienced users is not always a good thing.

So if the filesystem is corrupted such that if the system is booted, the "mission critical" application would silently give the wrong answers, or perhaps trade the wrong stocks, or give 1000 times the amount of X-rays necessary to the human body, would you really be doing the user a favor by giving them the ability to skip an fsck because they are impatient? For life and mission critical systems, usually the designers want to give less control to the users (who often are not sophisticated computer users), not more control.

If the system has to be kept running in order to keep some mission critical system going, then the right answer is to have backup systems and a high availability system (such as Linux-HA) which enables the backup when the primary system is not available. Skipping necessary filesystem checks just because "it might take too long" and allowing potential silent failures is Just A Bad Idea.

Then too, if you really want to avoid long delays due to periodic fsck's, the right answer is to use devicemapper, and have a cron script fired during the off-hours (say 1am on Sunday nights, when no one is using the system), which takes a read-only snapshot of the filesystem, and then run the e2fsck against the snapshot once a week or once a month. If there are any discrepancies detected when checking the read-only snapshot, then the script should either send e-mail to the system administrator requesting scheduled maintenance ASAP to fix the problem, or if there is a HA system running, the script should signal the HA system that it is about to take the system down, then shutdown the applications and force a reboot and fsck of the corrupted filesystem. If no errors are detected in the read-only snapshot, then the read-only snapshot can be released and "tune2fs -C 0 -T now /dev/sdXX" can be used on the original filesystem to indicate that it has been successfully checked. So there are clean ways of avoiding the slow boot-time checks while actually increasing the system reliability, besides letting a potentially clueless user skip a necessary system function out of impatience.
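The snapshot approach described above might look roughly like the following when the filesystem lives on LVM. This is a sketch, not a tested maintenance script: the volume group vg0, logical volume root, snapshot size 1G, the DRY_RUN switch, and the run and snapshot_check helpers are all placeholders. It also departs from the description in one respect: it uses an ordinary (writable) LVM snapshot so that e2fsck can replay the journal on the snapshot without touching the original volume.

```shell
#!/bin/sh
# Sketch: fsck an LVM snapshot during off-hours instead of at boot.
# VG, LV and SNAP_SIZE are placeholders; DRY_RUN=1 only prints the commands.
DRY_RUN="${DRY_RUN:-1}"
VG="${VG:-vg0}"
LV="${LV:-root}"
SNAP_SIZE="${SNAP_SIZE:-1G}"

# Hypothetical wrapper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

snapshot_check() {
    snap="${LV}-fsck-snap"
    run lvcreate -s -L "$SNAP_SIZE" -n "$snap" "$VG/$LV" || return 1

    # First pass replays the journal on the snapshot, second pass forces a
    # full check. Any non-zero exit status is treated as "needs attention".
    if run e2fsck -p "/dev/$VG/$snap" && run e2fsck -fy "/dev/$VG/$snap"; then
        # Clean: reset the mount count and last-check time on the original.
        run tune2fs -C 0 -T now "/dev/$VG/$LV"
        status=0
    else
        echo "fsck of $VG/$LV snapshot found problems; schedule maintenance" >&2
        status=1
    fi

    # Release the snapshot in either case.
    run lvremove -f "$VG/$snap"
    return "$status"
}
```

A cron entry for the off-hours would call snapshot_check with DRY_RUN=0 and mail the administrator on a non-zero exit status; signalling an HA system is left out of the sketch.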

Revision history for this message
Reuben Firmin (reubenf) wrote :

OK, latest update to this: the problem is still in gutsy. I haven't yet run fsck on the images you suggested making, but will do so.

In the meantime, my laptop was a brick for 3 days during a technical conference, when I _really_ would have appreciated having it be functional, so I'm going to argue your points. Yes, there's a bug in fsck which caused it, and sure, when you fix the bug, I won't need the skip option...until the next time there's a bug like this, when another user on another computer with some weird hardware or bios configuration hits a similar snag.

> Part of the problem that I'm concerned about is that the vast majority of Ubuntu users are less experienced than, say, Debian users.

So? As my friend points out, Windows users are a good deal less experienced than Ubuntu users, and yet *they* can skip scandisk. Heresy, I know, to compare scandisk with fsck, and yet there, users have a choice. I think to deny users a choice is anti-freedom, and autocratic. Your thinking on this is analogous to Microsoft's forcing of system updates on its users; "we know best; this is for your own good; shut up and swallow it, buckwad."

I am not suggesting that the option to skip fsck be so obvious as to make it easy for "noobs" to cancel it every time. In fact, you could even display horrible warnings when the user does skip it. But, if the user wants to completely wreck their computer by skipping maintenance steps, then let them. You are not their parent, nanny, dictator, or any other authority.

> So if the filesystem is corrupted such that if the system is booted, the "mission critical" application would silently give the wrong answers, or perhaps trade the wrong stocks, or give 1000 times the amount of X-rays necessary to the human body, would you really be doing the user a favor by giving them the ability to skip an fsck because they are impatient?

But we're not talking about computers that are monitoring nuclear power plants, or running vital infrastructure; we're talking about average everyday joe who wants to check his email, show off some presentations at work, write documents, etc. Just as you wouldn't require average everyday joe to fill out a 30 point checklist every time they boot the system to make sure everything is in order, you shouldn't force "mission critical" level maintenance checks on him either.

> Then too, if you really want to avoid long delays due to periodic fsck's, the right answer is to use devicemapper, and have a cron script fired during the off-hours (say 1am on Sunday nights, when no one is using the system)

This is not the right answer on a laptop, or an average user's desktop. In these cases, the user powers down their computer on a regular basis.

There are two ways I could see doing the workaround (which is completely separate from the issue of this bug), which would make power users happy, and keep your noobs in line:
1) Add a boot option that skips fsck. Perhaps "safe mode" on ubuntu would include this, perhaps not.
2) Add a thread that listens for a key sequence (ctrl c?); when it detects the sequence, display a nasty message, and only abort the scan if the user confirms that they're willing to die fo...


Revision history for this message
Theodore Ts'o (tytso) wrote :

Well, for someone who knows what they are doing, they *can* skip the check. They can simply boot into single user mode, and use tune2fs to adjust the mount count. Or, if you are booting using a plain text console, just edit /etc/e2fsck.conf, add:

[options]
   allow_cancellation = true

This will allow ^C to work. However, if you are using a graphical boot, it's up to the graphical boot manager to forward the ^C to e2fsck, which is not my problem. Personally, I think it's a REALLY, REALLY BAD IDEA, given my understanding of what Ubuntu is aiming for. There are solutions for technically clueful users; they're just not obvious, as you proposed. If you don't know to boot into single user mode, you probably don't have enough clue to understand when it's safe to do this, and when not to.

Also, note that for users who are complaining that fsck is freezing at "random places", there's almost certainly some kind of hardware and/or kernel bug going on. Bypassing the fsck isn't going to help, and in fact, may cause much larger forms of data loss later on.

Revision history for this message
Reuben Firmin (reubenf) wrote :

> Well, for someone who knows what they are doing, they *can* skip the check. They can simply boot into single user mode...

Nope, that still hits fsck. The only way I've been able to recover the system when it is in "brick" mode (i.e. fsck is queued) is using a live cd, and then adjust tune2fs or fstab as you suggest.

The config option is a new one to me, so thanks for that info. As you say, though, it's dependent on not being in a graphical boot...which ubuntu is. Still, I'll add a bug to the bootsplash folks and see what they suggest.

Revision history for this message
Theodore Ts'o (tytso) wrote :

Ah, Ubuntu must set up their scripts so that even in single user mode, it doesn't bypass fsck. Well, that's what init=/bin/sh is for (although this might not work if your boot setup requires an initrd ramdisk image --- but that's a boot scripts issue).

BTW, if you don't know how to change Ubuntu to use a text boot (by editing /boot/grub/menu.lst) I would again question whether you know enough not to screw yourself up by bypassing necessary system functions willy-nilly. A little knowledge can be a dangerous thing. It's probably more OK for Windows because Windows users are used to arbitrarily losing all of their data.... :-)

Can we get back to why your system was hanging in fsck in the first place? What version of e2fsck did you have installed on your system, and was it always hanging at the same percentage level?

Revision history for this message
Reuben Firmin (reubenf) wrote :

I will help you with debugging info in a couple hours...but just to respond to:

> BTW, and if you don't know how to change Ubuntu to use a text boot (by editing /boot/grub/menu.lst) I would again question whether you know enough not to screw yourself up by bypassing necessary system functions willy-nilly.

I do, and I often edit grub boot options on the fly. However, I don't like touching menu.lst, because apt-get tends to blow away my changes when it ups the version of grub, and/or makes new kernel images.

Revision history for this message
Theodore Ts'o (tytso) wrote :

If you edit the right sections of the menu.lst file (the automatic options sections), then apt-get won't blow away your changes when you install new kernel versions.

Revision history for this message
Bryce Harrington (bryce) wrote :

I encountered this bug as well, on Ubuntu Gutsy. It consistently "freezes" at exactly 34.2%. Booting into a LiveCD and running fsck manually on the partition resulted in a freeze at the same point. This is using 1.40.2-1ubuntu1 (same on livecd and booted system).

Changed in e2fsprogs:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Bryce Harrington (bryce) wrote :

Attached is some basic system info. Let me know if there's other info that'd be useful in debugging this problem.

This is one of the 1505n Dell laptops sold with Ubuntu preinstalled. I restored the system to the factory image, then upgraded to Gutsy (or possibly gutsy-rc) about 2-3 weeks ago. As far as I remember, this is the first time it hit fsck time since the installation. I've used the tune2fs trick to get around it for now.

This particular boot followed a lockup after doing a suspend (which I suspect may be a compiz conflict), when I had to force a power cycle.

Revision history for this message
Theodore Ts'o (tytso) wrote :

Hi Bryce,

It's very likely your problem is different from the problems reported by people pre-e2fsprogs 1.40.

Please refer to the section REPORTING BUGS in the e2fsck man page. In particular, what would be most helpful is the output of dumpe2fs and a compressed raw e2image snapshot of your filesystem. i.e., replacing /dev/hda1 with your device, something like this:

            e2image -r /dev/hda1 - | bzip2 > hda1.e2i.bz2

Please note the privacy implications of this command as documented in the e2image man page before making it available in a public location such as using a launchpad attachment. In particular, a raw e2image file will expose the names of the files in your filesystem, although not the content of the files themselves. Hence, if you have any files with names such as "ch1ld pr0n" or "secr1t Al Quaeda attack plans" or even "letter to my mistress" :-), please make sure you understand what information you are making available when you send me such a file. If you want to send me private e-mail with download information, I will honor any confidentiality request you might make of me, and I won't look at any filename information except as might become available as I am debugging your problem.

Best regards,

Revision history for this message
Reuben Firmin (reubenf) wrote :

Theo, I still haven't gotten time to run a check on an image (I'll aim to do it this afternoon) but:

I encountered the freeze-on-boot problem this morning, booted with a (feisty) live cd, and from the cd ran fsck. It did not freeze when invoked like that.

Revision history for this message
Bryce Harrington (bryce) wrote :

Hi Theodore,

Attached is a copy of the e2image. Sorry it's taken this long - I needed to wait until I could reimage the machine.

Revision history for this message
Bryce Harrington (bryce) wrote :
Revision history for this message
Theodore Ts'o (tytso) wrote :

Hi Bryce,

I've received your e2i file and e2fsck doesn't hang when I check it. This makes it very likely that what you have is a hardware and/or kernel bug. Does it still hang when you run e2fsck on your machine?

Revision history for this message
Theodore Ts'o (tytso) wrote :

Since no one else is reporting this problem from any other distribution, all evidence points to a Ubuntu-specific kernel problem. I will likely be transferring this bug to the kernel package in the absence of any evidence to the contrary.

Revision history for this message
Bryce Harrington (bryce) wrote :

Hi Theodore,

Yes, even after a reimage the issue still happens.

Sorry to hear it seems like a kernel issue. I guess I'll just disable fsck on this system for now.

Bryce

Revision history for this message
EKM (erik-mitchell) wrote :

Does anyone know if this problem is filed as a bug anywhere else? Is it a bug for the linux-image package? I'm going to be disabling fsck on my drive but would like to follow this to see when it's fixed, so I can reenable... I assume this bug will be closed when everything is worked out?

Thanks,

Revision history for this message
Theodore Ts'o (tytso) wrote :

I'm pretty sure it's a bug for the linux-image package, since no other distributions are reporting anything even vaguely like this. And there appears to be more than one Ubuntu user seeing it. (I use Ubuntu, but I use my own kernel).

It might be useful if people who are reporting this problem could compare notes amongst themselves about which Ubuntu kernel they are using. One of the unfortunate bits of Launchpad is that it doesn't automatically collect information such as "which kernel are you using", and of course, when people chime in with "me too", there is no automated collection of which versions of any packages are running on their system.

Revision history for this message
EKM (erik-mitchell) wrote :

The other problem with this one is it, for most users, only happens after every 20th reboot, which can be a cycle of more than a month, easily. I'm not sure what my current remount count is on my laptop. I'll post back with my current kernel version when I have my laptop back up.
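The mount counters mentioned above can be read without waiting for the next forced check. A minimal sketch, assuming the usual "Mount count" and "Maximum mount count" fields in the `dumpe2fs -h` header; the helper name mounts_until_check is made up for illustration, and /dev/sda1 is only an example device.

```shell
#!/bin/sh
# Hypothetical helper: reads `dumpe2fs -h` output on stdin and prints how many
# more mounts are left before e2fsck forces a check at boot.
mounts_until_check() {
    awk -F: '
        /^Mount count:/         { count = $2 + 0 }
        /^Maximum mount count:/ { max = $2 + 0 }
        END {
            # A maximum of 0 or -1 means the mount-count check is switched off.
            if (max <= 0) print "disabled"
            else          print max - count
        }
    '
}

# Typical use (needs root):
#   sudo dumpe2fs -h /dev/sda1 2>/dev/null | mounts_until_check
```

A result of zero or less means the next boot should trigger the check.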

Revision history for this message
Theodore Ts'o (tytso) wrote :

Well, you can force it to happen more frequently by changing the number of mounts between filesystem checks using tune2fs. The point is that given that this is almost certainly a hardware- and kernel-specific problem, someone who is experiencing the problem is either going to need to help debug it, or someone is going to have to pay $$$ to a company who is willing to provide professional support for an ubuntu distribution, just like customers of RHEL or SLES pay Red Hat, Novell, or IBM for that kind of professional level support. Or, you will need to find some volunteer with time who is willing to provide professional-level support for free. Unfortunately, I don't have the time to do that kind of in-depth support where I can't reproduce the problem on my local systems and a support person needs to work with the customer step by step to try to reproduce the problem and fix the problem.

PS, for each person who is saying, "I'm seeing the same problem" --- if the percentage complete is always freezing at the same level, and the version of e2fsprogs is older than 1.40, then it's a different problem. You need to upgrade to a newer version of e2fsprogs. If it is freezing at a seemingly random percentage which changes from boot to boot, then it's probably a kernel level problem. There are two bugs which users are conflating into the same bug report, just because it superficially has the same failure and they are just trying to find similar bug reports using google or some other search engine.
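The rule of thumb in the PS above can be written down as a small decision helper. This is only a sketch of the reasoning, not a diagnostic tool: the function name triage and its output labels are invented for the example, and it relies on `sort -V` for version comparison, which is available in GNU coreutils but not guaranteed everywhere.

```shell
#!/bin/sh
# Hypothetical helper. Usage: triage E2FSPROGS_VERSION PERCENT [PERCENT...]
# where each PERCENT is the progress value at which one boot froze.
triage() {
    version="$1"; shift
    first="$1"
    same=1
    for p in "$@"; do
        [ "$p" = "$first" ] || same=0
    done

    if [ "$same" = 0 ]; then
        # Freezes at varying percentages: probably kernel or hardware.
        echo "kernel-or-hardware"
        return
    fi

    # Same percentage every time: an already fixed bug if e2fsprogs is older than 1.40.
    oldest=$(printf '%s\n' "$version" "1.40" | sort -V | head -n 1)
    if [ "$oldest" = "$version" ] && [ "$version" != "1.40" ]; then
        echo "upgrade-e2fsprogs"
    else
        # Newer e2fsprogs and a repeatable freeze: an e2image is needed to say more.
        echo "needs-e2image"
    fi
}

# Example: triage 1.39 63.5 70
```

With a single observation the helper cannot tell the two cases apart, so at least two recorded freezes are needed for the result to mean anything.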

Revision history for this message
Reuben Firmin (reubenf) wrote :

Theo reckons this is a kernel issue, so I'm assigning to you guys. It makes my laptop inoperable. Please debug!

Changed in e2fsprogs:
assignee: nobody → canonical-kernel-team
Revision history for this message
Theodore Ts'o (tytso) wrote :

Not an e2fsprogs bug. Seems to be a kernel bug.

Changed in e2fsprogs:
status: Confirmed → Invalid
Revision history for this message
Sergio Zanchetta (primes2h) wrote :

The 18 month support period for Gutsy Gibbon 7.10 has reached its end of life -
http://www.ubuntu.com/news/ubuntu-7.10-eol . As a result, we are closing the
linux-source-2.6.22 kernel task. It would be helpful if you could test the
new Jaunty Jackalope 9.04 release and confirm if this issue remains -
http://www.ubuntu.com/getubuntu/releasenotes/904overview. If the issue still exists with the Jaunty
release, please update this report by changing the Status of the "linux (Ubuntu)"
task from "Incomplete" to "New". Also please be sure to run the command below
which will automatically gather and attach updated debug information to this
report. Thanks in advance.

apport-collect -p linux-image-2.6.28-11-generic 124773

Changed in linux-source-2.6.22 (Ubuntu):
status: New → Won't Fix
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Przemek K. (azrael) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. We are sorry that we do not always have the capacity to look at all reported bugs in a timely manner.
There have been many changes in Ubuntu since the time you reported the bug and your problem may have been fixed with some of the updates. It would help us a lot if you could test the current Ubuntu development version (10.04). If you can test it, and it is still an issue, we would appreciate it if you could upload updated logs by running apport-collect 124773, and any other logs that are relevant for this particular issue.

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu release http://www.ubuntu.com/getubuntu/download . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired