suspend corrupted my hard-disk (VERY BAD)

Bug #68490 reported by stacktracer
Affects                        Status         Importance   Assigned to   Milestone
linux (Ubuntu)                 Fix Released   Undecided    Unassigned
linux-source-2.6.20 (Ubuntu)   Won't Fix      Undecided    Unassigned

Bug Description

There's a key on my keyboard with a crescent-moon picture that, by default on Edgy, makes my system suspend to RAM. I tried it for the first time yesterday.

After suspending this way, my machine appeared to resume normally. However, the subsequent reboot failed, dumping me into an initramfs shell.

I booted from the live CD, choosing the rescue option. It tried to run a shell from my hard disk, but encountered some error executing sh (the error message didn't have any helpful details).

I booted from the live CD again, choosing the default option. From the live-CD environment, I tried mounting my data partition. Reinstalling the system is no big deal, as long as my data is okay.

The data partition wouldn't mount. Mount just hung.

Reiserfsck said the superblock was corrupted. After pondering my options, I did "reiserfsck --rebuild-sb", which succeeded, but suggested running again with "--rebuild-tree". I did "--rebuild-tree", which succeeded, but dumped many, many files into the "lost+found" directory on the partition. I don't have any way to check whether any files are permanently gone.

So my first question is:
   * Is there a way to altogether disable suspend and hibernate?

My other questions are:
   * Why would suspend overwrite the superblock of my data partition?
   * Why would suspend write to a partition other than the swap partition?
   * Since this is suspend to RAM, why would it touch the disk at all?
   * Is there any chance that the superblock was the only thing overwritten, or did I (as I assume) lose some files as well?
   * Why is a feature that is so potentially destructive the DEFAULT BEHAVIOR FOR A KEY ON THE KEYBOARD??

Notes:
   * This is with the new Edgy release, on a machine with two dual-core opterons.
   * The volume that got overwritten was a RAID1 volume. Both physical partitions of the RAID volume got overwritten.

stacktracer (stacktracer) wrote :

I figured I would try suspend again now, since I have a fresh backup of my data (at least, as much as was recoverable after last time). Same thing happened:

   * When I push the crescent-moon button, the screen fades to black, then the machine seems to turn off (power LED goes from steady-on to blinking).

   * I push the machine's power button, and almost immediately get the screen I had before suspending. No BIOS-POST, no grub; just directly to the suspended state.

   * Then the hard-disk LED comes on and stays on.

   * From a terminal, I ran "sudo reboot". Got a dozen or so "unrecognized command" error messages, as if the reboot executable were being interpreted by a shell. Very weird. Halt does the same thing.

Any suggestions of things to try, while my system is already broken?

stacktracer (stacktracer) wrote :

Same thing happens with the i386 install and generic kernel. Will try the i386 kernel.

stacktracer (stacktracer) wrote :

Left something out. What I should have said was:

Same thing happens with the i386 install and generic kernel, and with no RAID.

Will try the i386 kernel.

stacktracer (stacktracer) wrote :

Same problem with the i386 kernel. So it's not RAID, or SMP, or 64-bit.

stacktracer (stacktracer) wrote :

My attempts to disable this feature have thus far been thwarted:

   * The gnome-session package depends on the acpi packages that do the suspend/hibernate stuff. So I can't uninstall those without some pain.

   * I tried booting with "acpi=off" ... but the kernel can't handle it. It appears to cycle through PCI addresses, sending a command to an address and waiting for it to time out. Maybe that's the only way it can detect PCI devices without ACPI? I don't know. I waited about 15 minutes and it still hadn't finished booting. So I can't disable ACPI.

   * If I remove gnome-power-manager from my gnome session, that keeps the crescent-moon key from doing anything, and the suspend/hibernate options don't show up in the quit dialog. HOWEVER, bringing up the quit dialog starts gnome-power-manager if it's not already running. So the second time the quit dialog comes up, the suspend/hibernate options are there.

stacktracer (stacktracer) wrote :

Ignore my last post. What I'm looking for is in /etc/default/acpi-support.

Sorry for the ignorant tirade; this whole thing has me pretty grumpy.

Christoffer Karvonen (xopher) wrote :

This happened to me too. Running 64-bit Ubuntu Edgy with the latest generic kernel.

Is this a new feature? I mean suspending by pressing the key on the keyboard. I've been using this key for locking my screen, since I hardly ever turn off my computer. So the follow-up question would be: How can I disable this 'feature'?

I'm using ext3, and at the following boot it finds massive errors and fsck starts fixing them. Took me almost 2 hrs last time.

" * If I remove gnome-power-manager from my gnome session, that keeps the crescent-moon key from doing anything, and the suspend/hibernate options don't show up in the quit dialog. HOWEVER, bringing up the quit dialog starts gnome-power-manager if it's not already running. So the second time the quit dialog comes up, the suspend/hibernate options are there."
^ I guess this would be a temporary solution for me, but I'd like a better, real solution for the problem.

stacktracer (stacktracer) wrote :

In /etc/default/acpi-support, there's an option to use S1 sleep ("standby") instead of S3 ("suspend to ram"). S3 was the culprit for me; S1 works fine.

What motherboard/chipset/etc. did you encounter the problem on?

(The machine that gave me trouble is an IBM with two dual-core Opterons. I don't know any more specific info off the top of my head, and I don't have access from where I am now to that machine. I will check my motherboard/chipset/etc. tomorrow.)

stacktracer (stacktracer) wrote :

My machine is an IBM IntelliStation A Pro, Type 6217.

It has an IBM motherboard, which doesn't seem to have any model number or identification other than "the motherboard for the IntelliStation A Pro, Type 6217."

Chipset is AMD-8111 (and AMD-8131 for PCI-X).

Peter Whittaker (pwwnow) wrote :

Folks, this is an interesting one; additional information will be required to chase it down. If at all possible, please update the report with the following information:

The output from "uname -a", in the body of the report;

The output of "sudo lspci -vv", attached to the report (e.g., run "sudo lspci -vv > /tmp/lspci-vv" and attach lspci-vv to the report; do not compress, do not include in-line);

The output of "sudo lspci -vvn", attached to the report (as above);

The output of "sudo dmidecode", attached to the report (ditto); and, if you can,

Relevant output from performing the tests on https://wiki.ubuntu.com/KernelSuspendDebugging
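
The first four items above can be gathered in one pass. This is only a minimal sketch: the function names save_output and collect_report_info and the default directory /tmp/bug-68490 are made up for illustration, and it runs nothing beyond the commands quoted above.

```shell
# Hypothetical helper: run a command and save its stdout to a file.
# usage: save_output FILE COMMAND [ARGS...]
save_output() {
    out=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$out" || echo "warning: '$*' exited non-zero" >&2
    else
        echo "warning: $1 not found, skipping" >&2
    fi
}

# Hypothetical wrapper: collect the requested outputs as separate,
# uncompressed files ready to attach to the report.
collect_report_info() {
    dir=${1:-/tmp/bug-68490}
    mkdir -p "$dir" || return 1
    save_output "$dir/uname-a"   uname -a
    save_output "$dir/lspci-vv"  sudo lspci -vv
    save_output "$dir/lspci-vvn" sudo lspci -vvn
    save_output "$dir/dmidecode" sudo dmidecode
    echo "Files to attach are in $dir"
}
```

Calling collect_report_info with no argument writes to the assumed default directory; pass another path to write elsewhere. The "uname -a" output is short enough to paste into the body of the report, as requested above.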

Thanks all, your assistance is greatly appreciated!

Changed in linux-source-2.6.17:
status: Unconfirmed → Needs Info
stacktracer (stacktracer) wrote :

Sorry for the long silence. Since this is a destructive problem, I have to shuffle hard disks around before reproducing it, so it got put on the back burner.

The following info is with Feisty beta (not Edgy), but the problem still occurs, just like on Edgy.

$ uname -a
Linux sleeper 2.6.20-12-generic #2 SMP Wed Mar 21 19:34:23 UTC 2007 x86_64 GNU/Linux

lspci -vv output is attached. Other requested attachments to follow.

I followed the instructions at https://wiki.ubuntu.com/DebuggingKernelSuspend , but found no "smoking gun" lines in dmesg. All that showed up in dmesg was:

    Magic number: 0:798:917
    hash matches drivers/base/power/resume.c:46
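
For anyone repeating the test, the two kinds of line quoted above can be pulled out of the kernel log with a simple filter. A small sketch, with the function name pm_trace_lines being hypothetical and the patterns being nothing more than the two strings shown in this comment:

```shell
# Hypothetical filter: keep only the resume-trace lines from a kernel log
# read on standard input.
pm_trace_lines() {
    grep -E 'Magic number:|hash matches'
}

# Typical use after rebooting from a failed resume:
#   dmesg | pm_trace_lines
```

If the filter prints nothing, the log contains neither string, so no trace information was recorded for that boot.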

Peter Whittaker (pwwnow) wrote :

Marking confirmed and assigning to appropriate kernel based on attached information.

Can someone with QA or Dev privileges please mark this as Critical, since this is a data loss bug? Thanks....

Changed in linux-source-2.6.17:
assignee: nobody → ubuntu-kernel-team
status: Needs Info → Confirmed
BobNJ (bob-nj) wrote :

I have the same problem as the original poster stated in the title. Suspend (to RAM) corrupted my hard disk: my / partition became corrupted, suspend would not resume, and the subsequent restart failed at the grub stage, with grub unable to find its own /boot/grub files. Given that my machine also hosts a win2k partition, the hanging grub and missing dear Ubuntu was a shock.

The / partition was beyond repair, I managed to fsck it from live cd, but the reboot from it was not possible, so I reinstalled ubuntu and will now post here the machine details and outputs as instructed above.

Given my machine is home built, I did not really expect the suspend to work flawlessly (or at all), but the filesystem corruption is a bit too much... Machine behaves flawlessly in all other aspects, both under win2k and ubuntu...

And now, the details:
user@user-desktop:~$ uname -a
Linux user-desktop 2.6.22-14-generic #1 SMP Thu Jan 31 23:33:13 UTC 2008 x86_64 GNU/Linux

captaintrav (captaintrav) wrote :

I have experienced the exact same thing as the last commenter, BobNJ, twice (I'm not a fast learner). I don't have any other OS, but the machine works fine otherwise.
After changing /etc/default/acpi-support to use ACPI_SLEEP_MODE=standby, it works fine, although no suspend-to-ram (obviously).
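
For anyone wanting to check which mode a machine is set to before pressing the key, a minimal sketch follows. It assumes the setting appears in /etc/default/acpi-support as a plain ACPI_SLEEP_MODE=value line, as named above; the function name acpi_sleep_mode is made up for illustration.

```shell
# Hypothetical helper: print the configured ACPI_SLEEP_MODE value.
# usage: acpi_sleep_mode [CONFIG_FILE]
acpi_sleep_mode() {
    conf=${1:-/etc/default/acpi-support}
    # Use the last uncommented assignment and strip any surrounding quotes.
    grep '^ACPI_SLEEP_MODE=' "$conf" | tail -n 1 | cut -d= -f2 | tr -d "\"'"
}

# Example: acpi_sleep_mode
# "standby" is the value reported in this thread to avoid the corruption.
```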

travis@tbird:~$ uname -a
Linux tbird 2.6.22-14-generic #1 SMP Tue Feb 12 02:46:46 UTC 2008 x86_64 GNU/Linux

STR not working, I can live with, but the disk corruption did lose some significant amounts of data for me, although it wasn't critical. This is a scary problem, especially since STR is the default setting, unlike on Win32.

Adam Petaccia (mighmos) wrote :

I've had the exact same problem. STR will corrupt the super block off of the bootable linux partition, and the other drives just wind up corrupted. So /dev/sd[boot] winds up having to be reformatted (because everything is in /lost+found), but everything is okay after the journal recovers. Other not in use partitions (such as Windows partitions) are unaffected, except for the fact that my boot loader can't load until I reinstall Ubuntu.

Adam Petaccia (mighmos) wrote :

[edit] Turns out everything isn't okay. Just lots of corruption.

Adam Petaccia (mighmos) wrote :

Sorry for the spam and not following directions. BTW: setting mem=standby did not help.

Adam Petaccia (mighmos) wrote :

Linux belthazor-saved 2.6.24-18-generic #1 SMP Wed May 28 19:28:38 UTC 2008 x86_64 GNU/Linux

Launchpad Janitor (janitor) wrote : This bug is now reported against the 'linux' package

Beginning with the Hardy Heron 8.04 development cycle, all open Ubuntu kernel bugs need to be reported against the "linux" kernel package. We are automatically migrating this bug to the new "linux" package. However, development has already begun for the upcoming Intrepid Ibex 8.10 release. It would be helpful if you could test the upcoming release and verify if this is still an issue - http://www.ubuntu.com/testing . If the issue still exists, please update this report by changing the Status of the "linux" task from "Incomplete" to "New". We appreciate your patience and understanding as we make this transition. Thanks!

Adam Petaccia (mighmos) wrote :

I just got a new hard drive in, so I will try to test 8.10 STR and STD on it within the next 1-3 weeks.

Adam Petaccia (mighmos) wrote :

I couldn't test, because my DVDRW drive nearly exploded when I tried to boot Intrepid (hardware failure, not software). But looking at the kernel changelog, commit 20ed5d71446c103126c14a37560d3697a5287493 (Ubuntu's kernel git) seems to address this problem, and the related link http://lkml.org/lkml/2008/5/25/96 seems to be the same issue.

I'm just scared to lose everything again when I try this. But if anyone braver wants to try...

Adam Petaccia (mighmos) wrote :

I made backups, and I just tried hibernating and suspending my machine with a custom hardy git kernel, and I could hibernate without problems. Although I couldn't successfully resume due to an issue with nVidia's binary driver I still suffered no data corruption.

I'm typing this on my machine which has successfully hibernated and come back 3 times now.

Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Adam Petaccia (mighmos) wrote :

I am no longer affected by this bug in Intrepid.

Jouni Mettala (jouni-mettala) wrote :

Marking fix released based on last comment. Thanks.

Changed in linux:
status: Incomplete → Fix Released
Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.
