resume hangs late in the process appear working and yet report failure on next reboot

Bug #335323 reported by Matt Zimmerman
This bug affects 1 person
Affects            Status        Importance  Assigned to     Milestone
apport (Ubuntu)    Fix Released  Medium      Andy Whitcroft
pm-utils (Ubuntu)  Invalid       Medium      Andy Whitcroft

Bug Description

If a resume hangs late enough in the resume cycle, the resume may appear to the user in front of the machine to have completed successfully. Their screen saver may appear, and they can log in and use the machine quite normally. However, the suspend really is not complete and really is broken. Any attempt to suspend again will apparently do nothing (other than perhaps lock the screen). When the user next reboots, the suspend will be detected as failed and reported. Although accurate, this is likely to make the reporting user think it is a false report.

We really need to detect the difference and report it differently to avoid confusing the user. It also gives us a good opportunity to record more information, as the machine is likely functional. This implies that the original report, which I am hijacking here, is likely showing a real failure. But we are unlikely to figure out what failed from this report, as the information is long gone by the time the user sees the problem.

===

I think this is actually a false positive, but I'm reporting it as I think that would indicate a bug as well. This report was created on a fresh boot after a successful shutdown:

-rw------- 1 root root 520082 2009-02-27 08:13 /var/crash/susres.2009-02-27_08:08:59.566718.crash

I don't think there was any failure in this case.

ProblemType: KernelOops
Annotation: This occured during a previous suspend and prevented it from resuming properly.
Architecture: amd64
DistroRelease: Ubuntu 9.04
ExecutablePath: /usr/share/apport/apportcheckresume
Failure: suspend/resume
InterpreterPath: /usr/bin/python2.5
MachineType: LENOVO 6465CTO
Package: linux-image-2.6.28-8-generic 2.6.28-8.26
ProcAttrCurrent: unconfined
ProcCmdLine: root=UUID=305dde78-d20a-4248-aaf4-09447b7c5791 ro quiet splash
ProcCmdline: /usr/bin/python /usr/share/apport/apportcheckresume
ProcEnviron: PATH=(custom, no user)
ProcVersionSignature: Ubuntu 2.6.28-8.26-generic
SourcePackage: linux
Tags: resume suspend
Title: [LENOVO 6465CTO] suspend/resume failure
UserGroups:

Matt Zimmerman (mdz) wrote :

One suggestion might be to clean up the pm-utils state file during shutdown, so that if the user reboots or shuts down cleanly, it won't trigger a report like this.

However, we still need to find out how I ended up with a state file indicating the system was suspending, when it was clearly up and running.
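A minimal sketch of the cleanup suggested above, assuming the state is a single marker file. The function name and the `state_path` argument are hypothetical; the actual location of the pm-utils state file is not given in this report, so it is passed in by the caller.

```python
import os


def clear_stale_suspend_state(state_path):
    """Remove a leftover suspend/hibernate state file during a clean shutdown.

    Returns True if a stale file was found and removed, False if there was
    nothing to clean up. `state_path` is a hypothetical location supplied by
    the caller; it is not taken from pm-utils itself.
    """
    try:
        os.remove(state_path)
    except FileNotFoundError:
        # No marker present: the last suspend/resume finished normally.
        return False
    return True
```

Called from a shutdown script, this would stop a clean reboot from triggering a report, but it would also discard the evidence that something left the marker behind, which is the open question in this comment.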

Sergio Rubio (rubiojr) wrote :

I had the crash resuming on a Lenovo ThinkPad X61s, model 7668CTO

Matt Zimmerman (mdz) wrote : Re: [Bug 335323] Re: (false positive?) [LENOVO 6465CTO] suspend/resume failure

On Fri, Feb 27, 2009 at 08:27:27AM -0000, rubiojr wrote:
> I had the crash resuming on a Lenovo ThinkPad X61s, model 7668CTO
>
> ** Attachment added: "report attached"
> http://launchpadlibrarian.net/23171911/susres.2009-02-27_09%3A19%3A20.337714.crash

Please file that separately, as yours appears to be a genuine report.

--
 - mdz

Sergio Rubio (rubiojr) wrote : Re: (false positive?) [LENOVO 6465CTO] suspend/resume failure

Thanks Matt, done.

Martin Pitt (pitti) wrote :

Andy, can you please have a look at this?

Changed in linux:
assignee: nobody → apw
assignee: apw → nobody
Changed in apport:
assignee: nobody → apw
Andy Whitcroft (apw) wrote :

@Martin -- sure ... thanks for the pointer.

@Matt -- could you confirm if you recently had had a kernel update? And might you have attempted to suspend while a reboot was pending for it? I am suspicious of an interaction there.

Matt Zimmerman (mdz) wrote : Re: [Bug 335323] Re: (false positive?) [LENOVO 6465CTO] suspend/resume failure

On Fri, Feb 27, 2009 at 07:32:54PM -0000, Andy Whitcroft wrote:
> @Matt -- could you confirm if you recently had had a kernel update? And
> might you have attempted to suspend while a reboot was pending for it?
> I am suspicious of an interaction there.

I had upgraded many times without rebooting, so it's very likely that the
kernel had been updated and a reboot was pending.

--
 - mdz

Andy Whitcroft (apw)
Changed in linux:
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → apw
Changed in apport:
importance: Undecided → Medium
status: New → In Progress
Andy Whitcroft (apw) wrote : Re: (false positive?) [LENOVO 6465CTO] suspend/resume failure

OK, I have managed to hit the same issue here. There is definitely a tie-in between a kernel update disabling suspend, an attempt to suspend, and a subsequent reboot. Will look at this interaction.

TJ (tj)
tags: added: test-suspend-seconds
Andy Whitcroft (apw) wrote :

In the interim it seems reasonable to also prevent these false positives. When the system is being shut down normally, we know that there cannot be a suspend/hibernate in progress. Should we find an ongoing suspend/resume indicated, we know it is a false indication and can clear it out.

Produced an apport topic branch with this work-around; see related branches.

Andy Whitcroft (apw) wrote :

This cannot be a kernel bug. We are recording that we entered suspend (or hibernate), somehow not going there, and yet the state remains recorded. This would be a pm-utils bug.

Changed in linux:
status: In Progress → Invalid
Changed in pm-utils:
status: Invalid → In Progress
Andy Whitcroft (apw) wrote :

Tested the apport changes in my PPA. Proposing this change for merge to apport trunk.

Andy Whitcroft (apw)
description: updated
Andy Whitcroft (apw) wrote :

OK. In the process of testing this I think I have hit the actual cause of the issue. If a resume fails late on, say while resuming bluetooth, then the resume may have gotten far enough that the user in front of the screen perceives the machine to be fully functional. They may well see the screen lock and be able to log in. About the only thing which will not work is a second attempt to suspend, which should at most be able to lock the screensaver. If the user (even much) later reboots the machine, the failure will be detected and reported. The user has not seen any problem and reasonably will believe that the suspend/resume failure report is false.

We need to enhance the reporting to catch this case and record the information we need as soon as this is detected. Pretty much we can only tell when we are asked to shut down. At this point we can see that there is still a suspend/hibernate in progress, and we are in a position to check for the presence of the hanging processes. We should also report this as a different type of failure, to make it clear that it was the resume that failed and that the machine would have appeared just fine. We should also be recording the basic suspend logs in the apport bugs.
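The two-stage check described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that went into apport: the function names, the `state_path` and `hang_log_path` arguments, and the report labels are all hypothetical, and the process snapshot simply uses the standard `ps -ef`.

```python
import os
import subprocess


def snapshot_processes():
    """Return the current process list as text, using the standard `ps`."""
    return subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=False
    ).stdout


def record_hang_at_shutdown(state_path, hang_log_path, lister=snapshot_processes):
    """At shutdown: if a suspend is still marked in progress, save evidence.

    A marker that survives until shutdown means the resume never finished,
    so the process list is captured while the hung processes still exist.
    Returns True when a hang was recorded.
    """
    if not os.path.exists(state_path):
        return False
    with open(hang_log_path, "w") as log:
        log.write(lister())
    return True


def classify_on_boot(state_path, hang_log_path):
    """On the next boot: decide which kind of report to file, if any.

    The labels returned here are hypothetical names for the two cases.
    """
    if not os.path.exists(state_path):
        return None  # last suspend/resume completed normally
    if os.path.exists(hang_log_path):
        return "late-resume-hang"  # machine looked fine to the user
    return "suspend-resume-failure"  # machine never came back
```

Passing the process lister as a parameter keeps the shutdown step testable without a real hung resume; the split into two functions mirrors the point above that the hang can only be observed at shutdown, while the report is filed on the following boot.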

vlowther (victor-lowther) wrote :

If you want to pinpoint exactly where in the resume process it is hanging, and you know it is after the kernel handed control back to pm-utils, you can run pm-suspend with PM_DEBUG=true in the environment -- this will cause pm-suspend and all the hooks to be traced.

This is probably one of the hooks hanging, causing the resume process to halt, which prevents pm-utils from releasing its lock file.

vlowther (victor-lowther) wrote :

Matt, if you can reliably reproduce the issue can you attach your /var/log/pm-suspend.log file to this bug?

Also after rebooting, can you try running

PM_DEBUG=true pm-suspend

as root and attach the /var/log/pm-suspend.log that creates to this report?
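For anyone scripting this test, the same invocation can be wrapped so the variable is set only for the child process. A minimal sketch; `run_with_pm_debug` is a hypothetical helper name, and only the `PM_DEBUG=true` setting and the `pm-suspend` command come from this thread.

```python
import os
import subprocess


def run_with_pm_debug(command):
    """Run `command` with PM_DEBUG=true added to a copy of the environment.

    The caller's own environment is left untouched. Returns the
    CompletedProcess so the exit status and output can be inspected.
    """
    env = dict(os.environ, PM_DEBUG="true")
    return subprocess.run(command, env=env, capture_output=True, text=True)
```

On a test machine this would be called as `run_with_pm_debug(["pm-suspend"])` from a root session, after which /var/log/pm-suspend.log can be attached as requested above.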

Matt Zimmerman (mdz) wrote : Re: [Bug 335323] Re: resume hangs late in the process appear working and yet report failure on next reboot

On Fri, Mar 27, 2009 at 03:38:49PM -0000, vlowther wrote:
> Matt, if you can reliably reproduce the issue can you attach your
> /var/log/pm-suspend.log file to this bug?
>
> Also after rebooting, can you try running
>
> PM_DEBUG=true pm-suspend
>
> as root and attach the /var/log/pm-suspend.log that creates to this
> report?

I haven't been able to reliably reproduce this, though it has happened to me
more than once.

I've modified pm-action to set PM_DEBUG=true and will grab the log file
if/when it happens again.

Andy, should we just enable this by default for now? I don't expect the log
file is excessively huge, and this would be useful debug data.

--
 - mdz

Andy Whitcroft (apw) wrote :

We get a fair amount of additional data in pm-suspend.log, and my latest branch both fixes detection of this kind of failure and adds reporting of this additional log file.
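Attaching a log only when it exists can be sketched like this. The helper name and the use of a plain dict as a stand-in for the report object are assumptions made for illustration; this is not the code from the branch.

```python
import os


def attach_log_if_present(report, key, path):
    """Add the contents of `path` to `report[key]` only if the file exists.

    `report` is a plain dict standing in for a crash report object.
    Returns True if the log was attached, False if the file was missing.
    """
    if not os.path.isfile(path):
        return False
    # Replace undecodable bytes rather than failing on a damaged log.
    with open(path, errors="replace") as handle:
        report[key] = handle.read()
    return True
```

A caller might use `attach_log_if_present(report, "PmSuspendLog", "/var/log/pm-suspend.log")`, where the key name is made up for this example; skipping missing files keeps the report from failing on systems that never wrote the log.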

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apport - 0.147

---------------
apport (0.147) jaunty; urgency=low

  * bin/apportcheckresume: report the pm-suspend.log/pm-hibernate.log
    from /var/lib.
  * bin/apportcheckresume: only attempt to attach the stress log if its is
    present.
  * bin/apportcheckresume, debian/apport.init: add detection for late
    resume hangs, those where the user thinks the system was working.
    (LP: #335323)

 -- Andy Whitcroft <email address hidden> Mon, 30 Mar 2009 09:47:28 +0200

Changed in apport:
status: In Progress → Fix Released
Andy Whitcroft (apw) wrote :

I believe the apport changes are sufficient to detect and report this late-resume issue, closing out the pm-utils task.

Changed in pm-utils:
status: In Progress → Invalid
Matt Zimmerman (mdz) wrote :

On Mon, Mar 30, 2009 at 10:23:26AM -0000, Andy Whitcroft wrote:
> I believe the apport changes are sufficient to detect and report this
> late-resume issue, closing out the pm-utils task.

Shouldn't pm-utils still clean up its lockfile on shutdown?

--
 - mdz

Andy Whitcroft (apw) wrote :

We will remove that status file when it reports the bug on the next boot. As there is now a hang log, that will trigger a different type of bug. We have a number already. They look like a chvt hang, and there may be a tie-in to jumping to the gdm login screen. I am suspicious that the X server is exiting. We are tracking this on bug #352178.
