2.6.38-10 panic after ejecting drive

Bug #793796 reported by steubens
This bug affects 3 people
Affects              Status         Importance   Assigned to            Milestone
linux (Ubuntu)       Fix Released   Undecided    Unassigned
Natty                Fix Released   Undecided    Herton R. Krzesinski
Oneiric              Fix Released   Undecided    Unassigned
linux-2.6 (Debian)   Fix Released   Unknown

Bug Description

With 2.6.38-10 (and .39; it doesn't happen with 2.6.38-9, so does that mean it's a regression?) the kernel panics after ejecting a drive under most circumstances. To reproduce: run cat /dev/zero > /media/junk/junkfile, kill it after a few seconds, then use "safely remove drive" in nautilus for that same drive; this causes the panic.
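
The write phase of the reproduction above can be sketched as a small shell helper. This is only a sketch under assumptions: `write_junk` is a made-up name, the default path is the one from the description, and the 5-second default stands in for "a few seconds". The final "safely remove drive" step is still done by hand in nautilus.

```shell
#!/bin/sh
# Sketch of the write phase of the reproduction; write_junk is a hypothetical helper.
# Usage: write_junk [target-file] [seconds]
write_junk() {
    target=${1:-/media/junk/junkfile}   # path taken from the bug description
    secs=${2:-5}                        # "a few seconds"; the exact value is an assumption

    cat /dev/zero > "$target" &         # start filling the file in the background
    pid=$!
    sleep "$secs"
    kill "$pid" 2>/dev/null             # stop the writer
    wait "$pid" 2>/dev/null             # reap it so no stray process is left behind

    echo "wrote to $target for ${secs}s; now use \"safely remove drive\" in nautilus"
    return 0
}
```

Calling `write_junk` with no arguments uses the defaults; `write_junk /media/junk/junkfile 20` would match the longer wait mentioned later in the thread.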

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image-2.6.38-10-generic 2.6.38-10.44
ProcVersionSignature: Ubuntu 2.6.38-9.43-generic 2.6.38.4
Uname: Linux 2.6.38-9-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: CONEXANT Analog [CONEXANT Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ohsix 1739 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xd4700000 irq 45'
   Mixer name : 'Intel Cantiga HDMI'
   Components : 'HDA:14f15051,103c360b,00100000 HDA:80862802,80860101,00100000'
   Controls : 18
   Simple ctrls : 8
Date: Mon Jun 6 16:27:53 2011
HibernationDevice: RESUME=UUID=0476e25b-b0ac-46c5-9e1b-0c43dc60faa1
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101007)
MachineType: Hewlett-Packard Compaq Presario CQ60 Notebook PC
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-9-generic root=UUID=d332acb1-e554-4637-84fe-731b5e39212b ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-2.6.38-9-generic N/A
 linux-backports-modules-2.6.38-9-generic N/A
 linux-firmware 1.52
SourcePackage: linux
UpgradeStatus: Upgraded to natty on 2011-03-17 (81 days ago)
dmi.bios.date: 12/15/2010
dmi.bios.vendor: Hewlett-Packard
dmi.bios.version: F.65
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: 3612
dmi.board.vendor: Hewlett-Packard
dmi.board.version: 09.67
dmi.chassis.asset.tag: Chassis Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: Hewlett-Packard
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnHewlett-Packard:bvrF.65:bd12/15/2010:svnHewlett-Packard:pnCompaqPresarioCQ60NotebookPC:pvrPCID:rvnHewlett-Packard:rn3612:rvr09.67:cvnHewlett-Packard:ct10:cvrChassisVersion:
dmi.product.name: Compaq Presario CQ60 Notebook PC
dmi.product.version: PCID
dmi.sys.vendor: Hewlett-Packard

Revision history for this message
steubens (steubens) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Andy Whitcroft (apw)
tags: added: regression-update
Andy Whitcroft (apw) wrote :

Note that a picture of the panic is attached.

Herton R. Krzesinski (herton) wrote :

Seems like the issue lies in something related to the block/scsi changes.

@steubens, can you do the following, so we can understand better and isolate the problem:

- Install amd64 kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.38.7-natty/ , and confirm the issue still happens with a pristine 2.6.38.7
- If 2.6.38.7 fails too, then please try installing 2.6.38.8 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.38.8-natty/

If 2.6.38.8 also fails, then we had better identify which commit is causing the issue (by bisecting the current changes); in this case please install and test the kernel from http://people.canonical.com/~herton/lp793796/, which is the first bisect step. Otherwise, if 2.6.38.8 works, I'll try to build a new kernel with the probable fix.

Please test and let us know the results, and just ask if you have any doubts. Thanks!

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Either there are various device drivers all over with related bugs or this is a bug in common code they all use.

Look at the JPG attachment of the panic: "scheduling while atomic" during __do_softirq

LP: #790054 during bcmwl wifi driver __do_softirq "BUG: scheduling while atomic" goes off.

http://lkml.org/lkml/2011/5/13/393
BUG: scheduling while atomic 2.6.39-rc7 (iwl3945_irq_tasklet)
During iwl3945 wifi driver __do_softirq he gets "scheduling while atomic"

http://lkml.org/lkml/2011/5/2/135
Re: rtlwifi: regression 39-rc5 (rtl8192ce)
During rtlwifi driver __do_softirq he gets "scheduling while atomic"

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Head of thread on recent RCU updates:
http://lkml.org/lkml/2011/6/8/321

from that:

4. Restore checks for blocking within RCU read-side critical sections under CONFIG_PROVE_RCU.

More commentary about that here:
https://lkml.org/lkml/2011/6/4/115
http://lkml.org/lkml/2011/6/8/354
http://paulmck.livejournal.com/27219.html
http://kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/RCUdp.html

nutznboltz (nutznboltz-deactivatedaccount) wrote :

Hmm, there may be a plethora of rcu bugs in individual drivers as well as global ones?

http://www.spinics.net/lists/linux-wireless/msg69597.html

Herton R. Krzesinski (herton) wrote :

The "scheduling while atomic" should be happening in this case because a panic happens in atomic context, and the panic calls its notifier list which contains drm_fb_helper_panic (the switch to console on modesetting framebuffer).

drm_fb_helper_panic calls drm_fb_helper_force_kernel_mode->drm_crtc_helper_set_config

drm_crtc_helper_set_config has kzalloc calls with GFP_KERNEL, which means __GFP_WAIT is set. This makes the allocation routines call might_sleep, which calls _cond_resched and may end up in schedule()... in this case schedule was called, so you get this trace.

One fix for this would be to turn the kzalloc calls into GFP_ATOMIC allocations, so we avoid kzalloc calling schedule. Anyway, this is not the regression reported here; the scheduling while atomic just makes it harder to see the real problem (the panic that should have scrolled up on the screen).
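
To find the allocations in question, a minimal sketch is to list every line in the source file that mentions GFP_KERNEL. `list_gfp_kernel` is a made-up helper name, and the matches are only candidates: each one still has to be checked by hand to see whether it is reachable from the panic path.

```shell
#!/bin/sh
# Sketch: list lines in a C source file that mention GFP_KERNEL, with line numbers.
# list_gfp_kernel is a hypothetical helper name.
list_gfp_kernel() {
    grep -n 'GFP_KERNEL' "$1"
}

# Example, from the top of a kernel source tree:
#   list_gfp_kernel drivers/gpu/drm/drm_crtc_helper.c
```

Matching on the flag rather than on `kzalloc(` is deliberate, since the allocation call and its flag argument may sit on different lines.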

Herton R. Krzesinski (herton) wrote :

@steubens, any success running the tests I requested in comment #3? I tried to reproduce your problem but was unable to, so it would be great (and really necessary) if you could do the tests, especially since you can reproduce it easily.

steubens (steubens) wrote :

i'll see to running them asap, as to the other comments; as far as i understand the "panic" from the mode switch is scheduling while atomic, but the error that caused the mode switch (to display the other panic) is from the elv_ / block layer further down the stack

steubens (steubens) wrote :

ok, both the .7 and the .8 kernel do panic; just as easily

i'm using a 1tb bus powered usb drive and i wait for at least 20 seconds while cat writes to the junk file, it does it with my other usb drive as well

Herton R. Krzesinski (herton) wrote :

@steubens: ok, let's try to bisect which change broke things in your case. Can you try the first kernel at http://people.canonical.com/~herton/lp793796/ and tell me if it works or gives the same issue?

steubens (steubens) wrote :

ok i will asap, how many steps does it say is left and where did you start/end the bisect; i can probably see about running the entire series locally (also going to see if it affects my netbook :)

steubens (steubens) wrote :

that kernel you prepared also panics; i tried automating the tests with udisks --unmount and --eject, but it wouldn't do it, so if you need to reproduce the test be sure to use "safely remove drive" from nautilus; i don't know what the distinction is but that's what i've been using

steubens (steubens) wrote :

i tried to reproduce on my netbook and it wasn't doing it; there's not a lot of ram in it and it isn't fast enough to really stack up a lot of i/o though; probably meaningless ;]

Herton R. Krzesinski (herton) wrote :

@steubens: ok, please try the next bisect step: http://people.canonical.com/~herton/lp793796/b2/

for this, git says:
Bisecting: 44 revisions left to test after this (roughly 6 steps)

current bisect log:
git bisect start
# good: [1463cecc1d1b35fb25ae11051184cc19f0254b32] UBUNTU: Ubuntu-2.6.38-9.43
git bisect good 1463cecc1d1b35fb25ae11051184cc19f0254b32
# bad: [cefe94dc5d171940edd23081d9d481dc1ed5824b] UBUNTU: Ubuntu-2.6.38-10.44
git bisect bad cefe94dc5d171940edd23081d9d481dc1ed5824b
# bad: [03f56cfce0cfc047045c003e181d09f89b6956e0] XZ decompressor: Fix decoding of empty LZMA2 streams
git bisect bad 03f56cfce0cfc047045c003e181d09f89b6956e0

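
For readers unfamiliar with the mechanics, here is a toy sketch of a bisect on a throwaway repository. Nothing in it touches a kernel tree: the eight commits, the `state` file, and the `demo_bisect` name are all invented for illustration. It uses `git bisect run`, which only works when the test can be scripted; in this bug each step needed a kernel build and a manual eject from nautilus, so the marks above were entered by hand instead.

```shell
#!/bin/sh
# Toy git bisect on a throwaway repository (hypothetical data, not a kernel tree).
# Prints the subject of the first bad commit.
demo_bisect() (
    repo=$(mktemp -d) || exit 1
    cd "$repo" || exit 1
    git init -q . 2>/dev/null

    # Hypothetical identity so commits work without any global git config.
    GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
    GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
    export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

    # Eight linear commits; the "regression" first appears in commit 6.
    i=1
    while [ "$i" -le 8 ]; do
        echo "rev $i" > rev
        if [ "$i" -ge 6 ]; then echo panic > state; else echo ok > state; fi
        git add rev state
        git commit -q -m "commit $i"
        i=$((i + 1))
    done

    good=$(git rev-list --max-parents=0 HEAD)      # root commit: known good
    git bisect start HEAD "$good" >/dev/null 2>&1  # HEAD: known bad
    # Exit status 0 marks a revision good; 1 marks it bad.
    git bisect run sh -c '! grep -q panic state' >/dev/null 2>&1

    git log -1 --format=%s refs/bisect/bad         # first bad commit found
    cd / && rm -rf "$repo"
)
```

Running `demo_bisect` prints the subject line of the commit that introduced the `panic` marker.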
steubens (steubens) wrote :

ok b2 works, does not panic, tried several times where it would only take one on the bad versions

Herton R. Krzesinski (herton) wrote :

Thanks for the testing. Marked as good:

# good: [e350f7cf12149a8c79095268e1b6da14901fa754] m68k/mm: Set all online nodes in N_NORMAL_MEMORY
git bisect good e350f7cf12149a8c79095268e1b6da14901fa754

Bisecting: 22 revisions left to test after this (roughly 5 steps)

@steubens: please try next step at http://people.canonical.com/~herton/lp793796/b3/

Steve Conklin (sconklin) wrote :

It looks from the logs like this machine is running as a vmware host. Could you please check whether the problem still happens when you are not running vmware?

Thanks

steubens (steubens) wrote :

b3 panics, sorry for the delay, it's my daily driver

Steve: i've never launched it on -10 so no modules are loaded, some daemons start though, could that still matter?

Herton R. Krzesinski (herton) wrote :

# bad: [a02e3b98ac916d3d544c55f644c566aae2ad1918] i2c-parport: Fix adapter list handling
git bisect bad a02e3b98ac916d3d544c55f644c566aae2ad1918

Bisecting: 10 revisions left to test after this (roughly 4 steps)
[e1594df7d3d13d12036bd5fb40d13037f9f635dd] iwlegacy: fix tx_power initialization

@steubens: next bisect step now at http://people.canonical.com/~herton/lp793796/b4/ , please try that one

Steve Conklin (sconklin) wrote :

Yes, the modules are still loaded. Look at ProcModules.txt above, gathered from the system. There are vm* modules loaded, which are from vmware.

steubens (steubens) wrote :

yea but i made the report from -9 ... but i'll make sure they're not loaded when i test b4

steubens (steubens) wrote :

jeeze, launchpad is not pleasant over gprs; tried b4, it does not panic.

i also verified that no vmware modules were loading on -10, as i've never started it there, kernel panicked quite quickly after i installed it & started regular routine stuff :]

Steve Conklin (sconklin) wrote :

ok, thanks for the update. That's valuable information. The best thing we can do is keep bisecting.

Herton R. Krzesinski (herton) wrote :

Progressing forward with the bisect:

# good: [e1594df7d3d13d12036bd5fb40d13037f9f635dd] iwlegacy: fix tx_power initialization
git bisect good e1594df7d3d13d12036bd5fb40d13037f9f635dd

Bisecting: 5 revisions left to test after this (roughly 3 steps)
[a29ae2d0b5e069c4226b69a70d70198d809a6489] mpt2sas: prevent heap overflows and unchecked reads

@steubens: please try next step at http://people.canonical.com/~herton/lp793796/b5/

steubens (steubens) wrote :

b5 does not panic, sorry for the delay

Herton R. Krzesinski (herton) wrote :

At this point I suspect one of these two changes caused the regression, as I suspected initially:
scsi_dh: fix reference counting in scsi_dh_activate error path
put stricter guards on queue dead checks

Although 2.6.38.8 has a fix for the second one, the testing in comment #10 showed that it did not help.

Let's at least finish the bisect, to confirm which change introduced the problem:

# good: [a29ae2d0b5e069c4226b69a70d70198d809a6489] mpt2sas: prevent heap overflows and unchecked reads
git bisect good a29ae2d0b5e069c4226b69a70d70198d809a6489

Bisecting: 2 revisions left to test after this (roughly 2 steps)
[fbe388be5c48ecfef898567858e5d8f8ab2396b9] ALSA: HDA: Fix automute for Gateway NV79

@steubens: please try next bisect step at http://people.canonical.com/~herton/lp793796/b6/

steubens (steubens) wrote :

ok b6 panics; but i noticed the stack was different this time, it never makes the elv_ call it's all in scsi stuff

i wasn't paying much attention to the actual output with the other ones, when i have access to a notcrap camera again i'll see if they're different as well

Herton R. Krzesinski (herton) wrote :

Thanks for testing, next step:

# bad: [fbe388be5c48ecfef898567858e5d8f8ab2396b9] ALSA: HDA: Fix automute for Gateway NV79
git bisect bad fbe388be5c48ecfef898567858e5d8f8ab2396b9

Bisecting: 0 revisions left to test after this (roughly 1 step)
[39a0cfed63b656486fb2feee063aa033816a90e0] put stricter guards on queue dead checks

@steubens: Please test this step at http://people.canonical.com/~herton/lp793796/b7/

steubens (steubens) wrote :

b7 panics

Herton R. Krzesinski (herton) wrote :

Ok, should be last step now:

# bad: [39a0cfed63b656486fb2feee063aa033816a90e0] put stricter guards on queue dead checks
git bisect bad 39a0cfed63b656486fb2feee063aa033816a90e0

Bisecting: 0 revisions left to test after this (roughly 0 steps)
[048e21a4c8f39ccbbbf6b3f7cb65eff1fe209f8b] scsi_dh: fix reference counting in scsi_dh_activate error path

@steubens: please test final step at http://people.canonical.com/~herton/lp793796/b8/

steubens (steubens) wrote :

b8 does not panic

the last 3 or so revs that haven't panicked seemed to have markedly better io throughput to the drive i am using, compared to the ones that panic; but b8 also gave me an io error on the junk file i've been using on the drive; i'll inspect it and see if it wasn't just the drive asap

Herton R. Krzesinski (herton) wrote :

Ok, bisect found this commit as the faulty one:
put stricter guards on queue dead checks

2.6.38.8 has a bug fix for this specific commit, but as your testing in comment #10 showed, you still have the problem even with 2.6.38.8, so another fix is probably required. Just to confirm, I'll cherry-pick the fix from 2.6.38.8 and ask you to test a kernel with it.

steubens (steubens) wrote :

ok, i also checked that drive and it was fine; but the io error was probably a fluke

Herton R. Krzesinski (herton) wrote :

To clear up the situation here, this is what we have so far (note, commit hashes are from the Ubuntu natty git kernel tree):

Linux 2.6.38-10.44, which is based on 2.6.38.7, has the regression reported here. Bisecting it, we found that the regression is related to the patch

commit 39a0cfed63b656486fb2feee063aa033816a90e0
put stricter guards on queue dead checks

This commit has a follow up fix, which is already in the Linux 2.6.38-10.44 kernel in Ubuntu:

commit 57bd324dbd799b271cad945224df5a21b151297b
fix oops in scsi_run_queue()

So even with this follow-up fix, we still have the crash. The fix that is in the upstream 2.6.38.8 kernel and could help us here is the commit titled "block: add proper state guards to __elv_next_request", but testing on 2.6.38.8 shows that the regression still happens (comment #10). So we have to investigate further, but meanwhile the commits 39a0cfed63b656486fb2feee063aa033816a90e0 and 57bd324dbd799b271cad945224df5a21b151297b can be reverted to avoid the regression.
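
A minimal sketch of such a revert, assuming a checkout of the Ubuntu natty git kernel tree where those hashes exist. `revert_in_order` is a hypothetical helper; it reverts one commit at a time so the order is explicit, with the follow-up fix undone before the commit it fixes.

```shell
#!/bin/sh
# Sketch: revert a list of commits one at a time, in the order given.
# revert_in_order is a hypothetical helper; run it inside a git work tree.
revert_in_order() {
    for commit in "$@"; do
        # --no-edit keeps the default "Revert ..." message without opening an editor.
        git revert --no-edit "$commit" || return 1
    done
}

# Intended use in the natty tree (hashes from the comment above):
# follow-up fix first, then the original change.
#   revert_in_order 57bd324dbd799b271cad945224df5a21b151297b \
#                   39a0cfed63b656486fb2feee063aa033816a90e0
```

If either revert hits a conflict, the helper stops and leaves the tree for manual resolution.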

Herton R. Krzesinski (herton) wrote :

@steubens: can you try the kernel at http://people.canonical.com/~herton/lp793796/r1/ ?

Report here if you still get the crash with it. If you get the crash, please try to get the oops info/picture of the screen and post here.

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Natty):
status: New → In Progress
assignee: nobody → Herton R. Krzesinski (herton)
Changed in linux (Ubuntu Oneiric):
status: Confirmed → Fix Released
steubens (steubens) wrote :

r1 still panics, that's a bit disheartening

Herton R. Krzesinski (herton) wrote :

Sorry about that, I picked the wrong fix to be tested.

@steubens: can you try the kernel I put at http://people.canonical.com/~herton/lp793796/r2/ ? I expect it to fix the issue, please report the results here.

steubens (steubens) wrote :

ok, r2 panics

i'm thinking that the different stacks in one of the panics (that i noticed) might mean there are 2 problems with the same steps to reproduce, if you can give me instructions on how to run the series again i can bang it out here and mind the different panic signatures

only thing i'm fuzzy on is how to make .debs from git or what tree to pull from if not mainline

also, am i understanding correctly that whatever b8 was, could be the new start point to skip a handful of steps?

Herton R. Krzesinski (herton) wrote :

@steubens: ok, both the scsi fix and the block fix from comment #36 should be needed (without one or the other you can hit problems at different locations), but from the trace it seems you're not reaching the conditions that they fix. I think you need at least one more fix. I built a kernel with the probable fix, but first, as a sanity check, please test the kernel at:

http://people.canonical.com/~herton/lp793796/r3/

It contains the previous fix plus the block fix I requested you to test on comment #36, but you still should see the oops with it. Then please test next kernel at:

http://people.canonical.com/~herton/lp793796/r4

The r4 kernel contains an additional fix I made which should address the problem shown on the last stack trace you posted, let me know your results.

steubens (steubens) wrote :

r3 and r4 panic

steubens (steubens) wrote :
Herton R. Krzesinski (herton) wrote :

@steubens: ok, judging by the trace, the r4 kernel really did fix the problem from the previous trace (thanks for the pictures, they were valuable to check), but you hit another crash. I made an additional fix; please test the kernel I uploaded at http://people.canonical.com/~herton/lp793796/r5/

mcelrath (bob+launchpad) wrote :

I just saw this bug last night after upgrading to the 2.6.38-10.46 kernel on Natty. For me it *appears* to manifest as a panic talking to my DVD-ROM, after I had done some I/O to a USB device and ejected it (I used the nautilus "safely remove device" on the USB stick). The panic occurred several hours later. The panic wasn't recorded in my logs, but some I/O errors immediately prior to the panic were (attached). Note that when this happened the computer was idle, and the DVD-ROM having the I/O errors was not used *at all*. (there's not even a disc in the drive). There is another disk on the same PATA bus that should have had I/O though.

Also, I'd like to offer my kudos to steubens and Herton, you guys are doing a great job of communicating and isolating the problem with the kernel bisections (which I know is tedious). I've had some argumentative encounters on launchpad, but you guys are doing it right! Great job!

Herton R. Krzesinski (herton) wrote :

@mcelrath: it seems you have a different bug, possibly related to libata. If you can get the oops, please open a new bug report with 'ubuntu-bug -p linux' from a terminal on the affected machine, thank you.

Herton R. Krzesinski (herton) wrote :

Hi steubens, any success in testing the kernel from comment #43?

steubens (steubens) wrote :

r5 does not panic after 3 trials; currently running the kernel :]

Herton R. Krzesinski (herton) wrote :

@steubens: that's good news, thanks for your testing so far. While preparing to report this upstream (along with the fixes), I found another bug for the same problem (https://bugzilla.kernel.org/show_bug.cgi?id=38842). There is a new patch available for the problem, and I would like you to test and validate it. Can you try the kernel at http://people.canonical.com/~herton/lp793796/r6/ and report here whether everything is still ok with it?

steubens (steubens) wrote :

r6 panics

Changed in linux-2.6 (Debian):
status: Unknown → Confirmed
Francisco Castillo (panchokoster) wrote :

same thing in Ubuntu Oneiric beta, kernel 3.0.0-12

Changed in linux-2.6 (Debian):
status: Confirmed → Fix Released
Herton R. Krzesinski (herton) wrote :

The reverts have been applied on Natty for some time now, so I'm leaving this as Fix Released.

Changed in linux (Ubuntu Natty):
status: In Progress → Fix Released