Nvidia MCP67 AHCI ata timeout exception with data loss

Bug #343919 reported by Ernst Persson
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: TJ
Milestone: (none)

Bug Description

Binary package hint: linux-image-2.6.28-9-generic

I have now gotten about 10 of these ata timeout exceptions with Jaunty.
It never happened on Intrepid. I also use the disk heavily from Windows and have never seen any problems there.
It has only happened on my ext4 root filesystem, usually during a big dist-upgrade. I would guess that ext4 doesn't have anything to do with the problem; it's just that dist-upgrade is such a big disk cruncher and my Jaunty root filesystem happens to be ext4. But I can't know, of course.

My BIOS has an option to run the SATA support in either AHCI or IDE mode; it has happened with both of them.
I've also tried attaching the hard drive to a different SATA port on my motherboard, but that didn't help.

It's very hard to reproduce. I ran this
while true; do
git clone /home/ernst/test/linux-2.6
sync
rm -r linux-2.6
sync
done
for an hour and it didn't happen.
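
A bounded variant of the loop above could stop as soon as the kernel log shows the timeout, which saves checking dmesg by hand. This is only a sketch: the function names has_ata_timeout and stress_until_timeout are hypothetical, and it assumes that dmesg is readable by the current user and that the log does not already hold an older timeout from before the run.

```shell
# Succeed if the kernel log text on stdin contains a libata timeout exception
# (the "Emask 0x4 (timeout)" marker seen in the log excerpt in this report).
has_ata_timeout() {
    grep -q 'Emask 0x4 (timeout)'
}

# Hypothetical helper: run a workload repeatedly until the kernel log
# reports a timeout or the iteration limit is reached.
# $1 = maximum number of iterations, remaining arguments = workload command.
stress_until_timeout() {
    max_iterations=$1
    shift
    i=0
    while [ "$i" -lt "$max_iterations" ]; do
        "$@"    # the workload, e.g. a clone-and-delete cycle
        sync
        if dmesg | has_ata_timeout; then
            echo "ata timeout seen after $((i + 1)) iterations"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no ata timeout after $max_iterations iterations"
    return 1
}
```

It might be called as: stress_until_timeout 500 sh -c 'git clone /home/ernst/test/linux-2.6 && rm -r linux-2.6'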

lspci -vv (only sata controller)
00:09.0 IDE interface: nVidia Corporation MCP67 AHCI Controller (rev a2) (prog-if 85 [Master SecO PriO])
 Subsystem: ABIT Computer Corp. Device 1c2f
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0 (750ns min, 250ns max)
 Interrupt: pin A routed to IRQ 2296
 Region 0: I/O ports at 09f0 [size=8]
 Region 1: I/O ports at 0bf0 [size=4]
 Region 2: I/O ports at 0970 [size=8]
 Region 3: I/O ports at 0b70 [size=4]
 Region 4: I/O ports at dc00 [size=16]
 Region 5: Memory at fe026000 (32-bit, non-prefetchable) [size=8K]
 Capabilities: [44] Power Management version 2
  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
  Status: D0 PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [8c] SATA HBA <?>
 Capabilities: [b0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable+
  Address: 00000000fee0300c Data: 4189
 Capabilities: [cc] HyperTransport: MSI Mapping Enable+ Fixed+
 Kernel driver in use: ahci

[ 1375.804551] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 1375.804566] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 1375.804568] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1375.804574] ata1.00: status: { DRDY }
[ 1375.804584] ata1: hard resetting link
[ 1376.288035] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1376.297962] ata1.00: configured for UDMA/133
[ 1376.297984] end_request: I/O error, dev sda, sector 476567766
[ 1376.298013] ata1: EH complete
[ 1376.298021] Aborting journal on device sda4:8.
[ 1376.298144] sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors: (250 GB/232 GiB)
[ 1376.298181] sd 0:0:0:0: [sda] Write Protect is off
[ 1376.298186] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 1376.298236] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1376.301499] ext4_abort called.
[ 1376.301504] EXT4-fs error (device sda4): ext4_journal_start_sb: Detected aborted journal
[ 1376.301511] Remounting filesystem read-only

Tags: ext4
Revision history for this message
Ernst Persson (ernstp) wrote :

Now as you can see it's always a "timeout" exception. My first reaction to that is...
1) maybe it could wait a little longer?
2) is it really that bad? Try again?
3) has some timeout value changed between 2.6.27 and 2.6.28?

You can see that it's on different sectors each time.
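
To compare the failing sectors across several saved logs, something like the following could pull them out. It is a sketch that assumes the "end_request: I/O error, dev ..., sector ..." message format shown in the log above; the function name failed_sectors is made up for illustration.

```shell
# Print "<device> <sector>" for every block-layer I/O error line on stdin.
# Lines that do not match the end_request format are ignored.
failed_sectors() {
    sed -n 's/.*end_request: I\/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*/\1 \2/p'
}
```

For example, "dmesg | failed_sectors | sort -u" would list each distinct device and sector pair once.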

"smartctl --all" for the disk looks fine, no errors ever.

Ernst Persson (ernstp) wrote :

The output of smartctl --all /dev/sda4
So, to summarise, the filesystem that's always affected is my /dev/sda4 ext4 filesystem mounted on /.
(I think, I haven't seen the problem with any other filesystem.)

TJ (tj) wrote :

This is the user's hardware listing (copied from a pastebin)

lspci -nn
00:00.0 RAM memory [0500]: nVidia Corporation MCP67 Memory Controller [10de:0547] (rev a2)
00:01.0 ISA bridge [0601]: nVidia Corporation MCP67 ISA Bridge [10de:0548] (rev a2)
00:01.1 SMBus [0c05]: nVidia Corporation MCP67 SMBus [10de:0542] (rev a2)
00:01.2 RAM memory [0500]: nVidia Corporation MCP67 Memory Controller [10de:0541] (rev a2)
00:02.0 USB Controller [0c03]: nVidia Corporation MCP67 OHCI USB 1.1 Controller [10de:055e] (rev a2)
00:02.1 USB Controller [0c03]: nVidia Corporation MCP67 EHCI USB 2.0 Controller [10de:055f] (rev a2)
00:04.0 USB Controller [0c03]: nVidia Corporation MCP67 OHCI USB 1.1 Controller [10de:055e] (rev a2)
00:04.1 USB Controller [0c03]: nVidia Corporation MCP67 EHCI USB 2.0 Controller [10de:055f] (rev a2)
00:06.0 IDE interface [0101]: nVidia Corporation MCP67 IDE Controller [10de:0560] (rev a1)
00:07.0 Audio device [0403]: nVidia Corporation MCP67 High Definition Audio [10de:055c] (rev a1)
00:08.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Bridge [10de:0561] (rev a2)
00:09.0 IDE interface [0101]: nVidia Corporation MCP67 AHCI Controller [10de:0550] (rev a2)
00:0a.0 Ethernet controller [0200]: nVidia Corporation MCP67 Ethernet [10de:054c] (rev a2)
00:0b.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0562] (rev a2)
00:0c.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:0d.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:0e.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:0f.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:10.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:11.0 PCI bridge [0604]: nVidia Corporation MCP67 PCI Express Bridge [10de:0563] (rev a2)
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration [1022:1100]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map [1022:1101]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller [1022:1102]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control [1022:1103]
02:00.0 VGA compatible controller [0300]: nVidia Corporation G70 [GeForce 7600 GT] [10de:0391] (rev a1)

Changed in linux (Ubuntu):
assignee: nobody → intuitivenipple
importance: Undecided → Medium
status: New → In Progress
Ernst Persson (ernstp) wrote :

Ah, now I did reproduce it, pretty quickly too! Now running 2.6.28-10-generic

TJ (tj) wrote :

There was a similar issue affecting the sata_mv driver for Marvell SATA chip-sets where an iport address for interrupt handling was incorrectly set. Although in this case a different driver is used (sata_nv) it is possible something similar to that issue might cause this one.

For reference, that commit is c42fae333255b08b8d4bc03e5853023145208d45 "sata_mv: fix 8-port timeouts on 508x/6081 chips"

Ernst Persson (ernstp) wrote :

Got it with 2.6.28-02062807-generic now also, left
while true; do dpkg -i *.deb; done
running over night with some packages.

Ernst Persson (ernstp) wrote :

It doesn't look like I'm using "sata_nv", more like I'm using the "ahci" driver:

00:09.0 IDE interface: nVidia Corporation MCP67 AHCI Controller (rev a2) (prog-if 85 [Master SecO PriO])
 Subsystem: ABIT Computer Corp. Device 1c2f
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0 (750ns min, 250ns max)
 Interrupt: pin A routed to IRQ 2296
 Region 0: I/O ports at 09f0 [size=8]
 Region 1: I/O ports at 0bf0 [size=4]
 Region 2: I/O ports at 0970 [size=8]
 Region 3: I/O ports at 0b70 [size=4]
 Region 4: I/O ports at dc00 [size=16]
 Region 5: Memory at fe026000 (32-bit, non-prefetchable) [size=8K]
 Capabilities: <access denied>
 Kernel driver in use: ahci

ernst@mammut:~$ dmesg | grep -i sata
[ 1.773681] ata1: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026100 irq 2296
[ 1.773683] ata2: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026180 irq 2296
[ 1.773685] ata3: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026200 irq 2296
[ 1.773687] ata4: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026280 irq 2296
[ 2.092016] ata1: SATA link down (SStatus 0 SControl 300)
[ 2.576012] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2.904014] ata3: SATA link down (SStatus 0 SControl 300)
[ 3.792014] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

ernst@mammut:~$ lsmod | grep sata

ernst@mammut:~$ dmesg | grep -i ahci
[ 1.772244] ahci 0000:00:09.0: version 3.0
[ 1.772636] ahci 0000:00:09.0: PCI INT A -> Link[APSI] -> GSI 23 (level, low) -> IRQ 23
[ 1.772679] ahci 0000:00:09.0: irq 2296 for MSI/MSI-X
[ 1.772736] ahci 0000:00:09.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl IDE mode
[ 1.772738] ahci 0000:00:09.0: flags: 64bit ncq sntf led clo pmp pio
[ 1.772741] ahci 0000:00:09.0: setting latency timer to 64
[ 1.773098] scsi0 : ahci
[ 1.773307] scsi1 : ahci
[ 1.773423] scsi2 : ahci
[ 1.773541] scsi3 : ahci

ernst@mammut:~$ lsmod | grep ahci

ernst@mammut:~$ lsmod | grep ata

TJ (tj) wrote :

ernst, thanks for pointing that out. I think I must have gained tunnel-vision after chasing that sata_mv similarity!

For reference, lsmod will no longer be so reliable for checking modules in Jaunty and later since we're now building a lot of modules into the kernel. sata_nv is one of those built-in to the kernel (grep SATA_NV /boot/config-`uname -r`) with the Ubuntu configured kernels:

grep SATA_NV /boot/config-2.6.28-10-generic
CONFIG_SATA_NV=y

But, the output of lspci plus the dmesg shows the ahci driver in use:

00:09.0 IDE interface [0101]: nVidia Corporation MCP67 AHCI Controller [10de:0550] (rev a2)

[ 1.764211] ahci 0000:00:09.0: version 3.0
[ 1.764595] ACPI: PCI Interrupt Link [APSI] enabled at IRQ 23
[ 1.764605] ahci 0000:00:09.0: PCI INT A -> Link[APSI] -> GSI 23 (level, low) -> IRQ 23
[ 1.764648] ahci 0000:00:09.0: irq 2296 for MSI/MSI-X
[ 1.764702] ahci 0000:00:09.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
[ 1.764705] ahci 0000:00:09.0: flags: 64bit ncq sntf led clo pmp pio
[ 1.764708] ahci 0000:00:09.0: setting latency timer to 64
[ 1.765066] scsi0 : ahci
[ 1.765282] scsi1 : ahci
[ 1.765392] scsi2 : ahci
[ 1.765502] scsi3 : ahci
[ 1.765624] ata1: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026100 irq 2296
[ 1.765626] ata2: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026180 irq 2296
[ 1.765628] ata3: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026200 irq 2296
[ 1.765630] ata4: SATA max UDMA/133 abar m8192@0xfe026000 port 0xfe026280 irq 2296
[ 2.248016] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2.252882] ata1.00: ATA-7: SAMSUNG SP2504C, VT100-33, max UDMA7

This could make the problem easier to find and diagnose: more people will be affected by a bug in ahci, so it might be possible to find more bug reports with the same symptoms and driver.
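
Since lsmod says nothing about built-in drivers, the two checks used in this comment (the kernel config and the driver actually bound to the controller) could be scripted roughly as below. This is a sketch: config_state and bound_driver are hypothetical helper names, and it assumes the usual /boot/config-<version> file and the sysfs "driver" symlink are present.

```shell
# Print "built-in", "module" or "not set" for a kernel config symbol.
# $1 = config file, e.g. /boot/config-$(uname -r)
# $2 = symbol name without the CONFIG_ prefix, e.g. SATA_NV
config_state() {
    case "$(grep "^CONFIG_$2=" "$1" | cut -d= -f2)" in
        y) echo "built-in" ;;
        m) echo "module" ;;
        *) echo "not set" ;;
    esac
}

# Print the name of the driver bound to a PCI device, read from the
# device's sysfs "driver" symlink. $1 = PCI address, e.g. 0000:00:09.0
# Prints an empty line if no driver is bound.
bound_driver() {
    basename "$(readlink "/sys/bus/pci/devices/$1/driver")"
}
```

On the reporter's machine, config_state /boot/config-2.6.28-10-generic SATA_NV should print "built-in" given the grep output above, and bound_driver 0000:00:09.0 would be expected to print "ahci".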

TJ (tj) wrote :

As a point to note right now - may mean something or nothing.

For each error log, no matter which kernel, it is accompanied by:

ext4_journal_start_sb: Detected aborted journal

And the bug reporter tells us that ext3 file-systems don't suffer the same issue.

Ernst Persson (ernstp) wrote :

Guess this should be posted in the kernel bugzilla also.

Ran dpkg straight for 5 hours with 2.6.28-02062807-generic on ext3, no exceptions. Will continue to run it tonight.

Ernst Persson (ernstp) wrote :

Ran 7 hours more on ext3 with 2.6.28-02062807-generic, no exceptions.

Ernst Persson (ernstp) wrote :

Most similar thing I could find on kernel.org:
http://bugzilla.kernel.org/show_bug.cgi?id=11148

TJ (tj) wrote :

Can you try the workaround suggested in comment 8?

That is, add "libata.force=nohrst" to the kernel command line while doing an ext4 test.
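
Before starting a long test run it may be worth confirming that the parameter really reached the running kernel. A minimal sketch, assuming /proc/cmdline is readable; the helper name cmdline_has is hypothetical:

```shell
# Succeed if the kernel command line in $1 contains the exact parameter $2
# as a whole space-separated word.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: cmdline_has "$(cat /proc/cmdline)" libata.force=nohrst && echo "nohrst is active"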

Ernst Persson (ernstp) wrote :

Still happened. Interesting note from the exception though:
[ 1212.290506] Aborting journal on device sda4:8.
[ 1212.293404] EXT4-fs error (device sda4) in ext4_reserve_inode_write: Journal has aborted

Gurubie (gurubie) wrote :

From my saved partition and a chroot with Hardy 8.04 (you could use a live CD), this is how I fixed my fresh Jaunty 9.04 beta install so that I could boot again. I looked everywhere and tried everything. NONE of the busybox commands worked for me. I do NOT have RAID and there was no UUID or /dev naming error. Only the following worked for me, and I still do not know why the newly installed kernel failed to boot after upgrading it. Perhaps it didn't upgrade properly, because once I got back in using chroot I had to run the apt-get fix command it recommends. Unless that was a limit of the chroot, I don't know. This was ugly and I almost gave up.

In a terminal (we're using sudo for root here),
where "X" is your correct partition number, for example /dev/sda2:

sudo mount /dev/sdaX /mnt

(You can cut and paste the commands below)

sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys
sudo cp /etc/resolv.conf /mnt/etc/resolv.conf
sudo chroot /mnt /bin/bash

Then:

apt-get install ................................................etc.

I believe I did a:

apt-get update
(then the suggested fix command and...)

apt-get upgrade

While the kernel didn't appear to be different, I'm not sure that it wasn't a newer fix. It booted. Now, after doing upgrades back in Jaunty again, I'm about to reboot and try the newest kernel (it just installed).

I hope this helps you as much as it did me!

Gurubie (gurubie) wrote :

My comment above......

...and its workaround may be more relevant to the "Alert! /dev/disk"... busybox failure-to-boot problem than to this thread, which was said to be the original duplicate bug. That's why it's here. I do not know if this bug is in fact the cause of being dropped to busybox, but the workaround worked for me and I just wanted to leave a trail for those of us that might wind up here. I posted this (previously VERY hard to find) workaround in the other bug duplicate and in the forums. Hopefully this will cause many to stick with (K)ubuntu. We had no other kernel to boot once the upgrade process nixed the newly installed one. Now we can move ahead.

Obviously, this workaround can be used with any singular kernel boot failure (to get another one). Because SOMETIMES, another clean install wastes time (but clean installs are recommended).

Ernst Persson (ernstp) wrote :

Just for fun I compiled my own kernel with CONFIG_PREEMPT=y
Seems like this stops the error from happening on my system.

Mariusz Domański (mario.7) wrote :

I found somewhere on launchpad.net that adding all_generic_ide to kernel command line can be helpful and it worked in my case with very similar errors (although on Dell XPS m1530).

tags: added: ext4
Leann Ogasawara (leannogasawara) wrote :

Hi Ernst Persson,

Are you still using an ext4 filesystem? If so, do you still see this issue with the latest 2.6.28-15.48 Jaunty kernel? If so, I've backported some additional ext4 related patches and put some test kernels at:

http://people.canonical.com/~ogasawara/ext4-jaunty/

There is currently an amd64 test kernel there but I'll be adding i386 once it finishes building. Please let us know your results if you are able to test.

Also, the latest Karmic 9.10 Alpha images contain a newer 2.6.31 kernel with additional ext4 related patches. If you want to test that, ISO CD images are available at http://cdimage.ubuntu.com/releases/karmic/ . Thanks.

Ernst Persson (ernstp) wrote :

I'm actually leaning towards this being a hardware problem, and I have replaced that hard drive now. It never happened on another disk, and I have a number of other ext4 filesystems now. Closing, but I can't test, of course!

Ernst Persson (ernstp)
Changed in linux (Ubuntu):
status: In Progress → Invalid