sata_nv port resets on 2.6.20-16.29

Bug #129723 reported by Sean Dague
6
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Won't Fix
Undecided
Unassigned
linux-source-2.6.20 (Ubuntu)
Won't Fix
Undecided
Unassigned

Bug Description

Binary package hint: linux-image-generic

Last night, while pushing some data via NFS, and some data via FTP over GigE to my data server (running fiesty, all upgrades). Something causes a port reset on sata_nv controller. This offlined 2 of 6 Seagate 320 GB drives that make up my RAID5 software array, putting the array into a broken state. I force assembled the array to get 5 drives online, which passes e2fsck, and am now in a 2 day rebuild cycle for the 6th drive after reading it.

Attached are the logs of the event. I had previously been running a custom 2.6.18.3 kernel, under all kinds of load, with no issues.

Jul 31 20:02:01 gallifrey -- MARK --
Jul 31 20:10:41 gallifrey kernel: [179384.992000] res 40/00:00:6f:b0:9d/00:00:00:00:00/00 Emask 0x4 (
timeout)
Jul 31 20:10:41 gallifrey kernel: [179384.992000] res 40/00:00:ff:b3:9d/00:00:00:00:00/00 Emask 0x4 (
timeout)
Jul 31 20:10:42 gallifrey kernel: [179385.304000] ata1: soft resetting port
Jul 31 20:10:42 gallifrey kernel: [179385.304000] ata2: soft resetting port
Jul 31 20:10:42 gallifrey kernel: [179385.460000] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 31 20:10:42 gallifrey kernel: [179385.460000] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 31 20:11:12 gallifrey kernel: [179415.468000] ata1.00: qc timeout (cmd 0x27)
Jul 31 20:11:12 gallifrey kernel: [179415.468000] ata1.00: ata_hpa_resize 1: sectors = 625142448, hpa_sectors
= 0
Jul 31 20:11:12 gallifrey kernel: [179415.468000] ata1: failed to recover some devices, retrying in 5 secs
Jul 31 20:11:12 gallifrey kernel: [179415.468000] ata2.00: qc timeout (cmd 0x27)
Jul 31 20:11:12 gallifrey kernel: [179415.468000] ata2.00: ata_hpa_resize 1: sectors = 625142448, hpa_sectors
= 0
Jul 31 20:11:12 gallifrey kernel: [179415.468000] ata2: failed to recover some devices, retrying in 5 secs
Jul 31 20:11:17 gallifrey kernel: [179420.472000] ata1: hard resetting port
Jul 31 20:11:17 gallifrey kernel: [179420.472000] ata2: hard resetting port
Jul 31 20:11:17 gallifrey kernel: [179421.104000] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 31 20:11:17 gallifrey kernel: [179421.104000] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata1.00: qc timeout (cmd 0x27)
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata1.00: ata_hpa_resize 1: sectors = 625142448, hpa_sectors
= 0
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata1.00: limiting speed to UDMA/133:PIO3
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata1: failed to recover some devices, retrying in 5 secs
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata2.00: qc timeout (cmd 0x27)
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata2.00: ata_hpa_resize 1: sectors = 625142448, hpa_sectors
= 0
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata2.00: limiting speed to UDMA/133:PIO3
Jul 31 20:11:47 gallifrey kernel: [179451.112000] ata2: failed to recover some devices, retrying in 5 secs
Jul 31 20:11:52 gallifrey kernel: [179456.116000] ata1: hard resetting port
Jul 31 20:11:52 gallifrey kernel: [179456.116000] ata2: hard resetting port
Jul 31 20:11:53 gallifrey kernel: [179456.748000] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 31 20:11:53 gallifrey kernel: [179456.748000] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jul 31 20:12:14 gallifrey gconfd (sdague-16640): starting (version 2.18.0.1), pid 16640 user 'sdague'
Jul 31 20:12:14 gallifrey gconfd (sdague-16640): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory
" to a read-only configuration source at position 0
Jul 31 20:12:14 gallifrey gconfd (sdague-16640): Resolved address "xml:readwrite:/home/sdague/.gconf" to a wri
table configuration source at position 1
Jul 31 20:12:14 gallifrey gconfd (sdague-16640): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults"
 to a read-only configuration source at position 2
Jul 31 20:12:14 gallifrey gconfd (sdague-16640): Resolved address "xml:readonly:/var/lib/gconf/debian.defaults
" to a read-only configuration source at position 3
Jul 31 20:12:14 gallifrey gconfd (sdague-16640): Resolved address "xml:readonly:/var/lib/gconf/defaults" to a
read-only configuration source at position 4
Jul 31 20:12:20 gallifrey gconfd (sdague-16640): Resolved address "xml:readwrite:/home/sdague/.gconf" to a wri
table configuration source at position 0
Jul 31 20:12:23 gallifrey kernel: [179486.756000] ata1.00: qc timeout (cmd 0x27)
Jul 31 20:12:23 gallifrey kernel: [179486.756000] ata1.00: ata_hpa_resize 1: sectors = 625142448, hpa_sectors
= 0
Jul 31 20:12:23 gallifrey kernel: [179486.756000] ata1.00: disabled
Jul 31 20:12:23 gallifrey kernel: [179486.756000] ata2.00: qc timeout (cmd 0x27)
Jul 31 20:12:23 gallifrey kernel: [179486.756000] ata2.00: ata_hpa_resize 1: sectors = 625142448, hpa_sectors
= 0
Jul 31 20:12:23 gallifrey kernel: [179486.756000] ata2.00: disabled
Jul 31 20:12:24 gallifrey kernel: [179487.260000] ata1: EH complete
Jul 31 20:12:24 gallifrey kernel: [179487.260000] ata2: EH complete
Jul 31 20:12:24 gallifrey kernel: [179487.260000] sd 0:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.260000] end_request: I/O error, dev sda, sector 249584368
Jul 31 20:12:24 gallifrey kernel: [179487.260000] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.260000] end_request: I/O error, dev sdb, sector 249583712
Jul 31 20:12:24 gallifrey kernel: [179487.260000] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.260000] end_request: I/O error, dev sdb, sector 249584280
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sd 0:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.264000] end_request: I/O error, dev sda, sector 249584632
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.264000] end_request: I/O error, dev sdb, sector 249583104
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sd 0:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.264000] end_request: I/O error, dev sda, sector 249584640
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584640 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584648 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584656 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584664 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584672 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584680 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584688 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584696 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584704 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584712 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584720 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584728 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584736 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584744 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584752 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584760 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584768 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584776 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584784 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584792 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584800 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584808 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584816 on s
da).
Jul 31 20:12:24 gallifrey kernel: [179487.264000] raid5:md0: read error not correctable (sector 249584824 on s
da).
...... (a lot of those)
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.264000] end_request: I/O error, dev sdb, sector 249583704
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sd 0:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.264000] end_request: I/O error, dev sda, sector 249583104
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.264000] end_request: I/O error, dev sdb, sector 249584504
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sd 0:0:0:0: SCSI error: return code = 0x00040000
Jul 31 20:12:24 gallifrey kernel: [179487.264000] end_request: I/O error, dev sda, sector 249583704
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sda : READ CAPACITY failed.
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sda : status=0, message=00, host=4, driver=00
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sda : sense not available.
Jul 31 20:12:24 gallifrey kernel: [179487.264000] sda: Write Protect is off
Jul 31 20:12:24 gallifrey kernel: [179487.300000] RAID5 conf printout:
Jul 31 20:12:24 gallifrey kernel: [179487.300000] --- rd:6 wd:4
Jul 31 20:12:24 gallifrey kernel: [179487.300000] disk 0, o:0, dev:sda
Jul 31 20:12:24 gallifrey kernel: [179487.300000] disk 1, o:0, dev:sdb
Jul 31 20:12:24 gallifrey kernel: [179487.300000] disk 2, o:1, dev:sdc
Jul 31 20:12:24 gallifrey kernel: [179487.300000] disk 3, o:1, dev:sdd
Jul 31 20:12:24 gallifrey kernel: [179487.300000] disk 4, o:1, dev:sde
Jul 31 20:12:24 gallifrey kernel: [179487.300000] disk 5, o:1, dev:sdf
Jul 31 20:12:24 gallifrey kernel: [179487.308000] RAID5 conf printout:
Jul 31 20:12:24 gallifrey kernel: [179487.308000] --- rd:6 wd:4
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 1, o:0, dev:sdb
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 2, o:1, dev:sdc
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 3, o:1, dev:sdd
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 4, o:1, dev:sde
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 5, o:1, dev:sdf
Jul 31 20:12:24 gallifrey kernel: [179487.308000] RAID5 conf printout:
Jul 31 20:12:24 gallifrey kernel: [179487.308000] --- rd:6 wd:4
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 1, o:0, dev:sdb
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 2, o:1, dev:sdc
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 3, o:1, dev:sdd
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 4, o:1, dev:sde
Jul 31 20:12:24 gallifrey kernel: [179487.308000] disk 5, o:1, dev:sdf
Jul 31 20:12:24 gallifrey kernel: [179487.324000] RAID5 conf printout:
Jul 31 20:12:24 gallifrey kernel: [179487.324000] --- rd:6 wd:4
Jul 31 20:12:24 gallifrey kernel: [179487.324000] disk 2, o:1, dev:sdc
Jul 31 20:12:24 gallifrey kernel: [179487.324000] disk 3, o:1, dev:sdd
Jul 31 20:12:24 gallifrey kernel: [179487.324000] disk 4, o:1, dev:sde
Jul 31 20:12:24 gallifrey kernel: [179487.324000] disk 5, o:1, dev:sdf
Jul 31 20:12:24 gallifrey kernel: [179487.324000] lost page write due to I/O error on md0
Jul 31 20:12:24 gallifrey last message repeated 9 times
Jul 31 20:12:55 gallifrey kernel: [179518.424000] printk: 5191 messages suppressed.
Jul 31 20:12:55 gallifrey kernel: [179518.424000] lost page write due to I/O error on md0
Jul 31 20:12:55 gallifrey last message repeated 5 times
Jul 31 20:13:00 gallifrey kernel: [179523.408000] printk: 14 messages suppressed.
Jul 31 20:13:00 gallifrey kernel: [179523.408000] lost page write due to I/O error on md0

Revision history for this message
CassieMoondust (cassie-lx) wrote :

Have the same bug with gutsy Kernel 2.6.22-14.46. Mainboard is an ASUS M2N32 WS Professional with the nVidia 590 Chipset.
System freezes with the same messages, after 30 seconds a soft or hard reset on the sata drives occures and the system runs normally until the next freeze.

Revision history for this message
CassieMoondust (cassie-lx) wrote :

I can happily say that the errors are gone after i change an sata calbe. My vendor has build not the original cable from asus into my system at the first sata-drive. ASUS delivers ht M2N32 WS Pro with eight (!) sata cables, so i do not understand why my vendor didn't use this and take another one? The system works now about two weeks 24 hours a day without this error and with two sata drives in a software raid1.
In my case this bug is solved, it is possible bad cables are the reason. If errors occures check your cable first or change them to another (better, high quality) ones, which are sata300 capable.

Revision history for this message
Ralph Janke (txwikinger) wrote :

Have you tried changing the cable?

If that does not work, could you please check this issue against the newest hardy kernel.

Thanks

Changed in linux-source-2.6.20:
status: New → Incomplete
Revision history for this message
kiev1 (sys-sys-admin) wrote :
Revision history for this message
Launchpad Janitor (janitor) wrote : This bug is now reported against the 'linux' package

Beginning with the Hardy Heron 8.04 development cycle, all open Ubuntu kernel bugs need to be reported against the "linux" kernel package. We are automatically migrating this bug to the new "linux" package. However, development has already began for the upcoming Intrepid Ibex 8.10 release. It would be helpful if you could test the upcoming release and verify if this is still an issue - http://www.ubuntu.com/testing . If the issue still exists, please update this report by changing the Status of the "linux" task from "Incomplete" to "New". We appreciate your patience and understanding as we make this transition. Thanks!

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are one of two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreicate your help and feedback.

Revision history for this message
Sean Dague (sdague) wrote :

The hardware that I had that had this issue met an untimely death in a lightning storm, so I can't verify any fixes any more for this bug.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Sean,

Thanks for the update. I'll go ahead and close this report since you no longer have the hardware available to debug/test with.

Kiev1 - from comments here it seems you may be the only one who might still be experiencing issues. But I'll follow up with you at the other bug report you filed. Thanks.

Changed in linux:
status: Incomplete → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.