Port slow to respond on SiI3512 with sata_sil

Bug #159521 reported by Richard Appleby
Affects                       Status        Importance  Assigned to  Milestone
linux (Ubuntu)                Fix Released  Medium      Unassigned
linux-source-2.6.22 (Ubuntu)  Won't Fix     Undecided   Unassigned

Bug Description

Binary package hint: linux-source-2.6.22

Fresh install of the latest Ubuntu 7.10 Server i386 (Gutsy), with all fixes applied.

uname -a:
Linux house 2.6.22-14-server #1 SMP Sun Oct 14 23:34:23 GMT 2007 i686 GNU/Linux

System is a Jetway J7F4, with VIA raid on board, and a Silicon Image-based PCI Sata-raid card. System has 3x500GB sata drives installed, 2 on the motherboard, and one on the PCI card. The system was installed from a USB CD-drive, as it does not have a permanently attached CD-ROM drive.

The drives were partitioned as 1GB, 512MB and "the rest" - about 499GB. Using software raid these were then formed into a 3 partition RAID1 set for /boot, a 3 partition RAID1 set for swap, and a 3 partition RAID5 set for / respectively.

Install appeared to go normally, and on first reboot the raid arrays were rebuilt. Some errors from the ata driver (?) were reported on the console, but apart from significant slow downs in the rebuild rate (drops from nearly 50MB/s to less than 8MB/s) there appeared to be no problems. System was then lightly used for a couple of days (some minor initial configuration work) and again I noticed a very occasional error message on the console.
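
As a rough check on the layout described above, the usable size of each array can be worked out from the partition sizes. This is only a sketch: the helper name usable_gb is made up for illustration, the sizes are the approximate figures quoted in the report, and metadata overhead is ignored.

```python
def usable_gb(level: int, members: int, size_gb: float) -> float:
    """Approximate usable capacity of a software RAID set (overhead ignored)."""
    if level == 1:
        # RAID1 mirrors every member, so capacity equals one member.
        return size_gb
    if level == 5:
        # RAID5 spends one member's worth of space on parity.
        return (members - 1) * size_gb
    raise ValueError(f"unsupported RAID level: {level}")


# Approximate partition sizes from the report: 1GB, 512MB and about 499GB.
layout = {
    "/boot": usable_gb(1, 3, 1.0),
    "swap": usable_gb(1, 3, 0.5),
    "/": usable_gb(5, 3, 499.0),
}

for mount, size in layout.items():
    print(f"{mount}: about {size:g} GB usable")
```

Under these assumptions the RAID5 set for / offers roughly twice the size of one 499GB partition, and it is also the array that loses redundancy as soon as one drive is dropped.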

Stupidly, I didn't take note of the exact errors, but on examining my kern.log, I can see that they would have been related to errors such as the following (extracted from that file), which always relate to ata1, which is the sata drive plugged into the PCI raid card (sda) :

Oct 31 12:08:51 house kernel: [ 318.940000] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Oct 31 12:08:51 house kernel: [ 318.940000] ata1.00: cmd c8/00:00:b8:27:52/00:00:00:00:00/ea tag 0 cdb 0x0 data 131072 in
Oct 31 12:08:51 house kernel: [ 318.940000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 31 12:08:51 house kernel: [ 319.270000] ata1: soft resetting port
Oct 31 12:08:51 house kernel: [ 319.430000] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 31 12:08:51 house kernel: [ 319.490000] ata1.00: configured for UDMA/100
Oct 31 12:08:51 house kernel: [ 319.490000] ata1: EH complete
Oct 31 12:08:51 house kernel: [ 319.510000] sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB)
Oct 31 12:08:51 house kernel: [ 319.520000] sd 0:0:0:0: [sda] Write Protect is off
Oct 31 12:08:51 house kernel: [ 319.520000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Oct 31 12:08:51 house kernel: [ 319.550000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
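
For reference, the SStatus and SControl values in the "SATA link up" line above can be decoded by hand. The sketch below assumes the standard SATA register layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and that the kernel prints these values in hexadecimal without a prefix. The names decode_sstatus and SPD_NAMES are illustrative only and are not part of any kernel interface.

```python
def decode_sstatus(value: int) -> dict:
    """Split a SATA SStatus-style register value into its three fields."""
    return {
        "det": value & 0xF,         # bits 3:0, device detection
        "spd": (value >> 4) & 0xF,  # bits 7:4, negotiated link speed
        "ipm": (value >> 8) & 0xF,  # bits 11:8, interface power state
    }


# Speed encodings assumed here: 1 = first generation, 2 = second generation.
SPD_NAMES = {0: "no link", 1: "1.5 Gbps", 2: "3.0 Gbps"}

# Value taken from the log line "SATA link up 1.5 Gbps (SStatus 113 SControl 310)".
fields = decode_sstatus(int("113", 16))
print(fields, SPD_NAMES.get(fields["spd"], "unknown"))
```

With the value from the log, SStatus 113 splits into DET=3, SPD=1 and IPM=1, which is consistent with the 1.5 Gbps link the kernel reports. SControl uses the same field positions, but there the fields express requests and limits rather than current state.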

Then, last night I tried to copy some 12-15GB of data to the system, and I noticed *many* errors being echoed to the console, again all related to ata1. At some point in the process it appears that the ata driver was unable to reset the port, even using a hard reset, and the drive was "disabled", which caused the software raid system to remove that drive's partitions from the raid sets. Fortunately the system continued to run on the other drives, but I couldn't get the ata1 drive up again. I needed to reboot the box to regain access to the drive. I left the system rebuilding the raid sets in single-user mode this morning ... no errors were apparent on the log or console at that time, but I will add anything I find when I get home this evening.
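
To confirm that the errors really are confined to one port, the kern.log lines can be tallied per ATA port. The following is a minimal sketch, assuming the message formats shown in the excerpt above. The "hard resetting port" pattern is an assumption made by analogy with the "soft resetting port" line, and the names tally, EXCEPTION_RE and RESET_RE are hypothetical.

```python
import re
from collections import Counter

# Matches error-handler lines such as
# "ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen".
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")
# Matches "ata1: soft resetting port"; the "hard" variant is assumed.
RESET_RE = re.compile(r"\b(ata\d+): (soft|hard) resetting port")


def tally(lines):
    """Count exceptions per port and resets per (port, kind)."""
    exceptions, resets = Counter(), Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            exceptions[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[(match.group(1), match.group(2))] += 1
    return exceptions, resets


# Two lines copied from the kern.log excerpt in this report.
sample = [
    "Oct 31 12:08:51 house kernel: [ 318.940000] ata1.00: exception Emask 0x0 "
    "SAct 0x0 SErr 0x0 action 0x2 frozen",
    "Oct 31 12:08:51 house kernel: [ 319.270000] ata1: soft resetting port",
]

exceptions, resets = tally(sample)
print(dict(exceptions), dict(resets))
```

To run it against a whole log, pass an open file object instead of the sample list, for example `tally(open("/var/log/kern.log"))`.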

This problem looks similar to several other bugs in the system, though there are differences between this and them, as follows:
https://bugs.launchpad.net/ubuntu/+bug/64587 (and duplicates) ... their discussion seems to indicate that a CD or DVD drive is involved
https://bugs.launchpad.net/ubuntu/+bug/84603 (and duplicates) ... again, discussion seems to indicate that a CD or DVD drive is involved
https://bugs.launchpad.net/ubuntu/+bug/75295 (and duplicates) ... again, discussion seems to indicate that a CD or DVD drive is involved
https://bugs.launchpad.net/ubuntu/+bug/103277 ... again, discussion seems to indicate that a CD or DVD drive is involved
https://bugs.launchpad.net/ubuntu/+bug/121612 ... problems with sata_sil reported, but not a good match to the symptoms I see
http://bugzilla.kernel.org/show_bug.cgi?id=8316 ... seems closely related to 84603, and again seems focused on a CD or DVD drive being involved.

Will attach kern.log (from time of install to fail of raid system last night), the output from lspci -vv and hdparm -I next.

Richard Appleby (disposable01) wrote :

System completed its rebuild of the raidsets in single-user mode with no errors of any description. Looks like there may be some correlation between usage, and also whether or not the system is operating in single-user or multi-user mode, and the errors.

Also, I forgot to add:

root@house:/# cat /proc/version_signature
Ubuntu 2.6.22-14.46-server

Let me know what else I can do to help debug this. Thanks.

Richard Appleby (disposable01) wrote :

In case it helps, I checked the SMART status of the drives too. Only the drive associated with ata1 has errors in its SMART error log. However, the drive is reporting as healthy, and running a long selftest produced no errors, and no change in the "healthy" state of the drive according to SMART. I don't know enough about this to understand if the errors in the log are causing the problems I'm seeing, or are a result of them. I've attached the output from running "sudo smartctl --all" for each of the drives to this post in the hope it will be of use.

Richard Appleby (disposable01) wrote :

I went back to 7.04 server (feisty) and it behaves the same way, so this doesn't look like something that's new into 7.10 server.

Also, thanks to running a s/w raid environment, I have some freedom to swap cabling around to eliminate possible hardware errors. So far I have swapped the drives around, and the errors stay with the drive connected to the PCI sata raid card (eliminating the possibility of this being due to a hardware error in one of the drives). I have also swapped ports on the PCI sata raid card, but this time the errors moved to the other port; which really doesn't indicate much.

I think I need to eliminate the possibility of the PCI raid card being faulty next, so I'll try a copy of FC7, and see how that goes, but given the common kernel / driver sources across Linux distros, I suspect I may need to install something like Windows on it to see if the card is really OK or not. All very frustrating.

Richard Appleby (disposable01) wrote :

FC7 took an eternity to install (over 5 hours) and although I can't be sure (no logs that I could find from the install process), I suspect it was struggling with the same disk "freeze" problems that I've been experiencing under Ubuntu. When it finally finished installing, I rebooted, and it Kernel oops'd because it couldn't bring the raid array online - another hint that it was having big problems with the disks.

At this point I gave up and went to Windows. Which works flawlessly - or at least it does so far ... I'm still throwing files around, but it's certainly looking like the hardware is fine, but that there is a problem somewhere in the Linux support for it. I think I'm out of options on how to move this forward myself now.

I need some help please, as I really don't want to have to build this server around Windows! :-)

Leann Ogasawara (leannogasawara) wrote :

The Hardy Heron Alpha2 release will be coming out soon. It will have an updated version of the kernel. It would be great if you could test with this new release and verify if this issue still exists. I'll be sure to update this report when Alpha2 is available. Thanks!

Changed in linux:
status: New → Incomplete
Leann Ogasawara (leannogasawara) wrote :

Hardy Heron Alpha2 was recently released. It contains an updated version of the kernel. You can download and try the new Hardy Heron Alpha2 release from http://cdimage.ubuntu.com/releases/hardy/alpha-2/ . You should be able to then test the new kernel via the LiveCD. If you can, please verify if this bug still exists or not and report back your results. General information regarding the release can also be found here: http://www.ubuntu.com/testing/hardy/alpha2 . Thanks!

Richard Appleby (disposable01) wrote :

Thanks for coming back to me on this - I'll need to back up the current system before I can try this suggestion, but with a little luck I should be able to give this a try in the next few days. Will post as soon as I have news.

Richard Appleby (disposable01) wrote :

Sorry for delay in responding. The live CD has no support for RAID, so in the end I had to completely reinstall with the new Hardy Alpha2 code, which took a lot more time. Unfortunately there was no improvement at all in the symptoms - all the same problems previously described (timeout, device error, HSM violation etc etc) showed up during the install. Progress was so bad that in the end I terminated the install (of the base OS) after about 2 hours, as it was making no headway against the constant flow of disk errors (as seen in the logs accessed via the CTRL-ALT-F4 key). Frustratingly I couldn't find any way to capture those logs, as I didn't seem able to mount a USB key to the installing system - that could have been my lack of knowledge/problem, but annoying all the same.

Since this is my main server I can't keep this as a test system just for this problem, so I've had to restore my Gutsy system again. I'm now running on only the two integrated "Via" SATA controllers in a RAID 1 configuration (rather than RAID 5) to avoid the drive attached to the Silicon Image PCI card. Since I am now nominally "not using" the 500GB drive attached to the Silicon Image PCI card, I can probably test further suggestions much more easily, as there ought to be no need to install/reimage the server each time, and a live CD ought to be able to get at the Silicon Image controller/drive now.

Let me know what more I can do to help. Thanks

Richard Appleby (disposable01) wrote :

For what it's worth, I decided to copy some 10GB of data from my RAID1 drives, over to the drive attached to the PCI card, which is now NOT part of a raid set. I tarred the data up, and ran it through bzip2, which resulted in the system being completely CPU bound (very low-powered CPU), and limiting the speed of the transfer. It took some 6 hours to actually move all the data across, but I got no errors during the copy at all. I then copied the 10GB backup to another directory on the same drive, and this time I got all the same errors as previously reported. I currently estimate that it will take around 1 hour to move the data (with errors, freezes, etc), so the basic data rate is much higher. Looks like the sata_sil driver and the 3512 run into problems with high load or I/O rates.
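
The difference between the two copies is easier to see as average data rates. The sketch below treats both transfers as 10GB, which is an assumption: if bzip2 shrank the data, the first rate was lower still. The helper name avg_rate_mb_s is made up for illustration.

```python
def avg_rate_mb_s(gigabytes: float, hours: float) -> float:
    """Average transfer rate in MB/s for a copy of the given size and duration."""
    return gigabytes * 1024 / (hours * 3600)


# First copy: CPU bound by bzip2, about 6 hours, no errors.
slow = avg_rate_mb_s(10, 6)
# Second copy: plain copy on the same drive, estimated 1 hour, with errors.
fast = avg_rate_mb_s(10, 1)

print(f"error-free copy: {slow:.2f} MB/s")
print(f"copy with errors: {fast:.2f} MB/s")
print(f"ratio: {fast / slow:.1f}x")
```

Even the failing copy averages only a few MB/s once the freezes are included, so the errors seem to be triggered by short bursts of I/O rather than by a high sustained rate.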

Leann Ogasawara (leannogasawara) wrote :

This report will remain open against the actively developed kernel and is being closed against linux-source-2.6.22. Thanks!

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
status: Incomplete → Triaged
Changed in linux-source-2.6.22:
status: New → Won't Fix
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Richard Appleby (disposable01) wrote :

Apologies for the long delay in responding.

Because of the nature of this machine (no CD-drive, and no keyboard, mouse or screen) I'd like to try to test this using option (1). However, I'm a little confused as to the process for getting hold of the correct kernel packages; I see that Ben Collins has posted what look like the right packages here: http://kernel.ubuntu.com/pub/next/2.6.27-rc3/hardy/ ... am I right in assuming that I simply need to install the following two packages:
- linux-headers-2.6.27-1-server_2.6.27-1.1_i386.deb and
- linux-image-2.6.27-1-server_2.6.27-1.1_i386.deb
and then to run sudo update-grub?

Or do I need to follow another process, and if so, what?!

Thanks

Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Triaged a while ago but has not had any updated comments for quite some time. Please let us know if this issue remains in the current Ubuntu release, http://www.ubuntu.com/getubuntu/download . If the issue remains, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: Triaged → Incomplete
Richard Appleby (disposable01) wrote :

This problem is fixed as of Ubuntu 9.04 (Jaunty) - apologies, I should have closed this off.

Changed in linux (Ubuntu):
status: Incomplete → Fix Released