SATA drive freezes when using LVM over dm-crypt
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Invalid | Undecided | Unassigned |
linux-source-2.6.22 (Ubuntu) | Won't Fix | Undecided | Unassigned |
Bug Description
Hi,
I can reproduce this bug on various kernels and systems (including Debian stable, Debian testing and Kubuntu 7.10), and I am unsure whether this is a SATA driver, dm-crypt/device-mapper or LVM problem.
After the initial boot of a complete (non-network) installation of Kubuntu 7.10, drive access works normally. After a few minutes the system freezes completely and logs a large number of SATA errors (see below). After a few more rounds of errors, the drive recovers and works normally for as long as I have used the system (a few hours). As noted above, this behavior is reproducible across several kernel versions and distributions. I am using an LVM over dm-crypt installation with the following layout:
SCSI1 (0,0,0) #1 primary 67.1 GB ntfs
#2 primary 510 MB ext2 /boot
#3 primary 182.4 GB crypto (sda3_crypt)
Encrypted Volume (sda3_crypt) 182.4 GB Linux device mapper
#1 182.4 GB lvm
LVM VG disk1, LV home 107.4 GB Linux device mapper
#1 107.4 GB ext2 /home
LVM VG disk1, LV swap 2.1 GB Linux device mapper
#1 2.1GB swap swap
LVM VG disk1, LV system 72.9 GB Linux device mapper
#1 72.9 GB ext2 /
Error messages as reported by dmesg:
[ 0.000000] Linux version 2.6.22-14-generic (buildd@palmer) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2))
... cont.
[ 22.606745] ata1: SATA max UDMA/133 cmd 0xf8860480 ctl 0xf88604a0 bmdma 0x0001d400 irq 18
[ 22.606748] ata2: SATA max UDMA/133 cmd 0xf8860580 ctl 0xf88605a0 bmdma 0x0001d408 irq 18
[ 23.071589] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 23.116317] ata1.00: ATA-7: ST3250620NS, 3.AEG, max UDMA/133
[ 23.116319] ata1.00: 488397168 sectors, multi 1: LBA48 NCQ (depth 31/32)
[ 23.182879] ata1.00: configured for UDMA/133
[ 23.491089] ata2: SATA link down (SStatus 0 SControl 300)
[ 23.491170] scsi 0:0:0:0: Direct-Access ATA ST3250620NS 3.AE PQ: 0 ANSI: 5
[ 23.491177] ata1: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61
... cont.
[ 23.499267] sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
[ 23.499277] sd 0:0:0:0: [sda] Write Protect is off
[ 23.499279] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 23.499289] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 23.499322] sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
[ 23.499328] sd 0:0:0:0: [sda] Write Protect is off
[ 23.499329] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 23.499337] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 23.499341] sda: sda1 sda2 sda3
[ 23.514090] sd 0:0:0:0: [sda] Attached SCSI disk
[ 23.517291] sd 0:0:0:0: Attached scsi generic sg0 type 0
... cont.
[ 113.972000] ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
[ 113.972000] ata1: CPB 1: ctl_flags 0x1f, resp_flags 0x2
[ 113.972000] ata1: CPB 2: ctl_flags 0x1f, resp_flags 0x2
[ 113.972000] ata1: CPB 3: ctl_flags 0x1f, resp_flags 0x2
[ 113.972000] ata1: CPB 4: ctl_flags 0x1f, resp_flags 0x2
[ 113.972000] ata1: timeout waiting for ADMA IDLE, stat=0x400
[ 113.972000] ata1: timeout waiting for ADMA LEGACY, stat=0x400
[ 113.972000] ata1.00: exception Emask 0x0 SAct 0x1e SErr 0x200000 action 0x2 frozen
[ 113.972000] ata1.00: cmd 61/00:08:
[ 113.972000] res 40/00:00:
[ 113.972000] ata1.00: cmd 61/78:10:
[ 113.972000] res 40/00:00:
[ 113.972000] ata1.00: cmd 61/08:18:
[ 113.972000] res 40/00:00:
[ 113.972000] ata1.00: cmd 60/10:20:
[ 113.972000] res 40/00:00:
[ 114.284000] ata1: soft resetting port
[ 114.440000] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 114.572000] ata1.00: configured for UDMA/133
[ 114.576000] ata1: EH complete
[ 114.576000] sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
[ 114.576000] sd 0:0:0:0: [sda] Write Protect is off
[ 114.576000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 114.576000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
... cont.
[ 233.248000] ata1: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x4 next cpb idx 0x0
[ 233.248000] ata1: CPB 0: ctl_flags 0x1f, resp_flags 0x0
[ 233.248000] ata1: CPB 1: ctl_flags 0x1f, resp_flags 0x0
[ 233.248000] ata1: CPB 2: ctl_flags 0x1f, resp_flags 0x0
[ 233.248000] ata1: CPB 3: ctl_flags 0x1f, resp_flags 0x0
[ 233.248000] ata1: CPB 4: ctl_flags 0x1f, resp_flags 0x0
[ 233.248000] ata1: timeout waiting for ADMA IDLE, stat=0x400
[ 233.248000] ata1: timeout waiting for ADMA LEGACY, stat=0x400
[ 233.248000] ata1.00: NCQ disabled due to excessive errors
[ 233.248000] ata1.00: exception Emask 0x0 SAct 0x1f SErr 0x0 action 0x2 frozen
[ 233.248000] ata1.00: cmd 60/08:00:
[ 233.248000] res 40/00:00:
[ 233.248000] ata1.00: cmd 60/10:08:
[ 233.248000] res 40/00:00:
[ 233.248000] ata1.00: cmd 61/08:10:
[ 233.248000] res 40/00:00:
[ 233.248000] ata1.00: cmd 61/78:18:
[ 233.248000] res 40/00:00:
[ 233.248000] ata1.00: cmd 61/00:20:
[ 233.248000] res 40/00:00:
[ 233.560000] ata1: soft resetting port
[ 233.716000] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 233.828000] ata1.00: configured for UDMA/133
[ 233.828000] ata1: EH complete
[ 233.896000] sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
[ 233.904000] sd 0:0:0:0: [sda] Write Protect is off
[ 233.904000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 233.920000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
From then on, SATA runs without further problems.
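As an aid to reading the exception lines in the dmesg output above: the SAct value is a bitmask in which each set bit corresponds to one NCQ tag that was still outstanding when the port froze. The following is a minimal sketch, not part of libata; the function name `outstanding_tags` and the regular expression are illustrative assumptions.

```python
import re

# Matches libata exception lines such as:
#   ata1.00: exception Emask 0x0 SAct 0x1e SErr 0x200000 action 0x2 frozen
SACT_RE = re.compile(r"exception Emask 0x[0-9a-fA-F]+ SAct (0x[0-9a-fA-F]+)")


def outstanding_tags(line):
    """Return the NCQ tags whose bits are set in the SAct value on this line.

    Returns an empty list if the line is not an exception line.
    """
    match = SACT_RE.search(line)
    if match is None:
        return []
    sact = int(match.group(1), 16)
    # SActive is a 32-bit register, one bit per queued command tag.
    return [tag for tag in range(32) if sact & (1 << tag)]
```

Applied to the log above, `SAct 0x1e` decodes to tags 1 to 4 and `SAct 0x1f` to tags 0 to 4, which lines up with the CPB entries printed just before each exception.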
Steps to reproduce: install any kernel newer than 2.6.18, set up a SATA disk with LVM over dm-crypt as documented above, and use the disk. Voilà.
Please advise what I can do to help narrow down or solve the problem. If it were up to me, I would rate this bug "critical", since I do not know whether drive access works correctly and therefore whether data storage is reliable.
Thank you for your support!
Cheers
Jens
PS: I initially overlooked the line "[ 233.248000] ata1.00: NCQ disabled due to excessive errors". Could this be an NCQ problem? I have read that several drives are blacklisted in the driver source:
libata-core.c: /* NCQ hard hangs device under heavier load, needs hard power cycle */
libata-core.c: { "Maxtor 6B250S0", "BANC1B70", ATA_HORKAGE_NONCQ },
libata-core.c: { "HTS541060G9SA00", "MB3OC60D", ATA_HORKAGE_NONCQ, },
libata-core.c: { "HTS541080G9SA00", "MB4OC60D", ATA_HORKAGE_NONCQ, },
libata-core.c: { "HTS541010G9SA00", "MBZOC60D", ATA_HORKAGE_NONCQ, },
libata-core.c: { "HTS541680J9SA00", "SB2IC7EP", ATA_HORKAGE_NONCQ, },
libata-core.c: { "HTS541612J9SA00", "SBDIC7JP", ATA_HORKAGE_NONCQ, },
libata-core.c: { "HTS722012K9A300", "DCCOC54P", ATA_HORKAGE_NONCQ, },
libata-core.c: { "HTS541616J9SA00", "SB4OC70P", ATA_HORKAGE_NONCQ, },
libata-core.c: { "WDC WD740ADFD-00NLR1", NULL, ATA_HORKAGE_NONCQ, },
libata-core.c: { "FUJITSU MHV2080BH", "00840028", ATA_HORKAGE_NONCQ, },
Mine is a Seagate ST3250620NS. Maybe it needs to be added to the list? Unfortunately there seems to be no kernel parameter to disable NCQ at boot time. How do I forward this bug to the libata/sata_nv developers as a "possible NCQ issue"?
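A possible runtime test, sketched under assumptions: libata exposes the per-disk queue depth through sysfs at /sys/block/<disk>/device/queue_depth, and writing 1 there turns NCQ off for that disk. Whether this attribute is writable on the 2.6.22 kernel used here has not been verified in this report. The helper name `set_queue_depth` and its `sysfs_root` parameter are hypothetical, and the write needs root privileges.

```python
from pathlib import Path


def set_queue_depth(disk, depth, sysfs_root="/sys"):
    """Write a new queue depth for a disk and return the previous value.

    disk       -- block device name, e.g. "sda"
    depth      -- new queue depth; 1 effectively disables NCQ
    sysfs_root -- mount point of sysfs (parameter exists only so the
                  function can be tried against a scratch directory)
    """
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")
    path = Path(sysfs_root) / "block" / disk / "device" / "queue_depth"
    previous = int(path.read_text().strip())
    path.write_text(f"{depth}\n")
    return previous
```

Calling `set_queue_depth("sda", 1)` as root right after boot, before the first freeze, would show whether the timeouts still occur with NCQ off. This is not a boot parameter, so it does not cover I/O issued before the call is made.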
The problem *is* related to NCQ; however, I believe it is *not* the drive. I am running into the same problem on a different machine with a SAMSUNG SP2504C SATA drive.
--- snip ---
[ 0.000000] Linux version 2.6.22-14-generic (buildd@palmer) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #1 SMP Sun Oct 14 23:05:12 GMT 2007 (Ubuntu 2.6.22-14.46-generic)
...
[ 4.936000] sata_nv 0000:00:08.0: version 3.4
[ 4.936000] sata_nv 0000:00:08.0: Using ADMA mode
...
[ 800.196000] ata1: timeout waiting for ADMA IDLE, stat=0x400
[ 800.196000] ata1: timeout waiting for ADMA LEGACY, stat=0x400
[ 800.196000] ata1.00: exception Emask 0x0 SAct 0x7ffff SErr 0x200000 action 0x2 frozen
[ 800.196000] ata1.00: cmd 61/00:00:1c:03:3c/04:00:1b:00:00/40 tag 0 cdb 0x0 data 524288 out
[ 800.196000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 800.196000] ata1.00: cmd 61/00:08:1c:ff:3b/02:00:1b:00:00/40 tag 1 cdb 0x0 data 262144 out
[ 800.196000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 800.196000] ata1.00: cmd 61/00:10:1c:01:3c/02:00:1b:00:00/40 tag 2 cdb 0x0 data 262144 out
...
[ 918.972000] ata1: timeout waiting for ADMA IDLE, stat=0x400
[ 918.972000] ata1: timeout waiting for ADMA LEGACY, stat=0x400
[ 918.972000] ata1.00: NCQ disabled due to excessive errors
[ 918.972000] ata1.00: exception Emask 0x0 SAct 0x1fffff SErr 0x0 action 0x2 frozen
[ 918.972000] ata1.00: cmd 60/08:00:f4:75:14/00:00:1b:00:00/40 tag 0 cdb 0x0 data 4096 in
[ 918.972000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 918.972000] ata1.00: cmd 61/08:08:bc:1e:fe/00:00:1a:00:00/40 tag 1 cdb 0x0 data 4096 out
[ 918.972000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 918.972000] ata1.00: cmd 61/40:10:ac:1d:4a/00:00:1a:00:00/40 tag 2 cdb 0x0 data 32768 out
[ 918.972000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
...
[ 919.284000] ata1: soft resetting port
[ 919.440000] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 919.960000] ata1.00: configured for UDMA/133
[ 919.960000] ata1: EH complete
[ 920.056000] sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
[ 920.056000] sd 0:0:0:0: [sda] Write Protect is off
[ 920.056000] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 920.056000] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
... and the drive is working again.
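To make logs like the two above easier to compare when forwarding the report, the error-handling rounds can be summarized from saved dmesg output. This is a rough sketch that assumes the message wording seen in these logs ("EH complete", "NCQ disabled due to excessive errors"); the function name `eh_rounds` is hypothetical.

```python
import re

# Matches lines such as "[ 233.828000] ata1: EH complete" and
# "[ 233.248000] ata1.00: NCQ disabled due to excessive errors".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+(?:\.\d+)?): (.*)$")


def eh_rounds(dmesg_lines):
    """Return one (timestamp, device, ncq_disabled) tuple per completed EH round.

    ncq_disabled is True if libata reported turning NCQ off at some point
    since the previous "EH complete" line.
    """
    rounds = []
    ncq_disabled = False
    for line in dmesg_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        timestamp = float(match.group(1))
        device, message = match.group(2), match.group(3)
        if message.startswith("NCQ disabled"):
            ncq_disabled = True
        elif message == "EH complete":
            rounds.append((timestamp, device, ncq_disabled))
            ncq_disabled = False
    return rounds
```

Fed the output of `dmesg` split into lines, this returns one entry per recovery, so it is easy to see after which round NCQ was switched off and the errors stopped.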
The common denominators are that both machines use the nForce4 chipset, both use the sata_nv driver, and both use LVM over dm-crypt.
I cannot be sure, but there may be a bug in the combination of sata_nv, NCQ and LVM-mapped devices. I would appreciate it if someone else could look into this issue. The drive also makes odd noises during the error phase, and I am not sure that is healthy.
Cheers
Jens