kernel_misc from ubuntu_ltp timeout on F/B-kvm because of tpci test

Bug #1947422 reported by Kelsey Steele
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
ubuntu-kernel-tests
New
Undecided
Unassigned

Bug Description

kernel_misc is failing with a timeout on both Focal/kvm and Bionic/kvm. Host is possibly crashing or hanging during test.

Found on these hosts: autovm2, autovm4, and Larry

85304 09:39:51 DEBUG| Persistent state client._record_indent now set to 1
85305 09:39:51 DEBUG| Persistent state client.unexpected_reboot deleted
85306 09:39:51 DEBUG| Test has timeout: 7200 sec.
85307 09:39:51 INFO | START ubuntu_ltp.kernel_misc ubuntu_ltp.kernel_misc timestamp=1634204391 timeout=7200 localtime=Oct 14 09:39:51

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
Download full text (3.7 KiB)

Tried to reproduce this on B-KVM (4.15.0-1100-kvm) on a kvm instance zeppo, this test suite is failing with the "tpci" test.

It looks like it will hang with PCI bus 00 slot 20:
test_pci 16 TCONF : tpci.c:74: PCI bus 00 slot 09 : Test-case '16'
test_pci 17 TPASS : PCI bus 00 slot 18 : Test-case '1'
test_pci 18 TPASS : PCI bus 00 slot 18 : Test-case '2'
test_pci 19 TPASS : PCI bus 00 slot 18 : Test-case '3'
test_pci 20 TPASS : PCI bus 00 slot 18 : Test-case '4'
test_pci 21 TPASS : PCI bus 00 slot 18 : Test-case '5'
test_pci 22 TPASS : PCI bus 00 slot 18 : Test-case '6'
test_pci 23 TPASS : PCI bus 00 slot 18 : Test-case '7'
test_pci 24 TPASS : PCI bus 00 slot 18 : Test-case '8'
test_pci 25 TPASS : PCI bus 00 slot 18 : Test-case '9'
test_pci 26 TPASS : PCI bus 00 slot 18 : Test-case '10'
test_pci 27 TPASS : PCI bus 00 slot 18 : Test-case '11'
test_pci 28 TPASS : PCI bus 00 slot 18 : Test-case '12'
test_pci 29 TPASS : PCI bus 00 slot 18 : Test-case '13'
test_pci 30 TPASS : PCI bus 00 slot 18 : Test-case '14'
test_pci 31 TPASS : PCI bus 00 slot 18 : Test-case '15'
test_pci 32 TCONF : tpci.c:74: PCI bus 00 slot 18 : Test-case '16'
test_pci 33 TPASS : PCI bus 00 slot 20 : Test-case '1'
test_pci 34 TPASS : PCI bus 00 slot 20 : Test-case '2'
test_pci 35 TPASS : PCI bus 00 slot 20 : Test-case '3'
test_pci 36 TPASS : PCI bus 00 slot 20 : Test-case '4'
test_pci 37 TPASS : PCI bus 00 slot 20 : Test-case '5'
test_pci 38 TPASS : PCI bus 00 slot 20 : Test-case '6'
test_pci 39 TPASS : PCI bus 00 slot 20 : Test-case '7'
test_pci 40 TPASS : PCI bus 00 slot 20 : Test-case '8'
test_pci 41 TPASS : PCI bus 00 slot 20 : Test-case '9'
test_pci 42 TPASS : PCI bus 00 slot 20 : Test-case '10'
test_pci 43 TPASS : PCI bus 00 slot 20 : Test-case '11'
(Test hangs here)

Output from virsh console:
Oct 19 08:02:25 zeppo kernel: ltp_tpci: test-case 13
Oct 19 08:02:25 zeppo kernel: ltp_tpci: save state
Oct 19 08:02:25 zeppo kernel: ltp_tpci: saved state of device
Oct 19 08:02:25 zeppo kernel: ltp_tpci: test-case 14
Oct 19 08:02:25 zeppo kernel: ltp_tpci: restore state
Oct 19 08:02:25 zeppo kernel: ltp_tpci: test-case 15
Oct 19 08:02:25 zeppo kernel: ltp_tpci: find device capability
Oct 19 08:02:25 zeppo kernel: ltp_tpci: device has PM capability
Oct 19 08:02:25 zeppo kernel: ltp_tpci: test-case 16
Oct 19 08:02:25 zeppo kernel: ltp_tpci: device doesn't have PCI-EXP capability
Oct 19 08:02:26 zeppo kernel: ltp_tpci: device released
Oct 19 08:02:26 zeppo kernel:
(no output here, ssh session will not respond at this point. Console is still working, but if you try to do anything it will freeze as well)

$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] [8086:1237] (rev 02)
00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] [8086:7000]
00:01.1 IDE interface [0101]: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] [8086:7010]
00:01.2 USB controller [0c03]: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] [8086:7020] (rev 01)
00:01.3 Bridge [0680]: Intel Corporation 82...

Read more...

Revision history for this message
Krzysztof Kozlowski (krzk) wrote :

Found also on 2021.10.18/impish/linux-oracle/5.13.0-1009.11

tags: added: impish oracle sru-20211018
Po-Hsu Lin (cypressyew)
summary: - kernel_misc from ubuntu_ltp timeout on F/B-kvm
+ kernel_misc from ubuntu_ltp timeout on F/B-kvm because of tpci test
Revision history for this message
Ian May (ian-may) wrote :

Found on impish/linux-aws: 5.13.0-1007.8

tags: added: aws sru-20211108
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Found on F-aws-5.13 5.13.0-1008.9~20.04.2 with specific instances:
  * c3.xlarge
  * c4.large
  * c5n.large
  * i3en.24xlarge
  * m5a.large
  * r5.large

The tpci test on these instances all ends up with incomplete test report and cause kernel-results.xml.[failed-to-read] issue on our jenkins

tags: added: 5.13 sru-20211129
Po-Hsu Lin (cypressyew)
tags: added: 5.15 sru-20220530
Revision history for this message
Cory Todd (corytodd) wrote :

Found on bionic/linux-kvm/4.15.0-1126.131

Revision history for this message
John Cabaj (john-cabaj) wrote :

Found on bionic/linux-kvm/4.15.0-1129.134

Revision history for this message
Roxana Nicolescu (roxanan) wrote :

Found on focal/linux-kvm/5.4.0-1085.91

This test was removed for linux-kvm to let other test run and see the complete results.
https://code.launchpad.net/~fginther/+git/autotest-client-tests/+merge/436101

Revision history for this message
Roxana Nicolescu (roxanan) wrote (last edit ):

I managed to reproduce it on autovm1.maas running linux-kvm-5.4.0-1085.91.
It always freezes when test 12 is run (test_assign_resources) for 0000:00:04.0 device.

dmesg output
[ 1547.611739] ltp_tpci: assign resources
[ 1547.611741] ltp_tpci: assign resource #0
[ 1547.611745] ltp_tpci: name = 0000:00:04.0, flags = 262401, start 0xc120, end 0xc13f
[ 1547.611746] ltp_tpci: assign resource #1
[ 1547.611748] ltp_tpci: name = 0000:00:04.0, flags = 262656, start 0xfeb91000, end 0xfeb91fff
[ 1547.611750] ltp_tpci: assign resource #2
[ 1547.611752] ltp_tpci: name = 0000:00:04.0, flags = 0, start 0x0, end 0x0
[ 1547.611797] ltp_tpci: assign resource #3
[ 1547.611800] ltp_tpci: name = 0000:00:04.0, flags = 0, start 0x0, end 0x0
[ 1547.611801] ltp_tpci: assign resource #4
[ 1547.611803] ltp_tpci: name = 0000:00:04.0, flags = 1319436, start 0xfe004000, end 0xfe007fff
[ 1547.611832] virtio-pci 0000:00:04.0: BAR 4: releasing [mem 0xfe004000-0xfe007fff 64bit pref]
[ 1547.611864] virtio-pci 0000:00:04.0: BAR 4: assigned [mem 0x100000000-0x100003fff 64bit pref]

lspci -nv output for device 0000:00:04.0:
00:04.0 0200: 1af4:1000
        Subsystem: 1af4:0001
        Physical Slot: 4
        Flags: bus master, fast devsel, latency 0, IRQ 10
        I/O ports at c120 [size=32]
        Memory at feb91000 (32-bit, non-prefetchable) [size=4K]
        Memory at fe004000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at 80000000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: virtio-pci

lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] [8086:1237] (rev 02)
00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] [8086:7000]
00:01.1 IDE interface [0101]: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] [8086:7010]
00:01.2 USB controller [0c03]: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] [8086:7020] (rev 01)
00:01.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 03)
00:02.0 VGA compatible controller [0300]: Cirrus Logic GD 5446 [1013:00b8]
00:03.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon [1af4:1002]
00:04.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
00:05.0 Communication controller [0780]: Red Hat, Inc. Virtio console [1af4:1003]
00:06.0 SCSI storage controller [0100]: Red Hat, Inc. Virtio block device [1af4:1001]

My assumption is that it started failing in 2021 after this commit was introduced d631e9caef6818cffae6e028191787efcf65b3da
https://lists.linux.it/pipermail/ltp/2021-June/023296.html

Before that commit pci_assign_resource call was failing because of insufficient space and the test suite would continue running.
Now, pci_assign_resource produces a freeze. Probably because this has a side effect. That also explain why we cannot connect to the machine since this is network device. I still cannot explain why this happens, because we do check before if it's safe to use that memory location (IORESOURCE_PREFETCH flag check).

Revision history for this message
Jacob Martin (jacobmartin) wrote (last edit ):

I was able to reproduce this issue in a local VM. The VM was created and managed using virt-manager on a Focal host. I could reproduce this on a guest running focal:linux, focal:linux-hwe-5.15, and focal:linux-kvm.

(5.15.0-60-generic) ubuntu@focal-vm-2:~$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] [8086:1237] (rev 02)
00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] [8086:7000]
00:01.1 IDE interface [0101]: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] [8086:7010]
00:01.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 03)
00:02.0 VGA compatible controller [0300]: Red Hat, Inc. QXL paravirtual graphic card [1b36:0100] (rev 04)
00:03.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
00:04.0 Audio device [0403]: Intel Corporation 82801I (ICH9 Family) HD Audio Controller [8086:293e] (rev 03)
00:05.0 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 [8086:2934] (rev 03)
00:05.1 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 [8086:2935] (rev 03)
00:05.2 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 [8086:2936] (rev 03)
00:05.7 USB controller [0c03]: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 [8086:293a] (rev 03)
00:06.0 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)
00:08.0 SCSI storage controller [0100]: Red Hat, Inc. Virtio block device [1af4:1001]
00:09.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon [1af4:1002]
00:0a.0 Unclassified device [00ff]: Red Hat, Inc. Virtio RNG [1af4:1005]

I could only reproduce it while using the emulated i440FX chipset. This issue does not seem to apply to the Q35 chipset. Perhaps this is related to the i440FX only supporting PCI and the Q35 chipset supporting PCIe.

Additionally, I was able to reproduce this after running test 12 on either the 00:08.0 SCSI storage controller, or the 00:03.0 Ethernet controller.

With the SCSI storage controller, a system hang would occur when attempting read data from the disk that had not been cached (e.g. running a program that was not run since boot).

With the Ethernet controller, a system hang would occur if I set the network link down (i.e. # ip link set ens3 down), but not from a ping test, which did not have any successful pings get through, but also did not hang.

Revision history for this message
Jacob Martin (jacobmartin) wrote :

Something to note, with the Q35/PCIe the resource is reassigned to the same address range it had originally. While with the i440FX/PCI, the resource is reassigned to a different address range.

An example, both running test 12 on the respective storage controller:

With Q35 (and thus PCIe):
[ 29.366793] ltp_tpci: assign resource #4
[ 29.366793] ltp_tpci: name = 0000:03:00.0, flags = 1319436, start 0xfe600000, end 0xfe603fff
[ 29.366800] virtio-pci 0000:03:00.0: BAR 4: releasing [mem 0xfe600000-0xfe603fff 64bit pref]
[ 29.366805] virtio-pci 0000:03:00.0: BAR 4: assigned [mem 0xfe600000-0xfe603fff 64bit pref]
[ 29.369398] ltp_tpci: assign resource to '4', ret '0'

With i440FX (and thus PCI):
[ 389.243214] ltp_tpci: assign resource #4
[ 389.243214] ltp_tpci: name = 0000:00:08.0, flags = 1319436, start 0xfebf4000, end 0xfebf7fff
[ 389.243216] virtio-pci 0000:00:08.0: BAR 4: releasing [mem 0xfebf4000-0xfebf7fff 64bit pref]
[ 389.243223] virtio-pci 0000:00:08.0: BAR 4: assigned [mem 0x240004000-0x240007fff 64bit pref]
[ 389.244118] ltp_tpci: assign resource to '4', ret '0'

Maybe reassigning a PCI device resource changes address ranges that the higher level drivers expect to remain constant?

Po-Hsu Lin (cypressyew)
tags: added: ubuntu-ltp-kernel-misc
removed: ubuntu-ltp
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.