KVM system crashes after starting guest

Bug #1596635 reported by bugproxy
Affects          Status        Importance  Assigned to            Milestone
linux (Ubuntu)   Fix Released  High        Canonical Kernel Team
Xenial           Fix Released  Undecided   Tim Gardner

Bug Description

== Comment: #0 - Chanh H. Nguyen - 2016-06-25 00:24:28 ==
We have Ubuntu 16.04.1 on our SuperMicro system with some of the virtualization packages installed. Defining a guest with PCI passthrough works fine, but the system crashes at xhci_irq+0x1bc/0xf50 after we start the guest.

7c:mon> e
cpu 0x7c: Vector: 300 (Data Access) at [c000001e1b80f760]
    pc: c00000000088217c: xhci_irq+0x1bc/0xf50
    lr: c000000000882050: xhci_irq+0x90/0xf50
    sp: c000001e1b80f9e0
   msr: 9000000102009033
   dar: 28
 dsisr: 40000000
  current = 0xc000001e1bc82a20
  paca = 0xc000000007b89a00 softe: 0 irq_happened: 0x01
    pid = 4026, comm = libvirtd
7c:mon> t
[c000001e1b80fb00] c00000000080ebb0 usb_hcd_irq+0x50/0xa0
[c000001e1b80fb30] c00000000082af58 usb_hcd_pci_remove+0x68/0x1c0
[c000001e1b80fb70] c00000000088a118 xhci_pci_remove+0x78/0xb0
[c000001e1b80fba0] c0000000005e54b0 pci_device_remove+0x70/0x110
[c000001e1b80fbe0] c0000000006d1550 __device_release_driver+0xc0/0x190
[c000001e1b80fc10] c0000000006d1660 device_release_driver+0x40/0x70
[c000001e1b80fc40] c0000000006cf860 unbind_store+0x170/0x1b0
[c000001e1b80fc80] c0000000006ce1d4 drv_attr_store+0x64/0xa0
[c000001e1b80fcc0] c0000000003978d0 sysfs_kf_write+0x80/0xb0
[c000001e1b80fd00] c0000000003967e8 kernfs_fop_write+0x188/0x200
[c000001e1b80fd50] c0000000002e126c __vfs_write+0x6c/0xe0
[c000001e1b80fd90] c0000000002e1fa0 vfs_write+0xc0/0x230
[c000001e1b80fde0] c0000000002e2fdc SyS_write+0x6c/0x110
[c000001e1b80fe30] c000000000009204 system_call+0x38/0xb4
--- Exception: c01 (System Call) at 00003fff7f6e6708
SP (3fff7abfd520) is in userspace
7c:mon> r
R00 = c000000000882050 R16 = 00003fff7a400000
R01 = c000001e1b80f9e0 R17 = c000000000df4200
R02 = c0000000015b4200 R18 = c000000000b84200
R03 = d000080081560024 R19 = c000000000de4200
R04 = c000000004880000 R20 = 0000000000000001
R05 = c000000004884000 R21 = 00003fff5400565d
R06 = c000000004884000 R22 = 00003fff5875aa80
R07 = 000000000000003e R23 = 00003fff7fa914e0
R08 = 0000000000000000 R24 = 00003fff7fa90b90
R09 = 0000000000000006 R25 = c000000000df4200
R10 = 0000000000000000 R26 = c000001e1b80fe00
R11 = 0000000000000006 R27 = c000001e3a2d1698
R12 = c000000000881fc0 R28 = c000000001550f98
R13 = c000000007b89a00 R29 = c000000004880260
R14 = 0000000000000000 R30 = c0000000048802ac
R15 = 0000000000000000 R31 = c000000004880000
pc = c00000000088217c xhci_irq+0x1bc/0xf50
cfar= c000000000008468 slb_miss_realmode+0x50/0x78
lr = c000000000882050 xhci_irq+0x90/0xf50
msr = 9000000102009033 cr = 28028882
ctr = c000000000881fc0 xer = 0000000000000000 trap = 300
dar = 0000000000000028 dsisr = 40000000
7c:mon> d c000000000b000f0
c000000000b000f0 4c696e7578207665 7273696f6e20342e |Linux version 4.|
c000000000b00100 342e302d32342d67 656e657269632028 |4.0-24-generic (|
c000000000b00110 6275696c64644062 6f7330312d707063 |buildd@bos01-ppc|
c000000000b00120 3634656c2d303233 2920286763632076 |64el-023) (gcc v|
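
For readers following the backtrace: the write that ends in unbind_store is the standard sysfs driver-unbind interface, reached here from libvirtd (comm = libvirtd) as it detaches the device from its host driver. A minimal Python sketch of that request follows, as a dry run that only builds the path and payload. The helper name `unbind_request` is hypothetical, and the PCI address 0001:09:00.0 is assumed from the dmesg output later in this report; this is not libvirt's actual code.

```python
from pathlib import PurePosixPath
from typing import Tuple

SYSFS_PCI_DRIVERS = PurePosixPath("/sys/bus/pci/drivers")


def unbind_request(driver: str, pci_address: str) -> Tuple[str, str]:
    """Return (sysfs file, payload) for detaching a PCI device from a driver.

    Dry run only: nothing is written. On an affected kernel, performing
    this write for the xHCI controller is the step that led to the crash.
    """
    return str(SYSFS_PCI_DRIVERS / driver / "unbind"), pci_address


# Assumed values: driver name from the backtrace, address from the dmesg log.
path, payload = unbind_request("xhci_hcd", "0001:09:00.0")
print(f"echo {payload} > {path}")
```

The sketch stays a dry run on purpose, since performing the write on a host with an unfixed kernel would reproduce the crash.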

== Comment: #9 - Gabriel Krisman Bertazi - 2016-06-27 08:43:33 ==

(In reply to comment #0)
> We have Ubuntu 16.04.1 on our SuperMicro system with some of the
> virtualization packages installed. Defining a guest with PCI passthrough
> works fine, but the system crashes at xhci_irq+0x1bc/0xf50 after we start
> the guest.
>
> 7c:mon> e
> cpu 0x7c: Vector: 300 (Data Access) at [c000001e1b80f760]
> pc: c00000000088217c: xhci_irq+0x1bc/0xf50
> lr: c000000000882050: xhci_irq+0x90/0xf50
> sp: c000001e1b80f9e0
> msr: 9000000102009033
> dar: 28
> dsisr: 40000000
> current = 0xc000001e1bc82a20
> paca = 0xc000000007b89a00 softe: 0 irq_happened: 0x01
> pid = 4026, comm = libvirtd

Hi,

From a quick look, it seems you are missing this commit:

commit 27a41a83ec54d0edfcaf079310244e7f013a7701
Author: Gabriel Krisman Bertazi <email address hidden>
Date: Wed Jun 1 18:09:07 2016 +0300

    xhci: Cleanup only when releasing primary hcd

==

Canonical,

Please backport this to 16.04.1.

bugproxy (bugproxy) wrote : full log

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-143075 severity-critical targetmilestone-inin16041
bugproxy (bugproxy) wrote : dumpxml file

Default Comment by Bridge

bugproxy (bugproxy) wrote : lspci -vv

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-06-29 14:00 EDT-------
(In reply to comment #28)
> This bug is awaiting verification that the kernel in -proposed solves the
> problem. Please test the kernel and update this bug with the results. If the
> problem is solved, change the tag 'verification-needed-xenial' to
> 'verification-done-xenial'.
>
> If verification is not done by 5 working days from today, this fix will be
> dropped from the source code, and this bug will be closed.
>
> See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to
> enable and use -proposed. Thank you!

Hello Canonical,

I applied the -proposed kernel and we still hit this issue. The system is in xmon now.
0:mon> ls linux_banner
linux_banner: c000000000b000f0
0:mon> d c000000000b000f0
c000000000b000f0 4c696e7578207665 7273696f6e20342e |Linux version 4.|
c000000000b00100 342e302d32382d67 656e657269632028 |4.0-28-generic (|
c000000000b00110 6275696c64644062 6f7330312d707063 |buildd@bos01-ppc|
c000000000b00120 3634656c2d303138 2920286763632076 |64el-018) (gcc v|
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c000000fe9eaf760]
pc: c000000000882bfc: xhci_irq+0x1bc/0xf50
lr: c000000000882ad0: xhci_irq+0x90/0xf50
sp: c000000fe9eaf9e0
msr: 9000000102009033
dar: 28
dsisr: 40000000
current = 0xc000000fe5c2b710
paca = 0xc000000007b40000 softe: 0 irq_happened: 0x01
pid = 3945, comm = libvirtd
0:mon>
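
As an aside on reading the dump above: the hex words printed by xmon's `d` command are just the bytes of linux_banner, so they can be decoded offline to confirm which kernel was running. A small Python sketch follows; the function name `decode_xmon_dump` is hypothetical, and it assumes each row has the layout shown in this report (address, two 8-byte hex words, then the |ASCII| column).

```python
def decode_xmon_dump(lines):
    """Decode rows of xmon `d` output into the ASCII text they contain."""
    raw = bytearray()
    for line in lines:
        # Keep only the part before xmon's own |ASCII| column, then drop
        # the leading address and decode the remaining hex words.
        fields = line.split("|")[0].split()
        for word in fields[1:]:
            raw += bytes.fromhex(word)
    return raw.decode("ascii")


dump = [
    "c000000000b000f0 4c696e7578207665 7273696f6e20342e |Linux version 4.|",
    "c000000000b00100 342e302d32382d67 656e657269632028 |4.0-28-generic (|",
    "c000000000b00110 6275696c64644062 6f7330312d707063 |buildd@bos01-ppc|",
    "c000000000b00120 3634656c2d303138 2920286763632076 |64el-018) (gcc v|",
]
banner = decode_xmon_dump(dump)
print(banner)
# Linux version 4.4.0-28-generic (buildd@bos01-ppc64el-018) (gcc v
```

Decoded this way, the banner shows the crash happened on 4.4.0-28-generic.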

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-29 15:39 EDT-------
Hello Canonical,
Sorry, I used apt-get dist-upgrade and it installed the -28 kernel.
When I used the "aptitude" command instead, my system was upgraded to the -29 kernel.
With the -29 kernel, I am able to start my guest that has the PCI passthrough.

root@micro:~# uname -r
4.4.0-29-generic

root@micro:~# virsh list
Id Name State
----------------------------------------------------
6 microg4 running

I also see these "xhci_hcd" errors. Should I be worried about the init failure?
root@micro:~# dmesg |grep "xhci_hcd"
[ 1.884017] xhci_hcd 0001:09:00.0: xHCI Host Controller
[ 1.884079] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 1
[ 1.884166] xhci_hcd 0001:09:00.0: Using 64-bit DMA iommu bypass
[ 1.884229] xhci_hcd 0001:09:00.0: hcc params 0x0270f06d hci version 0x96 quirks 0x00000000
[ 1.884936] xhci_hcd 0001:09:00.0: xHCI Host Controller
[ 1.884941] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 2
[ 2.193049] usb 1-3: new high-speed USB device number 2 using xhci_hcd
[ 2.433162] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[ 2.561045] usb 1-4: new high-speed USB device number 3 using xhci_hcd
[ 2.801107] usb 2-4: new SuperSpeed USB device number 3 using xhci_hcd
[ 2.913045] usb 1-3.1: new low-speed USB device number 4 using xhci_hcd
[ 68.765623] xhci_hcd 0001:09:00.0: remove, state 1
[ 68.865172] xhci_hcd 0001:09:00.0: Host not halted after 16000 microseconds.
[ 68.865175] xhci_hcd 0001:09:00.0: Host controller not halted, aborting reset.
[ 68.865244] xhci_hcd 0001:09:00.0: USB bus 2 deregistered
[ 68.865299] xhci_hcd 0001:09:00.0: remove, state 1
[ 69.329779] xhci_hcd 0001:09:00.0: USB bus 1 deregistered
[ 70.233109] xhci_hcd 0001:09:00.0: xHCI Host Controller
[ 70.233116] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 1
[ 70.264505] xhci_hcd 0001:09:00.0: Host not halted after 16000 microseconds.
[ 70.264507] xhci_hcd 0001:09:00.0: can't setup: -110
[ 70.264586] xhci_hcd 0001:09:00.0: USB bus 1 deregistered
[ 70.264597] xhci_hcd 0001:09:00.0: init 0001:09:00.0 fail, -110
[ 70.264652] xhci_hcd: probe of 0001:09:00.0 failed with error -110
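
Since the -28 versus -29 mix-up above came down to which kernel was actually running, a quick check of the release string can help before retesting. The Python sketch below is only an illustration: the threshold of 29 is an assumption taken from the results reported in this thread (-24 and -28 crash, -29 works), not from the package changelog, and the helper names are hypothetical.

```python
import re

# Assumption: first Xenial ABI number observed to work in this report.
FIRST_GOOD_ABI = 29


def xenial_abi(release: str) -> int:
    """Extract the ABI number from a release such as '4.4.0-29-generic'."""
    match = re.fullmatch(r"4\.4\.0-(\d+)-\w+", release)
    if match is None:
        raise ValueError(f"not a Xenial 4.4.0 kernel release: {release!r}")
    return int(match.group(1))


def has_xhci_fix(release: str) -> bool:
    """True if the release is at or above the ABI that worked in this thread."""
    return xenial_abi(release) >= FIRST_GOOD_ABI


for release in ("4.4.0-24-generic", "4.4.0-28-generic", "4.4.0-29-generic"):
    print(release, has_xhci_fix(release))
```

On a live host, the same check could be fed `platform.release()`, which returns the string that `uname -r` prints.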

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-29 16:57 EDT-------
(In reply to comment #32)
> Hello Canonical,
> Sorry, I used apt-get dist-upgrade and it installed the -28 kernel.
> When I used the "aptitude" command instead, my system was upgraded to the
> -29 kernel. With the -29 kernel, I am able to start my guest that has the
> PCI passthrough.
>
> root@micro:~# uname -r
> 4.4.0-29-generic
>
> root@micro:~# virsh list
>  Id    Name                           State
> ----------------------------------------------------
>  6     microg4                        running
>
> I also see these "xhci_hcd" errors. Should I be worried about the init
> failure?
> root@micro:~# dmesg |grep "xhci_hcd"
> [ 1.884017] xhci_hcd 0001:09:00.0: xHCI Host Controller
> [ 1.884079] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 1
> [ 1.884166] xhci_hcd 0001:09:00.0: Using 64-bit DMA iommu bypass
> [ 1.884229] xhci_hcd 0001:09:00.0: hcc params 0x0270f06d hci version 0x96 quirks 0x00000000
> [ 1.884936] xhci_hcd 0001:09:00.0: xHCI Host Controller
> [ 1.884941] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 2
> [ 2.193049] usb 1-3: new high-speed USB device number 2 using xhci_hcd
> [ 2.433162] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
> [ 2.561045] usb 1-4: new high-speed USB device number 3 using xhci_hcd
> [ 2.801107] usb 2-4: new SuperSpeed USB device number 3 using xhci_hcd
> [ 2.913045] usb 1-3.1: new low-speed USB device number 4 using xhci_hcd
> [ 68.765623] xhci_hcd 0001:09:00.0: remove, state 1
> [ 68.865172] xhci_hcd 0001:09:00.0: Host not halted after 16000 microseconds.
> [ 68.865175] xhci_hcd 0001:09:00.0: Host controller not halted, aborting reset.
> [ 68.865244] xhci_hcd 0001:09:00.0: USB bus 2 deregistered
> [ 68.865299] xhci_hcd 0001:09:00.0: remove, state 1
> [ 69.329779] xhci_hcd 0001:09:00.0: USB bus 1 deregistered
> [ 70.233109] xhci_hcd 0001:09:00.0: xHCI Host Controller
> [ 70.233116] xhci_hcd 0001:09:00.0: new USB bus registered, assigned bus number 1
> [ 70.264505] xhci_hcd 0001:09:00.0: Host not halted after 16000 microseconds.
> [ 70.264507] xhci_hcd 0001:09:00.0: can't setup: -110
> [ 70.264586] xhci_hcd 0001:09:00.0: USB bus 1 deregistered
> [ 70.264597] xhci_hcd 0001:09:00.0: init 0001:09:00.0 fail, -110
> [ 70.264652] xhci_hcd: probe of 0001:09:00.0 failed with error -110

This log was taken from the host after the guest was destroyed, right? That's a different issue, which also reproduces upstream. I think it has something to do with an erratum for this hardware.

Does the controller probe successfully from inside the guest?

We should have a new bug opened to track it.
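
For anyone picking up that follow-up issue: the -110 in the quoted log is the kernel's -ETIMEDOUT, which matches the "Host not halted after 16000 microseconds" lines just before it. A rough Python sketch for pulling probe failures out of dmesg output and naming the error is below. The function name `probe_failures` is hypothetical, and the errno table is a small hand-written subset of the Linux values, hard-coded because the numbering differs on other operating systems.

```python
import re

# Subset of Linux errno values; hard-coded since numbering is OS-specific.
LINUX_ERRNO = {5: "EIO", 12: "ENOMEM", 19: "ENODEV", 110: "ETIMEDOUT"}

PROBE_FAILURE = re.compile(r"probe of (\S+) failed with error -(\d+)")


def probe_failures(dmesg_lines):
    """Yield (device, errno name) for each driver probe failure line."""
    for line in dmesg_lines:
        match = PROBE_FAILURE.search(line)
        if match:
            code = int(match.group(2))
            yield match.group(1), LINUX_ERRNO.get(code, f"errno {code}")


log = [
    "[ 70.264505] xhci_hcd 0001:09:00.0: Host not halted after 16000 microseconds.",
    "[ 70.264507] xhci_hcd 0001:09:00.0: can't setup: -110",
    "[ 70.264652] xhci_hcd: probe of 0001:09:00.0 failed with error -110",
]
failures = list(probe_failures(log))
print(failures)
# [('0001:09:00.0', 'ETIMEDOUT')]
```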

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-29 17:47 EDT-------
> Does the controller probe successfully from inside the guest?
It probes successfully inside the guest.

bugproxy (bugproxy)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-30.49

---------------
linux (4.4.0-30.49) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597897

  * FCP devices are not detected correctly nor deterministically (LP: #1567602)
    - scsi_dh_alua: Disable ALUA handling for non-disk devices
    - scsi_dh_alua: Use vpd_pg83 information
    - scsi_dh_alua: improved logging
    - scsi_dh_alua: sanitze sense code handling
    - scsi_dh_alua: use standard logging functions
    - scsi_dh_alua: return standard SCSI return codes in submit_rtpg
    - scsi_dh_alua: fixup description of stpg_endio()
    - scsi_dh_alua: use flag for RTPG extended header
    - scsi_dh_alua: use unaligned access macros
    - scsi_dh_alua: rework alua_check_tpgs() to return the tpgs mode
    - scsi_dh_alua: simplify sense code handling
    - scsi: Add scsi_vpd_lun_id()
    - scsi: Add scsi_vpd_tpg_id()
    - scsi_dh_alua: use scsi_vpd_tpg_id()
    - scsi_dh_alua: Remove stale variables
    - scsi_dh_alua: Pass buffer as function argument
    - scsi_dh_alua: separate out alua_stpg()
    - scsi_dh_alua: Make stpg synchronous
    - scsi_dh_alua: call alua_rtpg() if stpg fails
    - scsi_dh_alua: switch to scsi_execute_req_flags()
    - scsi_dh_alua: allocate RTPG buffer separately
    - scsi_dh_alua: Use separate alua_port_group structure
    - scsi_dh_alua: use unique device id
    - scsi_dh_alua: simplify alua_initialize()
    - revert commit a8e5a2d593cb ("[SCSI] scsi_dh_alua: ALUA handler attach should
      succeed while TPG is transitioning")
    - scsi_dh_alua: move optimize_stpg evaluation
    - scsi_dh_alua: remove 'rel_port' from alua_dh_data structure
    - scsi_dh_alua: Use workqueue for RTPG
    - scsi_dh_alua: Allow workqueue to run synchronously
    - scsi_dh_alua: Add new blacklist flag 'BLIST_SYNC_ALUA'
    - scsi_dh_alua: Recheck state on unit attention
    - scsi_dh_alua: update all port states
    - scsi_dh_alua: Send TEST UNIT READY to poll for transitioning
    - scsi_dh_alua: do not fail for unknown VPD identification

linux (4.4.0-29.48) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1597015

  * Wireless hotkey fails on Dell XPS 15 9550 (LP: #1589886)
    - intel-hid: new hid event driver for hotkeys
    - intel-hid: fix incorrect entries in intel_hid_keymap
    - intel-hid: allocate correct amount of memory for private struct
    - intel-hid: add a workaround to ignore an event after waking up from S4.
    - [Config] CONFIG_INTEL_HID_EVENT=m

  * cgroupfs mounts can hang (LP: #1588056)
    - Revert "UBUNTU: SAUCE: (namespace) mqueue: Super blocks must be owned by the
      user ns which owns the ipc ns"
    - Revert "UBUNTU: SAUCE: kernfs: Do not match superblock in another user
      namespace when mounting"
    - Revert "UBUNTU: SAUCE: cgroup: Use a new super block when mounting in a
      cgroup namespace"
    - (namespace) bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
    - (namespace) bpf, inode: disallow userns mounts
    - (namespace) ipc: Initialize ipc_namespace->user_ns early.
    - (namespace) vfs: Pass data, ns, and ns->userns to mount_ns
    - SAUCE: (namespace) S...


Changed in linux (Ubuntu):
status: Triaged → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-12 10:35 EDT-------
(In reply to comment #37)
> This bug was fixed in the package linux - 4.4.0-30.49

Thanks!

Chanh, please give 4.4.0-30.49 one last try so that we can close this.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-31.50

---------------
linux (4.4.0-31.50) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1602449

  * nouveau: boot hangs at blank screen with unsupported graphics cards
    (LP: #1602340)
    - SAUCE: drm: check for supported chipset before booting fbdev off the hw


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released