Fix CPU lockup on Hyper-V/Azure

Bug #1498206 reported by Stephen A. Zarkos
This bug affects 1 person
Affects          Status         Importance   Assigned to        Milestone
linux (Ubuntu)   Fix Released   Medium       Tim Gardner
  Trusty         Fix Released   High         Joseph Salisbury
  Vivid          Fix Released   High         Joseph Salisbury
  Wily           Fix Released   Medium       Tim Gardner

Bug Description

Description of problem:
Large VMs with 32 vCPUs and more than 100 GB of memory sometimes hang at boot, and the console keeps printing messages such as "soft lockup - CPU#14 stuck for 23s!"

This issue is resolved by the following upstream patch:
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/arch/x86/kernel/cpu?id=88c9281a9fba67636ab26c1fd6afbc78a632374f

Repro rate is low. To repro, provision a large VM (e.g. a G5 VM on Azure) and reboot it many times, monitoring the console until the "soft lockup - CPU" error is seen.
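Because the repro rate is low, it can help to scan captured console output automatically rather than watching it by hand. The following is a minimal sketch, assuming the console text has already been saved somewhere (how it is captured is not covered here); the function name `find_soft_lockups` is hypothetical, and the pattern is based only on the message quoted in this report.

```python
import re

# Matches the watchdog message quoted in this report, e.g.
# "soft lockup - CPU#14 stuck for 23s!". The exact wording may differ
# between kernel versions; this pattern is an assumption based on that quote.
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")


def find_soft_lockups(console_text):
    """Return a list of (cpu, seconds) tuples, one per soft-lockup message."""
    return [(int(cpu), int(secs)) for cpu, secs in SOFT_LOCKUP.findall(console_text)]
```

Calling `find_soft_lockups(text)` on the saved console output of each boot returns an empty list for a clean boot, so a reboot loop could stop as soon as the list is non-empty.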

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1498206

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Stephen A. Zarkos (stevez) wrote :

No logs required.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key kernel-hyper-v
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Joseph Salisbury (jsalisbury) wrote :

Looks like this patch landed in v4.3-rc1:

commit 88c9281a9fba67636ab26c1fd6afbc78a632374f
Author: Vitaly Kuznetsov <email address hidden>
Date: Wed Aug 19 09:54:24 2015 -0700

    x86/hyperv: Mark the Hyper-V TSC as unstable

Joseph Salisbury (jsalisbury) wrote :

I built a Wily test kernel with a cherry-pick of commit 88c9281a. It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1498206/wily

Can this kernel be tested to confirm it resolves this bug? If it does, I can also build Vivid and Trusty test kernels.

Thanks in advance!

Changed in linux (Ubuntu Vivid):
status: New → Triaged
Changed in linux (Ubuntu Trusty):
status: New → Triaged
Changed in linux (Ubuntu Vivid):
importance: Undecided → Medium
Changed in linux (Ubuntu Wily):
importance: High → Medium
Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Vivid):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Wily):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Changed in linux (Ubuntu Vivid):
status: Triaged → In Progress
Changed in linux (Ubuntu Trusty):
status: Triaged → In Progress
tags: added: trusty vivid wily
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Wily):
assignee: Joseph Salisbury (jsalisbury) → Tim Gardner (timg-tpi)
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-12.14

---------------
linux (4.2.0-12.14) wily; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1499712

  [ Ben Pope ]

  * SAUCE: drivers/net/ethernet/atheros/alx: Add Killer E2400 device ID
    - LP: #1498633

  [ Knuth Posern ]

  * SAUCE: thunderbolt: Allow loading of module on recent Apple MacBooks
    with thunderbolt 2 controller
    - LP: #1497321

  [ Laurent Dufour ]

  * SAUCE: powerpc/hvsi: Fix endianness issues in the HVSI driver
    - LP: #1499357

  [ Upstream Kernel Changes ]

  * x86/hyperv: Mark the Hyper-V TSC as unstable
    - LP: #1498206
  * intel_pstate: fix PCT_TO_HWP macro
    - LP: #1499040
  * perf/x86/intel/rapl: Add support for Knights Landing (KNL)
    - LP: #1461370
  * drm/i915: Add audio pin sense / ELD callback
    - LP: #1398277
  * drm/i915: Call audio pin/ELD notify function
    - LP: #1398277
  * ALSA: hda - allow codecs to access the i915 pin/ELD callback
    - LP: #1398277
  * ALSA: hda - Wake the codec up on pin/ELD notify events
    - LP: #1398277
  * drm/i915: Add locks around audio component bind/unbind
    - LP: #1398277
  * drm/i915: Drop port_mst_index parameter from pin/eld callback
    - LP: #1398277

 -- Tim Gardner <email address hidden> Thu, 24 Sep 2015 09:19:23 -0600

Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
Stephen A. Zarkos (stevez) wrote :

Hi Joseph,

Can we please raise the priority of this to get it into Trusty and Vivid as soon as possible? We are seeing more repros of this issue on Azure.

Thanks,
Steve

Changed in linux (Ubuntu Trusty):
importance: Medium → High
Changed in linux (Ubuntu Vivid):
importance: Medium → High
Tim Gardner (timg-tpi) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu Vivid):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
tags: added: verification-done-trusty verification-done-vivid
removed: verification-needed-trusty verification-needed-vivid
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-37.42

---------------
linux (3.19.0-37.42) vivid; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1518406

  [ K. Y. Srinivasan ]

  * SAUCE: Drivers: hv: vmbus: Fix a Host signaling bug
    - LP: #1508706

 -- Kamal Mostafa <email address hidden> Fri, 20 Nov 2015 09:49:10 -0800

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-70.113

---------------
linux (3.13.0-70.113) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1516733

  [ Upstream Kernel Changes ]

  * arm64: errata: use KBUILD_CFLAGS_MODULE for erratum #843419
    - LP: #1516682

linux (3.13.0-69.112) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1514858

  [ Joseph Salisbury ]

  * SAUCE: storvsc: use small sg_tablesize on x86
    - LP: #1495983

  [ Luis Henriques ]

  * [Config] updateconfigs after 3.13.11-ckt28 and 3.13.11-ckt29 stable
    updates

  [ Upstream Kernel Changes ]

  * ext4: fix indirect punch hole corruption
    - LP: #1292234
  * x86/hyperv: Mark the Hyper-V TSC as unstable
    - LP: #1498206
  * namei: permit linking with CAP_FOWNER in userns
    - LP: #1498162
  * iwlwifi: pci: add a few more PCI subvendor IDs for the 7265 series
    - LP: #1510616
  * Drivers: hv: vmbus: Increase the limit on the number of pfns we can
    handle
    - LP: #1495983
  * sctp: fix race on protocol/netns initialization
    - LP: #1514832
  * [media] v4l: omap3isp: Fix sub-device power management code
    - LP: #1514832
  * [media] rc-core: fix remove uevent generation
    - LP: #1514832
  * xtensa: fix threadptr reload on return to userspace
    - LP: #1514832
  * ARM: OMAP2+: DRA7: clockdomain: change l4per2_7xx_clkdm to SW_WKUP
    - LP: #1514832
  * mac80211: enable assoc check for mesh interfaces
    - LP: #1514832
  * PCI: Add dev_flags bit to access VPD through function 0
    - LP: #1514832
  * PCI: Add VPD function 0 quirk for Intel Ethernet devices
    - LP: #1514832
  * usb: dwc3: ep0: Fix mem corruption on OUT transfers of more than 512
    bytes
    - LP: #1514832
  * serial: 8250_pci: Add support for Pericom PI7C9X795[1248]
    - LP: #1514832
  * KVM: MMU: fix validation of mmio page fault
    - LP: #1514832
  * auxdisplay: ks0108: fix refcount
    - LP: #1514832
  * devres: fix devres_get()
    - LP: #1514832
  * iio: adis16400: Fix adis16448 gyroscope scale
    - LP: #1514832
  * iio: Add inverse unit conversion macros
    - LP: #1514832
  * iio: adis16480: Fix scale factors
    - LP: #1514832
  * iio: industrialio-buffer: Fix iio_buffer_poll return value
    - LP: #1514832
  * iio: event: Remove negative error code from iio_event_poll
    - LP: #1514832
  * NFSv4: don't set SETATTR for O_RDONLY|O_EXCL
    - LP: #1514832
  * unshare: Unsharing a thread does not require unsharing a vm
    - LP: #1514832
  * ASoC: adav80x: Remove .read_flag_mask setting from
    adav80x_regmap_config
    - LP: #1514832
  * drivers: usb :fsl: Implement Workaround for USB Erratum A007792
    - LP: #1514832
  * drivers: usb: fsl: Workaround for USB erratum-A005275
    - LP: #1514832
  * serial: 8250: don't bind to SMSC IrCC IR port
    - LP: #1514832
  * staging: comedi: adl_pci7x3x: fix digital output on PCI-7230
    - LP: #1514832
  * blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    - LP: #1514832
  * xtensa: fix kernel register spilling
    - LP: #1514832
  * NFS: nfs_set_pgio_error sometimes misses errors
    - LP: #1514832
  * NFS: Fix a NULL pointer dereference of migration...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released