lots of printk to serial console can hang system for long time

Bug #1534216 reported by Dan Streetman
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Medium
Dan Streetman
Xenial
Fix Released
Medium
Dan Streetman

Bug Description

This is a clone from bug 1505564, to track the separate issue of the serial port driver failing to schedule itself off its cpu.

The original bug's problem was caused by the kernel spamming a huge number of error messages in a certain situation. Normally, that would not be a problem, but in this case the system is virtualized, and logs over its serial port. When the massive number of kernel messages are sent to the serial port driver, it can't keep up, so sending all the log messages can take a very long time - minutes or longer - and the serial port driver fails to schedule itself off the cpu it's using during that time. That results in other cpus hanging, waiting for the serial port driver's cpu to become avaiable.

I'll update the bug with more details as I debug.

Tags: sts
Dan Streetman (ddstreet)
Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
status: New → In Progress
Revision history for this message
William Grant (wgrant) wrote :

I don't believe we've ever seen this in VMs. The main place excessive kernel output has been problematic is on scalingstack compute nodes, which are mostly HP ProLiant DL360p Gen8s (AMD64) and HP ProLiant m400s (aka. mcdivitt, ARMv8). Both have a virtual serial port exposed by the BMC or chassis, and kernel messages go to that. But the machines run VMs -- they're not themselves virtualised.

In the DL360p case we get CPU hangs, while in the m400 case we get excessive memory consumption.

Revision history for this message
Dan Streetman (ddstreet) wrote : Re: serial port driver under heavy load can hang system

Ah ok, thnx. bug title updated.

summary: - serial port driver in some virtual systems can hang
+ serial port driver under heavy load can hang system
penalvch (penalvch)
Changed in linux (Ubuntu):
importance: Undecided → Medium
Dan Streetman (ddstreet)
summary: - serial port driver under heavy load can hang system
+ lots of printk to serial console can hang system for long time
Revision history for this message
Dan Streetman (ddstreet) wrote :

This isn't actually an issue with the serial port driver, it's the printk code that is failing to schedule while printing out all the msgs. During printk(), if the console lock is held, the msgs are queued up, and inside console_unlock() any queued msgs are sent to the console before unlocking:

void console_unlock(void)
{
...
        bool do_cond_resched, retry;
...
        /*
         * Console drivers are called under logbuf_lock, so
         * @console_may_schedule should be cleared before; however, we may
         * end up dumping a lot of lines, for example, if called from
         * console registration path, and should invoke cond_resched()
         * between lines if allowable. Not doing so can cause a very long
         * scheduling stall on a slow console leading to RCU stall and
         * softlockup warnings which exacerbate the issue with more
         * messages practically incapacitating the system.
         */
        do_cond_resched = console_may_schedule;
        console_may_schedule = 0;
...
        for (;;) {
...
                call_console_drivers(level, text, len);
...
         if (do_cond_resched)
                 cond_resched();
 }

so what happens here is, console_may_schedule is 0 and therefore cond_resched() isn't called between printk msgs sent to the serial port driver; and the serial port takes so long to actually send the msgs, eventually rcu thinks the cpu has soft-locked up.

Revision history for this message
Dan Streetman (ddstreet) wrote :

this is (mostly) fixed in 2 commits, from the original bug (bug 1505564) kernel level commit 5a3a466ae3c86ad6b8fc36c2efc5f541ef0ed89e (from git://kernel.ubuntu.com/ubuntu/ubuntu-trusty.git lts-backport-vivid) partially fixes this problem, by using the console_may_schedule var correctly. The recent upstream commit 6b97a20d3a7909daa06625d4440c2c52d7bf08d7 (from upstream) completes fixing the problem, by setting console_may_schedule correctly.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

8d91f8b15361dfb438ab6eb3b319e2ded43458ff ('printk: do cond_resched() between lines while outputting to consoles') came in via 4.4 stable. I also cherry-picked 6b97a20d3a7909daa06625d4440c2c52d7bf08d7 ('printk: set may_schedule for some of console_trylock() callers')

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (7.5 KiB)

This bug was fixed in the package linux - 4.4.0-17.33

---------------
linux (4.4.0-17.33) xenial; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1563441

  * ISST-LTE: pVM:high cpus number need a high crashkernel value in kdump
    (LP: #1560552)
    - SAUCE: (noup) ppc64 boot: Wait for boot cpu to show up if nr_cpus limit is
      about to hit.

  * Predictable naming mechanism is leading to issues in DLPAR operations of
    NICs (LP: #1560514)
    - SAUCE: (noup) powerpc/pci: Assign fixed PHB number based on device-tree
      properties

  * ThunderX: support alternative phy implementations (LP: #1562968)
    - net: thunderx: Cleanup PHY probing code.
    - [Config] CONFIG_MDIO_CAVIUM=m
    - phy: mdio-octeon: Refactor into two files/modules
    - [Config] CONFIG_MDIO_THUNDER=m
    - phy: mdio-thunder: Add driver for Cavium Thunder SoC MDIO buses.
    - phy: mdio-cavium: Add missing MODULE_* annotations.
    - net: cavium: For Kconfig THUNDER_NIC_BGX, select MDIO_THUNDER.
    - phy: mdio-thunder: Fix some Kconfig typos
    - [d-i] Add phy drivers for Cavium ThunderX to nic-modules udeb

  * linux: exclude ZONE_DEVICE from GFP_ZONE_TABLE (LP: #1563293)
    - Revert "mm: CONFIG_NR_ZONES_EXTENDED"
    - mm: exclude ZONE_DEVICE from GFP_ZONE_TABLE

  * lots of printk to serial console can hang system for long time
    (LP: #1534216)
    - printk: set may_schedule for some of console_trylock() callers

  * [i915_bpo] Update i915 backport driver (LP: #1560395)
    - SAUCE: i915_bpo: Update to drm-intel-next-fixes-2016-03-16
    - PM / runtime: Add new helper for conditional usage count incrementation
    - drm/core: Add drm_for_each_encoder_mask, v2.
    - drm/atomic-helper: Implement subsystem-level suspend/resume

  * [Hyper-V] VM Sockets (LP: #1541585)
    - Drivers: hv: vmbus: Cleanup vmbus_set_event()
    - Drivers: hv: vmbus: Add vendor and device atttributes
    - Drivers: hv: vmbus: avoid infinite loop in init_vp_index()
    - Drivers: hv: vmbus: avoid scheduling in interrupt context in vmbus_initiate_unload()
    - Drivers: hv: vmbus: don't manipulate with clocksources on crash
    - Drivers: hv: vmbus: add a helper function to set a channel's pending send size
    - Drivers: hv: vmbus: define the new offer type for Hyper-V socket (hvsock)
    - Drivers: hv: vmbus: vmbus_sendpacket_ctl: hvsock: avoid unnecessary signaling
    - Drivers: hv: vmbus: define a new VMBus message type for hvsock
    - Drivers: hv: vmbus: add a hvsock flag in struct hv_driver
    - Drivers: hv: vmbus: add a per-channel rescind callback
    - Drivers: hv: vmbus: add an API vmbus_hvsock_device_unregister()
    - Drivers: hv: vmbus: Eliminate the spin lock on the read path
    - Drivers: hv: vmbus: Give control over how the ring access is serialized
    - drivers/hv: Move VMBus hypercall codes into Hyper-V UAPI header
    - Drivers: hv: vmbus: don't loose HVMSG_TIMER_EXPIRED messages
    - Drivers: hv: vmbus: avoid wait_for_completion() on crash
    - Drivers: hv: vmbus: remove code duplication in message handling
    - Drivers: hv: vmbus: avoid unneeded compiler optimizations in vmbus_wait_for_unload()
    - Drivers: hv: util: Pass the chann...

Read more...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.