Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

Bug #1821259 reported by Mauricio Faria de Oliveira
Affects           Status         Importance   Assigned to   Milestone
linux (Ubuntu)
  Xenial          Fix Released   Undecided    Unassigned
  Bionic          Fix Released   Undecided    Unassigned

Bug Description

[Impact]

 * This problem hard locks up 2 CPUs in a deadlock, which in
   turn soft locks up other CPUs; the system becomes unusable.

 * This is relatively rare / difficult to hit because it's a
   corner case in scheduling/load balancing that depends on
   timing with the CPU stopper code, and it needs an SMP plus
   _NUMA_ system (but it can be hit with the synthetic test
   case attached to this LP bug).

 * Since SMP plus NUMA usually means _servers_, it is worth
   preventing this bug and the hard lockups / reboots it causes.

 * The fix resolves the potential deadlock by moving one of
   the calls required for the deadlock out from under the
   locked code.

[Test Case]

 * There's a synthetic test case to reproduce this problem
   (although without the stack traces - just a system hang)
   attached to this LP bug.

 * It uses kprobes/mdelay/cpu stopper calls to force the code
   to execute and force the timing/locking condition to occur.

 * $ sudo insmod kmod-stopper.ko

   Some dmesg logging occurs, and the system either hangs or not.
   See examples in comments.

[Regression Potential]

 * These are patches to the cpu stop_machine.c code, and they
   change a bit how it works; however, there are no further
   upstream fixes for these patches, and they are still at the top
   of the 'git log --oneline -- kernel/stop_machine.c' output.

 * These patches have been verified with the synthetic test case
   and 'stress-ng --class scheduler --sequential 0' (no regressions)
   on a guest with 2 CPUs and one physical system with 24 CPUs.

[Other Info]

 * The patches are required on Xenial and later.
 * There are 4 patches for Xenial, and 2 patches pending for Bionic.
 * All patches are applied from Cosmic onwards.

[Original Description]

These 2 hard lockups appeared in the logs all of a sudden, and many soft lockups followed them as fallout.

    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: Watchdog detected hard LOCKUP on cpu 10
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: <...>
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 0010:[<ffffffff810c8cb0>] [<ffffffff810c8cb0>] native_queued_spin_lock_slowpath+0x160/0x170
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: 0000:ffff883ff2113c58 EFLAGS: 00000002
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 0000000000000101 RBX: 0000000000000086 RCX: 0000000000000001
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881fff991ba8
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: ffff883ff2113c58 R08: 0000000000000101 R09: ffff883ff082e200
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 0000000000002e04 R11: 0000000000002e04 R12: ffff881fff997c60
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: ffff881fff991ba8 R14: 0000000000000000 R15: ffff881fff997300
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS: 0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 00007f7caaa23020 CR3: 0000001f46740000 CR4: 0000000000160670
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092] ffff883ff2113c68 ffffffff811870eb ffff883ff2113c80 ffffffff81819907
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094] ffff881fff991ba0 ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096] ffff881fff997c90 ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105] [<ffffffff811870eb>] queued_spin_lock_slowpath+0xb/0xf
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109] [<ffffffff81819907>] _raw_spin_lock_irqsave+0x37/0x40
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113] [<ffffffff8111c600>] cpu_stop_queue_work+0x30/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116] [<ffffffff8111ccd0>] stop_one_cpu_nowait+0x30/0x40
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119] [<ffffffff810bbb5b>] load_balance+0x71b/0x940
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122] [<ffffffff810bbff5>] pick_next_task_fair+0x275/0x4b0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126] [<ffffffff81816166>] __schedule+0x6c6/0x7f0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132] [<ffffffff810a2560>] ? sort_range+0x30/0x30
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134] [<ffffffff818162c5>] schedule+0x35/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136] [<ffffffff810a262d>] smpboot_thread_fn+0xcd/0x180
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139] [<ffffffff8109f138>] kthread+0xd8/0xf0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143] [<ffffffff81819ff5>] ret_from_fork+0x55/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144] [<ffffffff8109f060>] ? kthread_park+0x60/0x60

    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: Watchdog detected hard LOCKUP on cpu 6
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in: <...>
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932 Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: ffff881ff03dd400 ti: ffff883cda77c000 task.ti: ffff883cda77c000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP: 0010:[<ffffffff810aacb6>] [<ffffffff810aacb6>] try_to_wake_up+0x86/0x3f0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP: 0000:ffff883cda77fa78 EFLAGS: 00000002
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff883ff2a768d4
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: ffff883cda77fab8 R08: 000000000000000a R09: ffff881ff03dd400
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000017300
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: ffff883ff2a768d4 R14: 0000000000000046 R15: 0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS: 00007ff8ecbc9700(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 0000000014583550 CR3: 0000003d4ac96000 CR4: 0000000000160670
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365] 0000000000000202 ffff883cda77fa98 0000000000000003 0000000000000006
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368] 000000000000000a ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370] ffff883cda77fac8 ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375] [<ffffffff810ab035>] wake_up_process+0x15/0x20
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379] [<ffffffff8111cc22>] stop_two_cpus+0x1b2/0x230
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382] [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384] [<ffffffff810b5d15>] ? dequeue_entity+0x455/0x8a0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386] [<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388] [<ffffffff810aaa70>] ? __migrate_swap_task.part.83+0x80/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390] [<ffffffff810ab18e>] migrate_swap+0xae/0x130
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392] [<ffffffff810b4e44>] task_numa_migrate+0x504/0x930
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394] [<ffffffff810b52e9>] numa_migrate_preferred+0x79/0x80
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396] [<ffffffff810b9373>] task_numa_fault+0x923/0xcd0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400] [<ffffffff8175e407>] ? tcp_recvmsg+0x6b7/0xbd0
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404] [<ffffffff811da9be>] ? mpol_misplaced+0x14e/0x190
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408] [<ffffffff811b7836>] handle_pte_fault+0x5a6/0x1440
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411] [<ffffffff816f6693>] ? sock_recvmsg+0x43/0x50
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413] [<ffffffff811b9540>] handle_mm_fault+0x250/0x540
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417] [<ffffffff81069e1a>] __do_page_fault+0x19a/0x430
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419] [<ffffffff8106a0d2>] do_page_fault+0x22/0x30
    Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423] [<ffffffff8181c5a8>] page_fault+0x28/0x30

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Analysis
--------

It is harder to get the interesting data out of the 1st hard lockup, as the registers holding variables
related to the cpu number have apparently been clobbered by more recent calls in the spinlock path.

Looking at the 2nd hard lockup:

addr2line plus the source code show that try_to_wake_up() is indeed looping at line 1997, with IRQs disabled since line 1939 (thus a hard lockup):

    $ addr2line -pifae ddeb-116.140/usr/lib/debug/boot/vmlinux-4.4.0-116-generic 0xffffffff810aacb6
    0xffffffff810aacb6: try_to_wake_up at /build/linux-lts-xenial-ozsla7/linux-lts-xenial-4.4.0/kernel/sched/core.c:1997

    1926 static int
    1927 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
    1928 {
    ...
    1939 raw_spin_lock_irqsave(&p->pi_lock, flags);
    ...
    1993 /*
    1994 * If the owning (remote) cpu is still in the middle of schedule() with
    1995 * this task as prev, wait until its done referencing the task.
    1996 */
    1997 while (p->on_cpu)
    1998 cpu_relax();
    ...
    2027 raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    2028
    2029 return success;
    2030 }

The objdump disassembly of try_to_wake_up() in vmlinux around the RIP instruction address (ffffffff810aacb6)
shows a while loop that just checks for a non-zero 'p->on_cpu' and calls cpu_relax() (which translates to the 'pause' instruction):

    ffffffff810aacb1: f3 90 pause
    ffffffff810aacb3: 8b 43 28 mov 0x28(%rbx),%eax
    ffffffff810aacb6: 85 c0 test %eax,%eax
    ffffffff810aacb8: 75 f7 jne ffffffff810aacb1 <try_to_wake_up+0x81>

So, it checks the value at the pointer in RBX plus offset 0x28, which according to the 'pahole' tool is indeed the 'on_cpu' field:

    $ pahole --hex -C task_struct ddeb-116.140/usr/lib/debug/boot/vmlinux-4.4.0-116-generic | grep on_cpu
        int on_cpu; /* 0x28 0x4 */

So, the task_struct pointer is in RBX, which is:

    RBX: ffff883ff2a76200

And that matches the other hard locked up task on CPU 10 (see its 'task:' field).
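
The pointer match can also be checked mechanically. The following is a minimal Python sketch, not part of the original analysis: it re-reads the 'task:' and 'RBX:' values quoted from the two traces, and derives the address of 'on_cpu' and the loop-head offset from the objdump and pahole numbers shown above. The helper name field() is hypothetical, and the input lines are copied from this report with the syslog prefix trimmed.

```python
import re

# Values copied from the two hard-lockup reports in this bug.
CPU10_TASK_LINE = "task: ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000"
CPU6_RBX_LINE = "RAX: 0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000"
CPU6_RIP = 0xFFFFFFFF810AACB6    # try_to_wake_up+0x86 (RIP on CPU 6)
LOOP_HEAD = 0xFFFFFFFF810AACB1   # 'jne' target shown by objdump
ON_CPU_OFFSET = 0x28             # offset of 'on_cpu' reported by pahole


def field(line, name):
    """Return the hex value that follows 'name:' in a register or task line."""
    # The lookbehind keeps 'ti' from matching inside 'task.ti'.
    match = re.search(r"(?<![\w.])" + re.escape(name) + r":\s*([0-9a-f]+)", line)
    if match is None:
        raise ValueError(f"{name} not found in {line!r}")
    return int(match.group(1), 16)


task_on_cpu10 = field(CPU10_TASK_LINE, "task")
rbx_on_cpu6 = field(CPU6_RBX_LINE, "RBX")

# CPU 6 spins on the task_struct that is current on CPU 10.
same_task = task_on_cpu10 == rbx_on_cpu6

# Address that the 'mov 0x28(%rbx),%eax' instruction reads.
on_cpu_addr = rbx_on_cpu6 + ON_CPU_OFFSET

# Function start derived from the '+0x86' in the RIP line, then the
# offset of the loop head relative to it.
func_base = CPU6_RIP - 0x86
loop_head_offset = LOOP_HEAD - func_base

print(f"same task_struct : {same_task}")
print(f"&p->on_cpu       : {on_cpu_addr:#x}")
print(f"loop head offset : {loop_head_offset:#x}")
```

The loop-head offset computed this way agrees with the '<try_to_wake_up+0x81>' annotation in the disassembly, which is a cheap sanity check that the RIP and the objdump listing refer to the same build.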

Given the stack trace on CPU 10, the identical syslog timestamp of the two hard lockup messages (their kernel timestamps are about 3 seconds apart), and the fact that both stack traces are cpu_stopper related,
it does look like CPU 10 is waiting on the spinlock of one of the 2 cpu stoppers held by CPU 6, which is exactly the scenario in the suggested patch.
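
The circular wait can be illustrated with a tiny wait-for graph. This is a simplified user-space model in Python, not kernel code: the resource names 'stopper_lock' and 'on_cpu', and the functions build_wait_graph(), find_cycle() and scenario(), are hypothetical labels chosen for this sketch. It assumes CPU 6 holds the stopper lock while it waits for 'p->on_cpu' to clear, and that CPU 10 keeps 'on_cpu' set while it waits for that lock. The second scenario models the fix as described in [Impact], where the wake-up happens after the lock is dropped.

```python
def build_wait_graph(holds, wants):
    """Map each waiter to the owner of the resource it is waiting for."""
    graph = {}
    for waiter, resource in wants.items():
        for owner, owned in holds.items():
            if owner != waiter and resource in owned:
                graph[waiter] = owner
    return graph


def find_cycle(waits_for):
    """Return the nodes of a cycle in the wait-for graph, or None."""
    for start in waits_for:
        seen = []
        node = start
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


def scenario(wake_under_lock):
    """Model the two CPUs with the wake-up under or outside the lock."""
    holds = {
        # CPU 6: stop_two_cpus() -> wake_up_process(); it holds the
        # stopper lock only if the wake-up is done under the lock.
        "cpu6": {"stopper_lock"} if wake_under_lock else set(),
        # CPU 10: still inside schedule() with migration/10 as prev,
        # so that task's on_cpu stays set until the switch completes.
        "cpu10": {"on_cpu"},
    }
    wants = {
        "cpu6": "on_cpu",         # try_to_wake_up() spins on p->on_cpu
        "cpu10": "stopper_lock",  # cpu_stop_queue_work() spins on the lock
    }
    return find_cycle(build_wait_graph(holds, wants))


print("wake-up under lock :", scenario(True))
print("wake-up after lock :", scenario(False))
```

With the wake-up under the lock, each CPU waits on something only the other can release; with the wake-up moved out, CPU 10 can take the lock, finish the context switch and clear 'on_cpu', so CPU 6 eventually proceeds.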

The problem/fix has been verified with a synthetic test-case (attached).

commit 0b26351b910fb8fe6a056f8a1bbccabe50c0e19f
Author: Peter Zijlstra <email address hidden>
Date: Fri Apr 20 11:50:05 2018 +0200

    stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock

    Matt reported the following deadlock:

    CPU0                            CPU1

    schedule(.prev=migrate/0)       <fault>
      pick_next_task()                ...
        idle_balance()                  migrate_swap()
          active_balance()                stop_two_cpus()
                 ...


Mauricio Faria de Oliveira (mfo) wrote :

Test-case (kmod-stopper.c)
---------

$ sudo apt-get -y install gcc make libelf-dev linux-headers-$(uname -r)

$ touch Makefile # fake it, and use this make line:
$ make -C /lib/modules/$(uname -r)/build M=$(pwd) obj-m=kmod-stopper.o modules

$ echo 9 | sudo tee /proc/sys/kernel/printk

$ sudo insmod kmod-stopper.ko
<watch console for messages>
<it either hangs / finishes>

$ sudo rmmod kmod-stopper

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1821259

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
Mauricio Faria de Oliveira (mfo) wrote :

Test-case on Xenial:

$ ls -1d /sys/devices/system/cpu/cpu[0-9]*
/sys/devices/system/cpu/cpu0
/sys/devices/system/cpu/cpu1

Original
--------

$ uname -rv
4.4.0-144-generic #170-Ubuntu SMP Thu Mar 14 11:56:20 UTC 2019

$ sudo insmod kmod-stopper/kmod-stopper.ko
[ 74.198379] mod_init() :: this cpu = 0x1, that cpu = 0x0
[ 74.199613] mod_init() :: that_cpu_stopper_task = ffff88003d80e600, comm = migration/0
[ 74.206194] kp2/stop_two_cpus() :: this cpu = 0x1, that cpu = 0x0
[ 74.206196] do_nothing() :: this cpu = 0x0, that cpu = 0x1
[ 74.206201] kp1/pick_next_task_fair() :: this cpu = 0x0, that cpu = 0x1
[ 74.206203] kp1/pick_next_task_fair() :: before sleep (1000 msecs)
[ 74.212759] kp2/stop_two_cpus() :: before sleep (500 msecs)
[ 74.710138] kp2/stop_two_cpus() :: after sleep (500 msecs)
[ 75.198324] kp1/pick_next_task_fair() :: after sleep (1000 msecs)
[ 75.199814] kp1/pick_next_task_fair() :: stopping other cpu...
<hang>

The test-case only hung the system (i.e., reproduced the problem) in 2 out of 50+ runs.
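
Because the hang is that rare, a handful of clean runs says little on its own. The following is a back-of-the-envelope sketch in Python, assuming that runs are independent and that "2 out of 50+" is read as exactly 2 hangs in 50 runs (both are assumptions, not measurements from this bug). It estimates how likely an unfixed kernel would be to survive a given number of runs without hanging; the function name prob_no_hang() is hypothetical.

```python
def prob_no_hang(hang_rate, runs):
    """Probability that `runs` independent attempts all finish without a hang."""
    return (1.0 - hang_rate) ** runs


# Assumption: exactly 2 hangs observed in 50 runs on the unpatched kernel.
observed_rate = 2 / 50

for runs in (10, 50, 100):
    chance = prob_no_hang(observed_rate, runs)
    print(f"{runs:3d} runs: P(no hang on an unfixed kernel) = {chance:.3f}")
```

Under these assumptions, a clean streak is only convincing when it is long relative to the observed hang rate, which is why the number of repetitions matters when comparing kernels.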

Patched
-------

$ uname -rv
4.4.0-144-generic #170+test20190320b1 SMP Wed Mar 20 18:35:06 UTC 2019

$ sudo insmod kmod-stopper/kmod-stopper.ko
[ 85.958527] mod_init() :: this cpu = 0x1, that cpu = 0x0
[ 85.965876] mod_init() :: that_cpu_stopper_task = ffff88003d80e600, comm = migration/0
[ 85.993446] kp2/stop_two_cpus() :: this cpu = 0x1, that cpu = 0x0
[ 85.993471] do_nothing() :: this cpu = 0x0, that cpu = 0x1
[ 85.993477] kp1/pick_next_task_fair() :: this cpu = 0x0, that cpu = 0x1
[ 85.993480] kp1/pick_next_task_fair() :: before sleep (1000 msecs)
[ 86.019469] kp2/stop_two_cpus() :: before sleep (500 msecs)
[ 86.521688] kp2/stop_two_cpus() :: after sleep (500 msecs)
[ 86.987662] kp1/pick_next_task_fair() :: after sleep (1000 msecs)
[ 86.989427] kp1/pick_next_task_fair() :: stopping other cpu...
[ 86.991109] do_nothing() :: this cpu = 0x1, that cpu = 0x0
[ 86.992615] do_nothing() :: this cpu = 0x1, that cpu = 0x0
<finished>

It passes every time (50+ tests).
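
When repeating the test many times, the two outcomes can be told apart from the console output alone. Below is a minimal Python sketch of such a check; it is a heuristic inferred only from the two logs above (the run that finishes prints do_nothing() again after "stopping other cpu...", the run that hangs does not), and the function name classify_run() is hypothetical. A log that was simply cut short would also be reported as a hang.

```python
def classify_run(dmesg_lines):
    """Classify one kmod-stopper run from its console output.

    Returns "finished" when do_nothing() is logged after the
    "stopping other cpu..." message, "hang" when nothing of the kind
    follows that message, and "incomplete" when it never appears.
    """
    stop_index = None
    for index, line in enumerate(dmesg_lines):
        if "stopping other cpu..." in line:
            stop_index = index
    if stop_index is None:
        return "incomplete"
    tail = dmesg_lines[stop_index + 1:]
    if any("do_nothing()" in line for line in tail):
        return "finished"
    return "hang"


# Tails of the two logs shown above.
ORIGINAL_TAIL = [
    "[ 75.198324] kp1/pick_next_task_fair() :: after sleep (1000 msecs)",
    "[ 75.199814] kp1/pick_next_task_fair() :: stopping other cpu...",
]
PATCHED_TAIL = [
    "[ 86.987662] kp1/pick_next_task_fair() :: after sleep (1000 msecs)",
    "[ 86.989427] kp1/pick_next_task_fair() :: stopping other cpu...",
    "[ 86.991109] do_nothing() :: this cpu = 0x1, that cpu = 0x0",
    "[ 86.992615] do_nothing() :: this cpu = 0x1, that cpu = 0x0",
]

print("original:", classify_run(ORIGINAL_TAIL))
print("patched :", classify_run(PATCHED_TAIL))
```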

Mauricio Faria de Oliveira (mfo) wrote :

Since Bionic already has the fix commit applied,
the original kernel version doesn't hit the problem.

Mauricio Faria de Oliveira (mfo) wrote :

Both xenial and bionic original/patched kernels
were tested with stress-ng scheduler class, and
no regressions were observed.

$ stress-ng --version
stress-ng, version 0.09.56 (gcc 8.3, x86_64 Linux 4.15.0-47-generic) 💻🔥

$ sudo stress-ng --class scheduler --sequential 0

$ uname -rv
4.4.0-144-generic #170-Ubuntu SMP Thu Mar 14 11:56:20 UTC 2019

$ uname -rv
4.4.0-144-generic #170+test20190320b1 SMP Wed Mar 20 18:35:06 UTC 2019

$ uname -rv
4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019

$ uname -rv
4.15.0-47-generic #50+test20190320b1 SMP Wed Mar 20 20:08:03 UTC 2019

Mauricio Faria de Oliveira (mfo) wrote :

[X][PATCH 0/4] LP#1821259 Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099427.html

[B][PATCH 0/2] Fix for LP#1821259 (pending patches for) Fix for deadlock in cpu_stopper
https://lists.ubuntu.com/archives/kernel-team/2019-March/099432.html

no longer affects: linux (Ubuntu)
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
tags: added: verification-needed-xenial
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Mauricio Faria de Oliveira (mfo) wrote :

Verification done on xenial-proposed.
The test case can no longer hang the system on the -proposed kernel (4.4.0-146), repeated 10 times;
the previous kernel (4.4.0-145) still hangs.

4.4.0-145-generic

$ sudo insmod kmod-stopper.ko
[ 324.201942] kmod_stopper: loading out-of-tree module taints kernel.
[ 324.205604] kmod_stopper: module verification failed: signature and/or required key missing - tainting kernel
[ 324.213641] mod_init() :: this cpu = 0x2, that cpu = 0x3
[ 324.214802] mod_init() :: that_cpu_stopper_task = ffff88013a97b300, comm = migration/3
[ 324.224825] kp2/stop_two_cpus() :: this cpu = 0x2, that cpu = 0x3
[ 324.224834] do_nothing() :: this cpu = 0x3, that cpu = 0x2
[ 324.224839] kp1/pick_next_task_fair() :: this cpu = 0x3, that cpu = 0x2
[ 324.224841] kp1/pick_next_task_fair() :: before spin (1000 msecs)
[ 324.230226] kp2/stop_two_cpus() :: before spin (500 msecs)
[ 324.727963] kp2/stop_two_cpus() :: after spin (500 msecs)
[ 325.217499] kp1/pick_next_task_fair() :: after spin (1000 msecs)
[ 325.218596] kp1/pick_next_task_fair() :: stopping other cpu...
<hangs>

4.4.0-146-generic

$ sudo insmod kmod-stopper.ko
[ 512.306797] mod_init() :: this cpu = 0x0, that cpu = 0x1
[ 512.308267] mod_init() :: that_cpu_stopper_task = ffff88013a913300, comm = migration/1
[ 512.318288] kp2/stop_two_cpus() :: this cpu = 0x0, that cpu = 0x1
[ 512.318298] do_nothing() :: this cpu = 0x1, that cpu = 0x0
[ 512.318335] kp1/pick_next_task_fair() :: this cpu = 0x1, that cpu = 0x0
[ 512.318337] kp1/pick_next_task_fair() :: before spin (1000 msecs)
[ 512.325303] kp2/stop_two_cpus() :: before spin (500 msecs)
[ 512.823132] kp2/stop_two_cpus() :: after spin (500 msecs)
[ 513.312125] kp1/pick_next_task_fair() :: after spin (1000 msecs)
[ 513.313440] kp1/pick_next_task_fair() :: stopping other cpu...
[ 513.314708] do_nothing() :: this cpu = 0x0, that cpu = 0x1
[ 513.315908] do_nothing() :: this cpu = 0x0, that cpu = 0x1

tags: added: verification-done-xenial
removed: verification-needed-xenial
Mauricio Faria de Oliveira (mfo) wrote :

Verification done on bionic-proposed.

The Bionic kernel already has the main fix patch;
the new patches just bring it up to date with the
incremental upstream fixes for the main fix patch.

No regressions observed between 4.15.0-{47,48}-generic
with `sudo stress-ng --class scheduler --sequential 0`.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-146.172

---------------
linux (4.4.0-146.172) xenial; urgency=medium

  * linux: 4.4.0-146.172 -proposed tracker (LP: #1822834)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * 3b080b2564287be91605bfd1d5ee985696e61d3c in ubuntu_btrfs_kernel_fixes
    triggers system hang on i386 (LP: #1812845)
    - btrfs: raid56: properly unmap parity page in finish_parity_scrub()

  * Xenial update: 4.4.177 upstream stable release (LP: #1822271)
    - ceph: avoid repeatedly adding inode to mdsc->snap_flush_list
    - numa: change get_mempolicy() to use nr_node_ids instead of MAX_NUMNODES
    - KEYS: allow reaching the keys quotas exactly
    - mfd: ti_am335x_tscadc: Use PLATFORM_DEVID_AUTO while registering mfd cells
    - mfd: twl-core: Fix section annotations on {,un}protect_pm_master
    - mfd: db8500-prcmu: Fix some section annotations
    - mfd: ab8500-core: Return zero in get_register_interruptible()
    - mfd: qcom_rpm: write fw_version to CTRL_REG
    - mfd: wm5110: Add missing ASRC rate register
    - mfd: mc13xxx: Fix a missing check of a register-read failure
    - net: hns: Fix use after free identified by SLUB debug
    - MIPS: ath79: Enable OF serial ports in the default config
    - scsi: qla4xxx: check return code of qla4xxx_copy_from_fwddb_param
    - scsi: isci: initialize shost fully before calling scsi_add_host()
    - MIPS: jazz: fix 64bit build
    - isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
    - atm: he: fix sign-extension overflow on large shift
    - leds: lp5523: fix a missing check of return value of lp55xx_read
    - isdn: avm: Fix string plus integer warning from Clang
    - RDMA/srp: Rework SCSI device reset handling
    - KEYS: user: Align the payload buffer
    - KEYS: always initialize keyring_index_key::desc_len
    - batman-adv: fix uninit-value in batadv_interface_tx()
    - net/packet: fix 4gb buffer limit due to overflow check
    - team: avoid complex list operations in team_nl_cmd_options_set()
    - sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
    - net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames
    - ARCv2: Enable unaligned access in early ASM code
    - Revert "bridge: do not add port to router list when receives query with
      source 0.0.0.0"
    - libceph: handle an empty authorize reply
    - drm/msm: Unblock writer if reader closes file
    - ASoC: Intel: Haswell/Broadwell: fix setting for .dynamic field
    - ALSA: compress: prevent potential divide by zero bugs
    - thermal: int340x_thermal: Fix a NULL vs IS_ERR() check
    - usb: dwc3: gadget: Fix the uninitialized link_state when udc starts
    - usb: gadget: Potential NULL dereference on allocation error
    - ASoC: dapm: change snprintf to scnprintf for possible overflow
    - ASoC: imx-audmux: change snprintf to scnprintf for possible overflow
    - ARC: fix __ffs return value to avoid build warnings
    - mac80211: fix miscounting of ttl-dropped frames
    - serial: fsl_lpuart: fix maximum acceptable baud rate with over-sampling
    - scsi: csiostor: fix NULL pointer de...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-48.51

---------------
linux (4.15.0-48.51) bionic; urgency=medium

  * linux: 4.15.0-48.51 -proposed tracker (LP: #1822820)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * 3b080b2564287be91605bfd1d5ee985696e61d3c in ubuntu_btrfs_kernel_fixes
    triggers system hang on i386 (LP: #1812845)
    - btrfs: raid56: properly unmap parity page in finish_parity_scrub()

  * [P9][LTCTest][Opal][FW910] cpupower monitor shows multiple stop Idle_Stats
    (LP: #1719545)
    - cpupower : Fix header name to read idle state name

  * [amdgpu] screen corruption when using touchpad (LP: #1818617)
    - drm/amdgpu/gmc: steal the appropriate amount of vram for fw hand-over (v3)
    - drm/amdgpu: Free VGA stolen memory as soon as possible.

  * [SRU][B/C/OEM]IOMMU: add kernel dma protection (LP: #1820153)
    - ACPICA: AML parser: attempt to continue loading table after error
    - ACPI / property: Allow multiple property compatible _DSD entries
    - PCI / ACPI: Identify untrusted PCI devices
    - iommu/vt-d: Force IOMMU on for platform opt in hint
    - iommu/vt-d: Do not enable ATS for untrusted devices
    - thunderbolt: Export IOMMU based DMA protection support to userspace
    - iommu/vt-d: Disable ATS support on untrusted devices

  * Add basic support to NVLink2 passthrough (LP: #1819989)
    - powerpc/powernv/npu: Do not try invalidating 32bit table when 64bit table is
      enabled
    - powerpc/powernv: call OPAL_QUIESCE before OPAL_SIGNAL_SYSTEM_RESET
    - powerpc/powernv: Export opal_check_token symbol
    - powerpc/powernv: Make possible for user to force a full ipl cec reboot
    - powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn
    - powerpc/powernv: Move npu struct from pnv_phb to pci_controller
    - powerpc/powernv/npu: Move OPAL calls away from context manipulation
    - powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
    - powerpc/pseries/npu: Enable platform support
    - powerpc/pseries: Remove IOMMU API support for non-LPAR systems
    - powerpc/powernv/npu: Check mmio_atsd array bounds when populating
    - powerpc/powernv/npu: Fault user page into the hypervisor's pagetable

  * Huawei Hi1822 NIC has poor performance (LP: #1820187)
    - net-next: hinic: fix a problem in free_tx_poll()
    - hinic: remove ndo_poll_controller
    - net-next/hinic: add checksum offload and TSO support
    - hinic: Fix l4_type parameter in hinic_task_set_tunnel_l4
    - net-next/hinic:replace multiply and division operators
    - net-next/hinic:add rx checksum offload for HiNIC
    - net-next/hinic:fix a bug in set mac address
    - net-next/hinic: fix a bug in rx data flow
    - net: hinic: fix null pointer dereference on pointer hwdev
    - hinic: optmize rx refill buffer mechanism
    - net-next/hinic:add shutdown callback
    - net-next/hinic: replace disable_irq_nosync/enable_irq

  * [CONFIG] please enable highdpi font FONT_TER16x32 (LP: #1819881)
    - Fonts: New Terminus large console font
    - [Config]: enable highdpi Terminus 16x32 font support

  * [19.04 FEAT] qeth: Enhanced link...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Jianan Wang (wangjianan-zju) wrote :

Hi there. Is this fix applied to kernel version 4.18.0-25? We upgraded to this kernel version on Ubuntu 18.04 and hit this issue, so we would like to know which upcoming stable version will contain the fix. Thanks.

Mauricio Faria de Oliveira (mfo) wrote :

Hi Jianan,

The 4.18 kernel is no longer a supported kernel in Ubuntu,
since Ubuntu Cosmic/18.10 reached 'End of Life' a long time ago.

You have to upgrade to the current HWE (hardware enablement)
kernel. 4.18 was the HWE kernel in the Cosmic timeframe; now
it's 5.4. Use something like:

$ sudo apt install linux-generic-hwe-18.04

Hope this helps,
Mauricio

Jianan Wang (wangjianan-zju) wrote :

Hi Mauricio, that’s very helpful and we will try that, thanks for your input on this!
