The 1st hard lockup is harder to get the interesting data out of, as apparently the registers with variables
related to the cpu number have been clobbered by more recent calls in the spinlock path.
Looking at the 2nd hard lockup:
addr2line + code shows us that try_to_wake_up() in line 1997 is indeed looping with IRQs disabled in line 1939 (thus a hard lockup):
$ addr2line -pifae ddeb-116.140/usr/lib/debug/boot/vmlinux-4.4.0-116-generic 0xffffffff810aacb6
0xffffffff810aacb6: try_to_wake_up at /build/linux-lts-xenial-ozsla7/linux-lts-xenial-4.4.0/kernel/sched/core.c:1997
1926 static int
1927 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
1928 {
...
1939 raw_spin_lock_irqsave(&p->pi_lock, flags);
...
1993 /*
1994 * If the owning (remote) cpu is still in the middle of schedule() with
1995 * this task as prev, wait until its done referencing the task.
1996 */
1997 while (p->on_cpu)
1998 cpu_relax();
...
2027 raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2028
2029 return success;
2030 }
The objdump disassembly of try_to_wake_up() in vmlinux for the RIP instruction address (ffffffff810aacb6),
shows a while loop that just checks for non-zero 'p->on_cpu' and calls cpu_relax() (which translates to the 'pause' instruction):
And that matches the other hard locked up task on CPU 10 (see its 'task:' field).
Per the stack trace in CPU 10, and the identical timestamp of the two hard lockup messages, and the fact both stack traces are cpu_stopper related,
it does look like CPU 10 is waiting on the spinlock of one of the 2 cpu stoppers held by CPU 6, which is exactly the scenario in the suggested patch.
The problem/fix has been verified with a synthetic test-case (attached).
Fix this deadlock by taking the wakeups out from under stopper->lock.
This allows the active_balance() to queue the stop work and finish the
context switch, which in turn allows the wakeup from migrate_swap() to
observe the context and complete the wakeup.
<...>
The stop_two_cpus() call can only happen in a NUMA system per it's caller chain:
stop_two_cpus() <- migrate_swap() <- task_numa_migrate() <- numa_migrate_preferred() <- [task_numa_placement()] <- task_numa_fault()
Analysis
--------
The 1st hard lockup is harder to get the interesting data out of, as apparently the registers with variables
related to the cpu number have been clobbered by more recent calls in the spinlock path.
Looking at the 2nd hard lockup:
addr2line + code shows us that try_to_wake_up() in line 1997 is indeed looping with IRQs disabled in line 1939 (thus a hard lockup):
$ addr2line -pifae ddeb-116. 140/usr/ lib/debug/ boot/vmlinux- 4.4.0-116- generic 0xffffffff810aacb6 10aacb6: try_to_wake_up at /build/ linux-lts- xenial- ozsla7/ linux-lts- xenial- 4.4.0/kernel/ sched/core. c:1997
0xffffffff8
1926 static int wake_up( struct task_struct *p, unsigned int state, int wake_flags) lock_irqsave( &p->pi_ lock, flags); unlock_ irqrestore( &p->pi_ lock, flags);
1927 try_to_
1928 {
...
1939 raw_spin_
...
1993 /*
1994 * If the owning (remote) cpu is still in the middle of schedule() with
1995 * this task as prev, wait until its done referencing the task.
1996 */
1997 while (p->on_cpu)
1998 cpu_relax();
...
2027 raw_spin_
2028
2029 return success;
2030 }
The objdump disassembly of try_to_wake_up() in vmlinux for the RIP instruction address (ffffffff810aacb6),
shows a while loop that just checks for non-zero 'p->on_cpu' and calls cpu_relax() (which translates to the 'pause' instruction):
ffffffff810 aacb1: f3 90 pause aacb3: 8b 43 28 mov 0x28(%rbx),%eax aacb6: 85 c0 test %eax,%eax aacb8: 75 f7 jne ffffffff810aacb1 <try_to_ wake_up+ 0x81>
ffffffff810
ffffffff810
ffffffff810
So, it checks for the value in pointer in RBX + offset 0x28, which according to the 'pahole' tool, is indeed the 'on_cpu' field:
$ pahole --hex -C task_struct ddeb-116. 140/usr/ lib/debug/ boot/vmlinux- 4.4.0-116- generic | grep on_cpu
int on_cpu; /* 0x28 0x4 */
So, the task_struct pointer is in RBX, which is:
RBX: ffff883ff2a76200
And that matches the other hard locked up task on CPU 10 (see its 'task:' field).
Per the stack trace in CPU 10, and the identical timestamp of the two hard lockup messages, and the fact both stack traces are cpu_stopper related,
it does look like CPU 10 is waiting on the spinlock of one of the 2 cpu stoppers held by CPU 6, which is exactly the scenario in the suggested patch.
The problem/fix has been verified with a synthetic test-case (attached).
commit 0b26351b910fb8f e6a056f8a1bbcca be50c0e19f
Author: Peter Zijlstra <email address hidden>
Date: Fri Apr 20 11:50:05 2018 +0200
stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
Matt reported the following deadlock:
CPU0 CPU1
schedule( .prev=migrate/ 0) <fault> next_task( ) ...
idle_balance( ) migrate_swap()
active_ balance( ) stop_two_cpus()
spin_ lock(stopper0- >lock)
spin_ lock(stopper1- >lock)
ttwu( migrate/ 0)
smp_ cond_load_ acquire( ) -- waits for schedule()
stop_ one_cpu( 1)
spin_lock( stopper1- >lock) -- waits for stopper lock
pick_
Fix this deadlock by taking the wakeups out from under stopper->lock.
This allows the active_balance() to queue the stop work and finish the
context switch, which in turn allows the wakeup from migrate_swap() to
observe the context and complete the wakeup.
<...>
The stop_two_cpus() call can only happen in a NUMA system per it's caller chain: preferred( ) <- [task_numa_ placement( )] <- task_numa_fault()
stop_two_cpus() <- migrate_swap() <- task_numa_migrate() <- numa_migrate_