Comment 4 for bug 1832915

Christian Ehrhardt (paelzer) wrote:

With verbose logging enabled, my numad log file shows:

Mon Jun 17 06:22:53 2019: Nodes: 2
Min CPUs free: 1416, Max CPUs: 1423, Avg CPUs: 1419, StdDev: 3.53553
Min MBs free: 12869, Max MBs: 13756, Avg MBs: 13312, StdDev: 443.5
Node 0: MBs_total 65266, MBs_free 12869, CPUs_total 2000, CPUs_free 1416, Distance: 10 40 CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free 13756, CPUs_total 2000, CPUs_free 1423, Distance: 40 10 CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:22:53 2019: Processes: 1563
Mon Jun 17 06:22:53 2019: Candidates: 2
101867853: PID 120072: (qemu-system-ppc), Threads 23, MBs_size 55763, MBs_used 50509, CPUs_used 876, Magnitude 44245884, Nodes: 0,8
101867853: PID 120206: (qemu-system-ppc), Threads 23, MBs_size 55821, MBs_used 23699, CPUs_used 279, Magnitude 6612021, Nodes: 0,8
Mon Jun 17 06:22:53 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)
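
As a side check on the candidate lines above: the logged Magnitude values equal MBs_used multiplied by CPUs_used (for example 50509 * 876 = 44245884). The Python sketch below re-derives that from a log line. The regex and the helper name check_magnitude are hypothetical and not part of numad, and the product relation is inferred from these logs only, not taken from the numad source.

```python
import re

# Hypothetical pattern for numad candidate lines as they appear in this log.
CANDIDATE_RE = re.compile(
    r"PID (?P<pid>\d+): .*?MBs_used (?P<mbs>\d+), CPUs_used (?P<cpus>\d+), "
    r"Magnitude (?P<mag>\d+)"
)

def check_magnitude(line):
    """Return (pid, logged_magnitude, mbs_used * cpus_used), or None if no match."""
    m = CANDIDATE_RE.search(line)
    if m is None:
        return None
    computed = int(m["mbs"]) * int(m["cpus"])
    return int(m["pid"]), int(m["mag"]), computed

# The two candidate lines from the verbose log above.
lines = [
    "101867853: PID 120072: (qemu-system-ppc), Threads 23, MBs_size 55763, "
    "MBs_used 50509, CPUs_used 876, Magnitude 44245884, Nodes: 0,8",
    "101867853: PID 120206: (qemu-system-ppc), Threads 23, MBs_size 55821, "
    "MBs_used 23699, CPUs_used 279, Magnitude 6612021, Nodes: 0,8",
]

for line in lines:
    pid, logged, computed = check_magnitude(line)
    print(pid, logged, computed, logged == computed)
```

For both lines the logged and computed values agree, which is why PID 120072 (the larger product) is the process picked for advice.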

With debug logging enabled, the last messages before it died looked like this:

Another run #2:
Mon Jun 17 06:25:08 2019: Nodes: 2
Min CPUs free: 302, Max CPUs: 439, Avg CPUs: 370, StdDev: 68.5018
Min MBs free: 1597, Max MBs: 4548, Avg MBs: 3072, StdDev: 1475.5
Node 0: MBs_total 65266, MBs_free 1597, CPUs_total 2000, CPUs_free 302, Distance: 10 40 CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free 4548, CPUs_total 2000, CPUs_free 439, Distance: 40 10 CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:25:08 2019: Processes: 1572
Mon Jun 17 06:25:08 2019: Candidates: 2
101881395: PID 120072: (qemu-system-ppc), Threads 25, MBs_size 55763, MBs_used 50523, CPUs_used 1995, Magnitude 100793385, Nodes: 0,8
101881395: PID 120206: (qemu-system-ppc), Threads 25, MBs_size 55821, MBs_used 45916, CPUs_used 830, Magnitude 38110280, Nodes: 0,8
Mon Jun 17 06:25:08 2019: PICK NODES FOR: PID: 120072, CPUs 2347, MBs 59438
Mon Jun 17 06:25:08 2019: PROCESS_MBs[0]: 17481
Mon Jun 17 06:25:08 2019: Node[0]: mem: 201700 cpu: 5952
Mon Jun 17 06:25:08 2019: Node[1]: mem: 45480 cpu: 2634
Mon Jun 17 06:25:08 2019: Totmag[0]: 12080055
Mon Jun 17 06:25:08 2019: Totmag[1]: 1948267
Mon Jun 17 06:25:08 2019: best_node_ix: 0
Mon Jun 17 06:25:08 2019: Node: 0 Dist: 10 Magnitude: 1200518400
Mon Jun 17 06:25:08 2019: Node: 8 Dist: 40 Magnitude: 119794320
Mon Jun 17 06:25:08 2019: MBs: 59438, CPUs: 2347
Mon Jun 17 06:25:08 2019: Assigning resources from node 0
Mon Jun 17 06:25:08 2019: Node[0]: mem: 1000 cpu: 0
Mon Jun 17 06:25:08 2019: MBs: 39368, CPUs: 1355
Mon Jun 17 06:25:08 2019: Assigning resources from node 1
Mon Jun 17 06:25:08 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)

Another run #3:
Mon Jun 17 06:26:46 2019: Nodes: 2
Min CPUs free: 889, Max CPUs: 1048, Avg CPUs: 968, StdDev: 79.5016
Min MBs free: 1291, Max MBs: 3484, Avg MBs: 2387, StdDev: 1096.5
Node 0: MBs_total 65266, MBs_free 1291, CPUs_total 2000, CPUs_free 889, Distance: 10 40 CPUs: 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76
Node 1: MBs_total 65337, MBs_free 3484, CPUs_total 2000, CPUs_free 1048, Distance: 40 10 CPUs: 80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156
Mon Jun 17 06:26:46 2019: Processes: 1546
Mon Jun 17 06:26:46 2019: Candidates: 2
101891156: PID 120072: (qemu-system-ppc), Threads 23, MBs_size 55763, MBs_used 50593, CPUs_used 1437, Magnitude 72702141, Nodes: 0,8
101891156: PID 120206: (qemu-system-ppc), Threads 23, MBs_size 55821, MBs_used 48065, CPUs_used 613, Magnitude 29463845, Nodes: 0,8
Mon Jun 17 06:26:46 2019: PICK NODES FOR: PID: 120072, CPUs 1690, MBs 59521
Mon Jun 17 06:26:46 2019: PROCESS_MBs[0]: 17527
Mon Jun 17 06:26:46 2019: Node[0]: mem: 199130 cpu: 8316
Mon Jun 17 06:26:46 2019: Node[1]: mem: 34840 cpu: 6288
Mon Jun 17 06:26:46 2019: Totmag[0]: 16559650
Mon Jun 17 06:26:46 2019: Totmag[1]: 2190739
Mon Jun 17 06:26:46 2019: best_node_ix: 0
Mon Jun 17 06:26:46 2019: Node: 0 Dist: 10 Magnitude: 1655965080
Mon Jun 17 06:26:46 2019: Node: 8 Dist: 40 Magnitude: 219073920
Mon Jun 17 06:26:46 2019: MBs: 59521, CPUs: 1690
Mon Jun 17 06:26:46 2019: Assigning resources from node 0
Mon Jun 17 06:26:46 2019: Node[0]: mem: 1000 cpu: 0
Mon Jun 17 06:26:46 2019: MBs: 39708, CPUs: 304
Mon Jun 17 06:26:46 2019: Assigning resources from node 1
Mon Jun 17 06:26:46 2019: Advising pid 120072 (qemu-system-ppc) move from nodes (0,8) to nodes (0,8)

Your crash was around:
Thu Feb 21 00:12:10 2019: Assigning resources from node 5
Thu Feb 21 00:12:10 2019: Assigning resources from node 2
Thu Feb 21 00:12:10 2019: Process 88781 already 100 percent localized to target nodes.

Mine also seems to die as soon as it reaches "Assigning resources".
The daemon does this step anyway, but evidently more often under actual memory load.
So far it all fits together; let's try to find what it accesses when it fails.
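
To compare where different logs stop, one option is to strip the timestamps and look only at the last few messages. The sketch below is a minimal Python helper under stated assumptions: the name tail_messages and the timestamp regex are hypothetical, and the regex is based solely on the "Thu Feb 21 00:12:10 2019: " prefix format seen in these excerpts.

```python
import re

# Assumed timestamp prefix, e.g. "Mon Jun 17 06:25:08 2019: ".
TS_RE = re.compile(r"^\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}: ")

def tail_messages(log_text, count=3):
    """Return the last `count` timestamped messages with the timestamp removed."""
    msgs = [TS_RE.sub("", line) for line in log_text.splitlines()
            if TS_RE.match(line)]
    return msgs[-count:]

# Excerpt from the original report, copied from above.
reported = """\
Thu Feb 21 00:12:10 2019: Assigning resources from node 5
Thu Feb 21 00:12:10 2019: Assigning resources from node 2
Thu Feb 21 00:12:10 2019: Process 88781 already 100 percent localized to target nodes.
"""

for msg in tail_messages(reported):
    print(msg)
```

Running the same helper over each of the runs above makes it easy to see that "Assigning resources from node N" appears among the final messages in every case.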