fullstack job failing to create a namespace, hitting kernel deadlock

Bug #1715660 reported by Thomas Morin
This bug affects 1 person
Affects   Status     Importance  Assigned to  Milestone
BaGPipe   Confirmed  High        Unassigned
Linux     New        Undecided   Unassigned
neutron   New        Undecided   Unassigned

Bug Description

The networking-bagpipe fullstack job hits the following kernel issue when new tests are added that use more network namespaces:

Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: INFO: task ip:1358 blocked for more than 120 seconds.
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: Tainted: G OE 4.4.0-93-generic #116-Ubuntu
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: ip D ffff880166acfdc8 0 1358 1356 0x00000000
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: ffff880166acfdc8 ffff880166acfd98 ffff880205a88000 ffff8800eb29d940
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: ffff880166ad0000 ffffffff81ef78a4 ffff8800eb29d940 00000000ffffffff
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: ffffffff81ef78a8 ffff880166acfde0 ffffffff8183f0d5 ffffffff81ef78a0
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: Call Trace:
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff8183f0d5>] schedule+0x35/0x80
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff8183f37e>] schedule_preempt_disabled+0xe/0x10
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff81840fb9>] __mutex_lock_slowpath+0xb9/0x130
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff8184104f>] mutex_lock+0x1f/0x30
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff8172da4e>] copy_net_ns+0x6e/0x120
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff810a174b>] create_new_namespaces+0x11b/0x1d0
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff810a198a>] unshare_nsproxy_namespaces+0x5a/0xb0
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff81080b41>] SyS_unshare+0x1f1/0x3a0
Sep 01 14:48:40 ubuntu-xenial-rax-dfw-10739585 kernel: [<ffffffff818431f2>] entry_SYSCALL_64_fastpath+0x16/0x71

( http://logs.openstack.org/66/500066/1/check/gate-networking-bagpipe-dsvm-fullstack-ubuntu-xenial-nv/99f751d/logs/syslog.txt.gz )

(The command that is blocked is an "ip netns add ..." command.)
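One way to check whether a box is currently affected is to run the namespace creation under a timeout instead of waiting for the kernel's 120-second hung-task report. The following is a minimal sketch, not part of the fullstack job: the helper name `probe_command`, the 10-second default and the namespace name in the usage note are all hypothetical.

```python
import subprocess
import time


def probe_command(cmd, timeout_s=10.0):
    """Run cmd and return (returncode, elapsed_seconds).

    returncode is None when the command did not finish within timeout_s,
    which is the symptom described in this bug for "ip netns add".
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a task stuck in uninterruptible sleep (state D)
        # will not exit until the kernel lock it waits on is released, so
        # we deliberately do not wait for it again here.
        proc.kill()
        return None, time.monotonic() - start
    return returncode, time.monotonic() - start
```

As root, `probe_command(["ip", "netns", "add", "probe-ns"])` returning `None` as the first element would suggest the hang is present; on success the test namespace can be removed again with `ip netns delete probe-ns`.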

This happens in the OpenStack CI on Ubuntu kernel 4.4.0-93-generic.

On another box (not OpenStack CI, Ubuntu kernel 4.8.0-49), this issue seems to be correlated with many "unregister_netdevice: waiting for lo to become free. Usage count = X" messages (with varying values for X: 1, 3, 6).
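To compare boxes or CI runs, the two kinds of kernel messages quoted in this report can be counted from a syslog. This is a minimal sketch that relies only on the message formats shown above; the function name `summarize_kernel_log` is hypothetical.

```python
import re
from collections import Counter

# Patterns follow the kernel messages quoted in this report.
HUNG_TASK_RE = re.compile(
    r'INFO: "?task (\S+):(\d+) blocked for more than (\d+) seconds'
)
UNREGISTER_RE = re.compile(
    r"unregister_netdevice: waiting for (\S+) to become free\. "
    r"Usage count = (\d+)"
)


def summarize_kernel_log(lines):
    """Return (hung_tasks, usage_counts) for an iterable of log lines.

    hung_tasks is a list of (command, pid) tuples; usage_counts maps
    (device, usage_count) to the number of times that message occurred.
    """
    hung_tasks = []
    usage_counts = Counter()
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hung_tasks.append((match.group(1), int(match.group(2))))
            continue
        match = UNREGISTER_RE.search(line)
        if match:
            usage_counts[(match.group(1), int(match.group(2)))] += 1
    return hung_tasks, usage_counts
```

A compressed log such as the syslog.txt.gz linked above can be passed in after download with `gzip.open(path, "rt", errors="replace")`, since the function accepts any iterable of strings.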

Thomas Morin (tmmorin-orange) wrote :

The "unregister_netdevice: waiting for lo to become free" logs are also present in the OpenStack CI in many kolla jobs, although these jobs don't have the kernel "task ... blocked for more than 120 seconds" message. Kolla may be hitting a different issue, or the same issue with the deadlock resolving before the 120s limit.

Thomas Morin (tmmorin-orange) wrote :

Possibly related:
* https://github.com/moby/moby/issues/5618
* bug 1711407
* https://<email address hidden>/msg179703.html

description: updated
Thomas Morin (tmmorin-orange) wrote :

Since we are using vxlan a lot, and in particular with multiple fdb entries for 00:00:00:00:00:00, it seems that this would be a possible cause:

https://marc.info/?l=linux-netdev&m=149607876811038&w=2

Is the corresponding patch included in Ubuntu Xenial?
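To see how many all-zero MAC entries each vxlan device carries on a given host, the output of `bridge fdb show` can be tallied. This is only a sketch: it assumes each line starts with the MAC address and contains a `dev <name>` pair, the function name is hypothetical, and the device names and addresses in the usage note are made up.

```python
from collections import Counter

ZERO_MAC = "00:00:00:00:00:00"


def count_zero_mac_entries(bridge_fdb_output):
    """Count fdb entries for the all-zero MAC, grouped by device name.

    bridge_fdb_output is the text printed by "bridge fdb show"; the line
    layout (MAC first, then "dev <name>") is an assumption of this sketch.
    """
    counts = Counter()
    for line in bridge_fdb_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != ZERO_MAC:
            continue
        if "dev" not in fields:
            continue
        dev_index = fields.index("dev") + 1
        if dev_index < len(fields):
            counts[fields[dev_index]] += 1
    return counts
```

For example, feeding it two lines of the form `00:00:00:00:00:00 dev vxlan42 dst 192.0.2.10 self permanent` with different `dst` values yields a count of 2 for `vxlan42`, which is the "multiple fdb entries" situation described above.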

Jakub Libosvar (libosvar) wrote :

It was merged into the upstream v4.13 kernel (https://github.com/torvalds/linux/commit/35cf2845563c1aaa01d27bd34d64795c4ae72700#diff-4f541554c5f8f378effc907c8f0c9115) and is not in the current Ubuntu Xenial 4.4.0 kernel.

Thomas Morin (tmmorin-orange) wrote :

This is the tooling I was mentioning earlier today:

http://www.brendangregg.com/perf.html

And the specific tracepoints that could be used to track this specific issue:

https://patchwork.ozlabs.org/patch/795005/

Jakub Libosvar (libosvar) wrote :

I found out that the patch I posted is probably not relevant, as dst_cache is not in the current Ubuntu 4.4.0 kernel.

Changed in networking-bagpipe:
status: New → Confirmed
importance: Undecided → High