kernel crash : net_sched race condition in tcindex_destroy()

Bug #1825942 reported by Viktor S. Wold Eide
This bug affects 2 people
Affects         Status        Importance  Assigned to   Milestone
linux (Ubuntu)  Confirmed     High        Andrea Righi
Bionic          Fix Released  High        Andrea Righi

Bug Description

[Impact]

It is possible to trigger a NULL pointer dereference in tcindex_delete() with a simple reproducer script. This happens because, in tcindex_set_parms(), when old_r doesn't exist we set the new exts to cr.exts, which can be uninitialized, and that triggers the NULL pointer dereference.

In addition, we may also hit a race condition in tcindex_destroy() (as pointed out in the original bug report and also here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921542#10). This is also fixed upstream, but the fix requires 4b79817f7add "net_sched: switch to rcu_work".

However, adding these changes introduces three memory leaks in cls_tcindex (which can easily be verified using the same test case). These leaks are also fixed upstream by 711ff09f3330 "net_sched: fix a memory leak in cls_tcindex" and 000d2aeda70c "net_sched: fix two more memory leaks in cls_tcindex", so we also need to backport these two additional fixes.
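
One possible way to look for such leaks after running the test case is kmemleak. This is only a sketch: the report does not say which tool was used, and it assumes a kernel built with CONFIG_DEBUG_KMEMLEAK and debugfs mounted at /sys/kernel/debug. The names `KMEMLEAK`, `count_leak_reports` and `scan_for_leaks` are made up for this example.

```shell
#!/bin/sh
# Sketch only: assumes kmemleak is available. The bug report does not
# state how the leaks were actually observed.
KMEMLEAK="${KMEMLEAK:-/sys/kernel/debug/kmemleak}"

# Count kmemleak records read from stdin. Each record starts with a
# line of the form "unreferenced object 0x... (size N):".
count_leak_reports() {
    grep -c '^unreferenced object' || true
}

# Ask kmemleak for an immediate scan, then print the number of records.
# Must be run as root on a kernel with kmemleak enabled.
scan_for_leaks() {
    echo scan > "$KMEMLEAK" && count_leak_reports < "$KMEMLEAK"
}
```

Running `scan_for_leaks` before and after the test case gives a rough comparison. Repeating the scan after a short wait can help, since a single scan may not report recently leaked objects.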

After all these fixes are applied the test case doesn't seem to trigger any bug.

[Test Case]

#!/bin/sh -ex

modprobe ifb

while true; do
    tc qdisc add dev ifb0 root handle 2:0 prio bands 5
    tc qdisc add dev ifb0 parent 2:5 sfq
    tc filter add dev ifb0 parent 2:0 protocol ip prio 5 handle 0 tcindex mask 0 classid 2:5 pass_on
    tc qdisc del dev ifb0 root || true
done
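
For automated testing it can be convenient to bound the loop instead of running it forever. The following is a sketch of such a variant, not the script used for verification. `TC`, `DEV`, `run_iteration` and `run_reproducer` are names introduced here; overriding `TC` (for example with `echo`) allows a dry run without touching the system. Unlike the original, it does not use `-e`, so a failing `tc ... add` skips to the delete step instead of aborting.

```shell
#!/bin/sh
# Bounded variant of the reproducer (sketch). TC and DEV are
# hypothetical overrides that are not part of the original script.
TC="${TC:-tc}"
DEV="${DEV:-ifb0}"

# One add/add/filter/delete cycle, using the same tc arguments as the
# original test case.
run_iteration() {
    $TC qdisc add dev "$DEV" root handle 2:0 prio bands 5 &&
        $TC qdisc add dev "$DEV" parent 2:5 sfq &&
        $TC filter add dev "$DEV" parent 2:0 protocol ip prio 5 \
            handle 0 tcindex mask 0 classid 2:5 pass_on
    # Always try to delete the root qdisc, as the original loop does.
    $TC qdisc del dev "$DEV" root || true
}

# Run the cycle a fixed number of times (default: 100).
run_reproducer() {
    count="${1:-100}"
    i=0
    while [ "$i" -lt "$count" ]; do
        run_iteration
        i=$((i + 1))
    done
}
```

As root, after `modprobe ifb`, calling `run_reproducer 1000` performs 1000 cycles. On an affected kernel the machine may crash before the loop finishes, so this should only be run on a test system.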

[Fix]

 * Fixes required to solve this problem:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2df8bee5654bb2b7312662ca6810d4dc16b0b67f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8015d93ebd27484418d4952284fd02172fa4b0b2
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=033b228e7f26b29ae37f8bfa1bc6b209a5365e9f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1db817e75f5b9387b8db11e37d5f0624eb9223e0
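
To check whether a given kernel tree already carries these fixes, the commit IDs can be pulled out of the URLs and searched for in the commit log. This is a sketch under one assumption: that backported commits mention the upstream ID in their message (for example in a "cherry picked from commit ..." line). The function names are made up for this example.

```shell
#!/bin/sh
# Print the 40-character commit ID found in each cgit URL on stdin.
commit_ids_from_urls() {
    sed -n 's/.*[?&]id=\([0-9a-f]\{40\}\).*/\1/p'
}

# For each commit ID on stdin, report whether any commit reachable from
# HEAD mentions it in its message. Assumes the current directory is a
# git checkout of the kernel tree being examined, and that backports
# reference the upstream ID (an assumption, see above).
check_backports() {
    while read -r id; do
        if git log --oneline --grep="$id" | grep -q .; then
            echo "$id: referenced"
        else
            echo "$id: not found"
        fi
    done
}
```

With the four URLs above saved in a file such as `fixes.txt` (a hypothetical name), `commit_ids_from_urls < fixes.txt | check_backports` prints one line per fix.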

[Regression Potential]

 * All fixes come from upstream and were tested on the affected platform; the backport changes are minimal.

[Original bug report]

I am running into a kernel crash issue using the latest Ubuntu 4.15 kernel.
It does not appear to have been fixed in Ubuntu-4.15.0-48.51.

This crash has also been reported for debian:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921542

The kernel crash issue was fixed in February in the Linux kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=056a17982adbd52b2a6c5ec6266cee4521cd931b

I did test one of the recent kernel-ppa/mainline kernels, more specifically:
linux-image-unsigned-4.19.34-041934-generic_4.19.34-041934.201904051741_amd64.deb
It seems to fix the problem, that is, no crashes experienced so far.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1825942

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Viktor S. Wold Eide (viktor-s-wold-eide) wrote :

I should have added that there were a couple of other commits related to this issue (memory leaks in cls_tcindex) that were also merged in.

Changed in linux (Ubuntu):
status: Incomplete → New
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1825942

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Viktor S. Wold Eide (viktor-s-wold-eide) wrote :

Log files should not be required, as this issue is already confirmed and also fixed in the Linux kernel.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Viktor S. Wold Eide (viktor-s-wold-eide) wrote :

I could have been more explicit. The thread linked to in the initial bug description also contains a simplified script by Ben Hutchings <email address hidden> that triggers the kernel crash (included below for reference):
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921542#10

The script triggers the kernel crash in the latest available Ubuntu kernel, that is 4.15.0.48, which affects bionic and xenial linux-hwe.

--- BEGIN ---
#!/bin/sh -ex

modprobe ifb

while true; do
    tc qdisc add dev ifb0 root handle 2:0 prio bands 5
    tc qdisc add dev ifb0 parent 2:5 sfq
    tc filter add dev ifb0 parent 2:0 protocol ip prio 5 handle 0 tcindex mask 0 classid 2:5 pass_on
    tc qdisc del dev ifb0 root || true
done
--- END ---

Viktor S. Wold Eide (viktor-s-wold-eide) wrote :

This crash is currently critical when using traffic control (tc) in one of the Ubuntu LTS releases, bionic and xenial linux-hwe.

I referred to a simple script in the debian bug tracking system that triggers the kernel crash. In my case a normal shutdown/reboot triggers the crash, when the kernel tries to perform cleanup for tc. This leaves the system hanging in a crashed state.

In the debian bug reporting system this bug had severity critical and it was fixed March 12th 2019.

Is there anything that can be done in order to get this fixed for Ubuntu LTS (bionic and xenial linux-hwe) during the SRU cycle 13-May through 02-June ?

Andrea Righi (arighi)
Changed in linux (Ubuntu):
assignee: nobody → Andrea Righi (arighi)
importance: Undecided → Medium
Andrea Righi (arighi)
tags: added: bionic cosmic
Andrea Righi (arighi)
description: updated
Viktor S. Wold Eide (viktor-s-wold-eide) wrote : Re: [Bug 1825942] Re: kernel crash : net_sched race condition in tcindex_destroy()

Thanks a lot. That's great.

Andrea Righi (arighi) wrote :

Fix against bionic submitted to the kernel ML: https://lists.ubuntu.com/archives/kernel-team/2019-May/100741.html

Viktor S. Wold Eide (viktor-s-wold-eide) wrote :

I expected the status to change to "Fix Committed" to indicate that the fix had been applied?

Andrea Righi (arighi)
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Andrea Righi (arighi)
Changed in linux (Ubuntu):
importance: Medium → High
Viktor S. Wold Eide (viktor-s-wold-eide) wrote :

As I understood it, the fix was unfortunately some days too late for the previous kernel SRU cycle. I then expected the fix to be applied for the current SRU cycle, that is, 03-Jun through 30-Jun. I still assume that is the case?

Kleber Sacilotto de Souza (kleber-souza) wrote :

Hi Viktor,

These patches were not applied for the SRU cycle "03-Jun through 30-Jun"; they are now committed for the cycle starting next week ("01-Jul through 21-Jul"). Please check https://kernel.ubuntu.com/ for the current schedules.

Thank you.

Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Viktor S. Wold Eide (viktor-s-wold-eide) wrote :

I have now tested the updated Linux kernel from bionic proposed. The
new kernel seems to solve the problem and the fix appears OK for :

linux-image-generic 4.15.0.55.57 amd64 Generic Linux kernel image
Linux 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Hence, I change the tag to verification-done-bionic

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-55.60

---------------
linux (4.15.0-55.60) bionic; urgency=medium

  * linux: 4.15.0-55.60 -proposed tracker (LP: #1834954)

  * Request backport of ceph commits into bionic (LP: #1834235)
    - ceph: use atomic_t for ceph_inode_info::i_shared_gen
    - ceph: define argument structure for handle_cap_grant
    - ceph: flush pending works before shutdown super
    - ceph: send cap releases more aggressively
    - ceph: single workqueue for inode related works
    - ceph: avoid dereferencing invalid pointer during cached readdir
    - ceph: quota: add initial infrastructure to support cephfs quotas
    - ceph: quota: support for ceph.quota.max_files
    - ceph: quota: don't allow cross-quota renames
    - ceph: fix root quota realm check
    - ceph: quota: support for ceph.quota.max_bytes
    - ceph: quota: update MDS when max_bytes is approaching
    - ceph: quota: add counter for snaprealms with quota
    - ceph: avoid iput_final() while holding mutex or in dispatch thread

  * QCA9377 isn't being recognized sometimes (LP: #1757218)
    - SAUCE: USB: Disable USB2 LPM at shutdown

  * hns: fix ICMP6 neighbor solicitation messages discard problem (LP: #1833140)
    - net: hns: fix ICMP6 neighbor solicitation messages discard problem
    - net: hns: fix unsigned comparison to less than zero

  * Fix occasional boot time crash in hns driver (LP: #1833138)
    - net: hns: Fix probabilistic memory overwrite when HNS driver initialized

  * use-after-free in hns_nic_net_xmit_hw (LP: #1833136)
    - net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()

  * hns: attempt to restart autoneg when disabled should report error
    (LP: #1833147)
    - net: hns: Restart autoneg need return failed when autoneg off

  * systemd 237-3ubuntu10.14 ADT test failure on Bionic ppc64el (test-seccomp)
    (LP: #1821625)
    - powerpc: sys_pkey_alloc() and sys_pkey_free() system calls
    - powerpc: sys_pkey_mprotect() system call

  * [UBUNTU] pkey: Indicate old mkvp only if old and curr. mkvp are different
    (LP: #1832625)
    - pkey: Indicate old mkvp only if old and current mkvp are different

  * [UBUNTU] kernel: Fix gcm-aes-s390 wrong scatter-gather list processing
    (LP: #1832623)
    - s390/crypto: fix gcm-aes-s390 selftest failures

  * System crashes on hot adding a core with drmgr command (4.15.0-48-generic)
    (LP: #1833716)
    - powerpc/numa: improve control of topology updates
    - powerpc/numa: document topology_updates_enabled, disable by default

  * Kernel modules generated incorrectly when system is localized to a non-
    English language (LP: #1828084)
    - scripts: override locale from environment when running recordmcount.pl

  * [UBUNTU] kernel: Fix wrong dispatching for control domain CPRBs
    (LP: #1832624)
    - s390/zcrypt: Fix wrong dispatching for control domain CPRBs

  * CVE-2019-11815
    - net: rds: force to destroy connection if t_sock is NULL in
      rds_tcp_kill_sock().

  * Sound device not detected after resume from hibernate (LP: #1826868)
    - drm/i915: Force 2*96 MHz cdclk on glk/cnl when audio power is enabled
    - drm/i915: Save the old CDCLK atomic state
...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
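
Since the fix was released in linux 4.15.0-55.60, a quick check on a Bionic system is to compare the ABI number of the running kernel against 55. The following is a sketch only: `kernel_has_tcindex_fix` is a name made up here, and it assumes release strings of the form reported by `uname -r` for the generic kernel (such as `4.15.0-55-generic`). Other flavours may use different numbering.

```shell
#!/bin/sh
# Return 0 if the given `uname -r` string is a 4.15.0 kernel with an
# ABI number of at least 55, 1 if it is an older 4.15.0 kernel, and 2
# if the string is not a 4.15.0 kernel (the check does not apply).
kernel_has_tcindex_fix() {
    release="$1"
    case "$release" in
        4.15.0-*) ;;
        *) return 2 ;;
    esac
    abi="${release#4.15.0-}"   # e.g. "55-generic"
    abi="${abi%%-*}"           # e.g. "55"
    [ "$abi" -ge 55 ]
}
```

For example, `kernel_has_tcindex_fix "$(uname -r)" && echo "fix included"` prints the message on 4.15.0-55-generic and later 4.15.0 kernels.
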
Brad Figg (brad-figg)
tags: added: cscc
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial