virtio_scsi race can corrupt memory, panic kernel

Bug #1765241 reported by Jay Vosburgh
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Confirmed
Undecided
Jay Vosburgh
Xenial
Fix Released
Medium
Unassigned

Bug Description

        There's a race in the virtio_scsi driver (for all kernels)

        That race is inadvertently avoided on most kernels due to a
synchronize_rcu call coincidentally placed in one of the racing code paths

        By happenstance, the set of patches backported to the Ubuntu
4.4 kernel ended up without a synchronize_rcu in the relevant place. The
issue first manifests with

commit be2a20802abbde04ae09846406d7b670731d97d2
Author: Jan Kara <email address hidden>
Date: Wed Feb 8 08:05:56 2017 +0100

    block: Move bdi_unregister() to del_gendisk()

    BugLink: http://bugs.launchpad.net/bugs/1659111

The race can cause a kernel panic due to corruption of a freelist
pointer in a slab cache. The panics occur in arbitrary places as
the failure occurs at an allocation after the corruption of the
pointer. However, the most common failure observed has been within
virtio_scsi itself during probe processing, e.g.:

[ 3.111628] [<ffffffff811b0f32>] kfree_const+0x22/0x30
[ 3.112340] [<ffffffff813db534>] kobject_release+0x94/0x190
[ 3.113126] [<ffffffff813db3c7>] kobject_put+0x27/0x50
[ 3.113838] [<ffffffff8153dee7>] put_device+0x17/0x20
[ 3.114568] [<ffffffff815ac6b2>] __scsi_remove_device+0x92/0xe0
[ 3.115401] [<ffffffff815a928b>] scsi_probe_and_add_lun+0x95b/0xe80
[ 3.116287] [<ffffffff811f1083>] ? kmem_cache_alloc_trace+0x183/0x1f0
[ 3.117227] [<ffffffff8154eb0b>] ? __pm_runtime_resume+0x5b/0x80
[ 3.118048] [<ffffffff815a9eaa>] __scsi_scan_target+0x10a/0x690
[ 3.118879] [<ffffffff815aa59e>] scsi_scan_channel+0x7e/0xa0
[ 3.119653] [<ffffffff815aa743>] scsi_scan_host_selected+0xf3/0x160
[ 3.120506] [<ffffffff815aa83d>] do_scsi_scan_host+0x8d/0x90
[ 3.121295] [<ffffffff815aaa0c>] do_scan_async+0x1c/0x190
[ 3.122048] [<ffffffff810a5748>] async_run_entry_fn+0x48/0x150
[ 3.122846] [<ffffffff8109c6b5>] process_one_work+0x165/0x480
[ 3.123732] [<ffffffff8109ca1b>] worker_thread+0x4b/0x4d0
[ 3.124508] [<ffffffff8109c9d0>] ? process_one_work+0x480/0x480

Details on the race:

CPU A:

virtscsi_probe
[...]
__scsi_scan_target
scsi_probe_and_add_lun [on return calls __scsi_remove_device, below]
scsi_probe_lun
[...]
blk_execute_rq

        blk_execute_rq waits for the completion event, and then on wakeup
returns up to scsi_probe_and_all_lun, which calls __scsi_remove_device.
In order for the race to occur, the wakeup must occur on a CPU other than
CPU B.

        After being woken up by the completion of the request, the call
returns up the stack to scsi_probe_and_add_lun, which calls
__scsi_remove_device:

__scsi_remove_device
blk_cleanup_queue
[ no longer calls bdi_unregister ]
scsi_target_reap(scsi_target(sdev))
scsi_target_reap_ref_put
kref_put
kref_sub
scsi_target_reap_ref_release
scsi_target_destroy
->target_destroy() = virtscsi_target_destroy
        kfree(tgt) <=== FREE TGT

        Note that the removal of the call to bdi_unregister in commit

  xenial be2a20802abbde block: Move bdi_unregister() to del_gendisk()

        and annotated above is the change that gates whether the issue
manifests or not. The other code change from be2a20802abbde has no effect
on the race.

CPU B:

vring_interrupt
virtscsi_complete_cmd
scsi_mq_done (via ->scsi_done())
scsi_mq_done
blk_mq_complete_request
__blk_mq_complete_request
[...]
blk_end_sync_rq
complete
[ wake up the task from CPU A ]

        After waking the CPU A task, execution returns up the stack, and
then calls atomic_dec(&tgt->reqs) in virtscsi_complete_cmd immediately
after returning from the call to ->scsi_done.

        If the activity on CPU A after it is woken up (starting at
__scsi_remove_device) finishes before CPU B can call atomic_dec() in
virtscsi_complete_cmd, then the atomic_dec() will modify a free list
pointer in the freed slab object that contained tgt. This causes the
system to panic on a subsequent allocation from the per-cpu slab cache.

        The call path on CPU B is significantly shorter than that on CPU A
after wakeup, so it is likely that an external event delays CPU B. This
could be either an interrupt within the VM or scheduling delays for the
virtual cpu process on the hypervisor. Whatever the delay is, it is not
the root cause but merely the triggering event.

        The virtscsi race window described above exists in all kernels
that have been checked (4.4 upstream LTS, Ubuntu 4.13 and 4.15, and
current mainline as of this writing). However, none of those kernels
exhibit the panic in testing, only the Ubuntu 4.4 kernel after commit
be2a20802abbde.

        The reason none of those kernels panic is they all have one thing
in common: an incidental call to synchronize_rcu somewhere in the call
path of CPU A after it is woken up. This causes CPU A to wait for CPU B's
activity, as CPU A will not go on to free the "tgt" memory until after the
RCU grace period passes, which requires waiting for CPU B's activity to
finish. Note that the specific RCU sync call is different between some of
those kernel versions, but all of them have one somewhere deep inside
blk_cleanup_queue. The bdi_unregister function has one (in the call to
bdi_remove_from_list), which is why removing that call opens the race
window on the Ubuntu 4.4 kernel.

        Resolving the issue can be accomplished by adding an RCU sync
to virtscsi_target_destroy prior to freeing the target. It is also possible
to use a loop of the format:

+ while (atomic_read(&tgt->reqs))
+ cpu_relax();

        but this is higher risk as the loop is non-terminating in the case
of other failure.

Jay Vosburgh (jvosburgh)
Changed in linux (Ubuntu):
assignee: nobody → Jay Vosburgh (jvosburgh)
status: New → Confirmed
Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

SRU Justification:

Impact:

 This issue can cause system panics of systems using the
virtio_scsi driver with the affected Ubuntu kernels. The issue manifests
irregularly, as it is timing dependent.

Fix:

 The issue is resolved by adding synchronization between the two
code paths that race with one another. The most straightforward fix
is to have the code wait for any outstanding
requests to drain prior to freeing the target structure, e.g.,

--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -762,6 +762,10 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
        struct virtio_scsi_target_state *tgt = starget->hostdata;
+
+ /* we can race with concurrent virtscsi_complete_cmd */
+ while (atomic_read(&tgt->reqs))
+ cpu_relax();
        kfree(tgt);
 }

An alternative fix that was considered is to use a synchronize_rcu_expedited
call, as that is the functionality that blocks the race in unaffected kernels.
However, some call paths into virtscsi_target_destroy may hold mutexes that
are not held by the upstream RCU sync calls (which enter via the block layer).
For this reason the more confined fix described above was chosen.

Testcase:

This reproduces on Google Cloud, using the current, unmodified
ubuntu-1404-lts image (with the Ubuntu 4.4 kernel). Using the two attached
scripts, run e.g.

  ./create_shutdown_instance.sh 100

to create 100 instances. If an instance runs its startup script
successfully, it'll shut itself down right away. So instances that are
still running after a few minutes likely demonstrate this problem.

The issue reproduces easily with n1-standard-4.

create_shutdown_instance.sh:

#!/bin/bash -e

ZONE=us-central1-a

for i in $(seq -w $1); do
  gcloud compute instances create shutdown-experiment-$i \
    --zone="${ZONE}" \
    --image-family=ubuntu-1404-lts \
    --image-project=ubuntu-os-cloud \
    --machine-type=n1-standard-4 \
    --scopes compute-rw \
    --metadata-from-file startup-script=immediate_shutdown.sh &
done

wait

immediate_shutdown.sh:

#!/bin/bash -x

function get_metadata_value() {
  curl -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/$1"
}

readonly ZONE="$(get_metadata_value zone | awk -F'/' '{print $NF}')"
gcloud compute instances delete "$(hostname)" --zone="${ZONE}" --quiet

Stefan Bader (smb)
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
status: New → In Progress
Stefan Bader (smb)
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Jay Vosburgh (jvosburgh)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (59.3 KiB)

This bug was fixed in the package linux - 4.4.0-127.153

---------------
linux (4.4.0-127.153) xenial; urgency=medium

  * CVE-2018-3639 (powerpc)
    - powerpc/pseries: Support firmware disable of RFI flush
    - powerpc/powernv: Support firmware disable of RFI flush
    - powerpc/rfi-flush: Move the logic to avoid a redo into the debugfs code
    - powerpc/rfi-flush: Make it possible to call setup_rfi_flush() again
    - powerpc/rfi-flush: Always enable fallback flush on pseries
    - powerpc/rfi-flush: Differentiate enabled and patched flush types
    - powerpc/rfi-flush: Call setup_rfi_flush() after LPM migration
    - powerpc/pseries: Add new H_GET_CPU_CHARACTERISTICS flags
    - powerpc: Add security feature flags for Spectre/Meltdown
    - powerpc/pseries: Set or clear security feature flags
    - powerpc/powernv: Set or clear security feature flags
    - powerpc/64s: Move cpu_show_meltdown()
    - powerpc/64s: Enhance the information in cpu_show_meltdown()
    - powerpc/powernv: Use the security flags in pnv_setup_rfi_flush()
    - powerpc/pseries: Use the security flags in pseries_setup_rfi_flush()
    - powerpc/64s: Wire up cpu_show_spectre_v1()
    - powerpc/64s: Wire up cpu_show_spectre_v2()
    - powerpc/pseries: Fix clearing of security feature flags
    - powerpc: Move default security feature flags
    - powerpc/pseries: Restore default security feature flags on setup
    - SAUCE: powerpc/64s: Add support for a store forwarding barrier at kernel
      entry/exit

  * CVE-2018-3639 (x86)
    - SAUCE: Clean up IBPB and IBRS control functions and macros
    - SAUCE: Fix up IBPB and IBRS kernel parameters documentation
    - SAUCE: Remove #define X86_FEATURE_PTI
    - x86/cpufeature: Move some of the scattered feature bits to x86_capability
    - x86/cpufeature: Cleanup get_cpu_cap()
    - x86/cpu: Probe CPUID leaf 6 even when cpuid_level == 6
    - x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
    - x86/cpufeatures: Add Intel feature bits for Speculation Control
    - SAUCE: x86/kvm: Expose SPEC_CTRL from the leaf
    - x86/cpufeatures: Add AMD feature bits for Speculation Control
    - x86/msr: Add definitions for new speculation control MSRs
    - SAUCE: x86/msr: Rename MSR spec control feature bits
    - x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
    - x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
    - x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
    - x86/speculation: Add <asm/msr-index.h> dependency
    - x86/cpufeatures: Clean up Spectre v2 related CPUID flags
    - x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
    - SAUCE: x86/speculation: Move vendor specific IBRS/IBPB control code
    - SAUCE: x86: Add alternative_msr_write
    - SAUCE: x86/nospec: Simplify alternative_msr_write()
    - SAUCE: x86/bugs: Concentrate bug detection into a separate function
    - SAUCE: x86/bugs: Concentrate bug reporting into a separate function
    - arch: Introduce post-init read-only memory
    - SAUCE: x86/bugs: Read SPEC_CTRL MSR during boot and re-use reserved bits
    - SAUCE: x86/bugs, KVM: Support the combination of guest a...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.