[ci] lvm jobs failing because of a reboot

Bug #1886988 reported by Goutham Pacha Ravi
This bug affects 1 person
Affects: OpenStack Shared File Systems Service (Manila)
Status: Fix Released
Importance: High
Assigned to: Tom Barron

Bug Description

Description
===========
The two voting LVM jobs, manila-tempest-plugin-lvm and manila-tempest-minimal-lvm-ipv6-only, started failing at the beginning of July 2020. The failures were observed consistently on rax nodes from the following Rackspace regions:

 rax-dfw
 rax-iad
 rax-ord

The failures appear to be caused by the devstack node rebooting in the middle of the job.

The observed sequence of operations is:
- A bunch of tests are run, successfully
- A reboot occurs
- manila-share service now cannot connect to the ephemeral vgs that were set up
- Further share creation attempts fail in the scheduler since the share service is down
- testr results are compiled, and all tests are marked as failed
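
One way to confirm that a node rebooted mid-job, as in the sequence above, is to compare the node's boot time with the job's start time. The following is a minimal sketch, not part of the CI jobs: the function names, the `slack` tolerance, and the idea of recording the job start time are assumptions made for illustration. It relies only on `/proc/uptime`, which Linux provides.

```python
import time


def read_uptime(path="/proc/uptime"):
    """Return the node's uptime in seconds (first field of /proc/uptime)."""
    with open(path) as f:
        return float(f.read().split()[0])


def rebooted_since(job_start_epoch, now_epoch, uptime_seconds, slack=5.0):
    """Return True if the node booted after the job started.

    The boot time is estimated as "now - uptime". The slack (hypothetical
    default of 5 seconds) absorbs small clock and rounding differences.
    """
    boot_epoch = now_epoch - uptime_seconds
    return boot_epoch > job_start_epoch + slack


if __name__ == "__main__":
    # Hypothetical usage: the job start time would normally be recorded
    # by the job itself; here we pretend the job started an hour ago.
    job_start = time.time() - 3600
    print(rebooted_since(job_start, time.time(), read_uptime()))
```

A check like this could run at the end of a job so that a reboot is reported directly instead of surfacing as a wall of failed tests.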

This issue seems to only occur on rax nodes so far, and hasn't occurred on any of the other cloud providers that Zuul could schedule the job on, which include Vexxhost, OVH, INAP, etc.

Logs and configuration from one of the failures have been attached to this bug report.

Reducing the test job concurrency did not resolve the issue, nor did reducing the LVM driver's backing file size. These attempts have been made here:

https://review.opendev.org/#/c/739775/5 (concurrency was set to 1, reboot still occurred)
https://review.opendev.org/#/c/740109/ (backing file size was halved, reboot still occurred)

Changed in manila:
importance: Undecided → Critical
Goutham Pacha Ravi (gouthamr) wrote :

At present, with scenario tests disabled, we're not seeing reboots. It's likely that something in spinning up virtual machines and providing LVM shares is affecting the test node.

One difference between RAX and other providers is that RAX uses Xen hypervisors while the others use KVM.

Might temporarily disable scenario tests to get the gate working again, and bump this issue down to "High":

https://review.opendev.org/740507/ (Legacy Job)
https://review.opendev.org/740109/ (New Zuulv3 style job)
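
To check whether a given test node is a Xen guest (the RAX case mentioned above), one option is to read `/sys/hypervisor/type`, which the Linux kernel exposes on Xen guests with the value `xen`. This is a minimal sketch with a hypothetical function name. It assumes that a missing file simply means "not Xen, or unknown" and makes no attempt to identify KVM.

```python
def hypervisor_type(path="/sys/hypervisor/type"):
    """Return the hypervisor type reported by sysfs, or None if unavailable.

    On Xen guests this file contains "xen". On other platforms the file
    is typically absent, in which case None is returned.
    """
    try:
        with open(path) as f:
            return f.read().strip() or None
    except OSError:
        return None


if __name__ == "__main__":
    print(hypervisor_type())
```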

Tom Barron (tpb) wrote :

Got a reboot during run on PS#6 of https://review.opendev.org/#/c/740265/ on rax-ord. Attaching zuul logs and netconsole capture log before the kernel panic, which appears to be due to this regression [1] in the 4.15.0-109 linux kernel and to affect Ubuntu focal. As I write, a patch has been proposed and awaits verification.

From the console at panic:

[ 8067.975296] general protection fault: 0000 [#1] SMP PTI
...
[ 8068.009514] CPU: 7 PID: 14143 Comm: handler59 Not tainted 4.15.0-109-generic #110-Ubuntu
[ 8068.012245] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 2:1.10.2-58953eb7 04/01/2014
[ 8068.014937] RIP: 0010:__cgroup_bpf_run_filter_skb+0xbb/0x1e0
...

[1] https://<email address hidden>/msg413671.html
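
For anyone triaging similar captures, the key facts in the console excerpt above (time of the fault, kernel version, faulting symbol) can be pulled out with a few regular expressions. This is a minimal sketch: the function name and the returned dictionary keys are made up for illustration, and the patterns are written only against the lines quoted above, so other panic formats may not match.

```python
import re

# The three informative lines from the console excerpt quoted above.
PANIC_EXCERPT = """\
[ 8067.975296] general protection fault: 0000 [#1] SMP PTI
[ 8068.009514] CPU: 7 PID: 14143 Comm: handler59 Not tainted 4.15.0-109-generic #110-Ubuntu
[ 8068.014937] RIP: 0010:__cgroup_bpf_run_filter_skb+0xbb/0x1e0
"""

# Timestamp (seconds since boot) of the general protection fault line.
FAULT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+general protection fault", re.M
)
# Kernel release from the "CPU: ... Comm: ..." line.
KERNEL_RE = re.compile(
    r"CPU: \d+ PID: \d+ Comm: \S+ (?:Not tainted|Tainted: .*?) "
    r"(?P<kernel>\d+\.\d+\.\d+\S*)"
)
# Symbol name from the instruction pointer line.
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:(?P<symbol>[A-Za-z0-9_.]+)\+0x")


def summarize_panic(log_text):
    """Extract fault time, kernel release and faulting symbol, if present."""
    fault = FAULT_RE.search(log_text)
    kernel = KERNEL_RE.search(log_text)
    rip = RIP_RE.search(log_text)
    return {
        "uptime_seconds": float(fault.group("ts")) if fault else None,
        "kernel": kernel.group("kernel") if kernel else None,
        "faulting_symbol": rip.group("symbol") if rip else None,
    }


if __name__ == "__main__":
    print(summarize_panic(PANIC_EXCERPT))
```

The kernel release and faulting symbol are the two values worth searching for in a distribution's bug tracker when looking for a matching regression report.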

Tom Barron (tpb) wrote :

Note that the above panic was triggered running with DNM patch https://review.opendev.org/#/c/740551 which re-enables the scenario tests for the purpose of working this bug.

Goutham Pacha Ravi (gouthamr) wrote :

> Attaching zuul logs and netconsole capture log before the kernel panic, which appears to be due to this regression [1] in the 4.15.0-109 linux kernel and to affect Ubuntu focal. As I write, a patch has been proposed and awaits verification.

Great stuff, thank you for your effort in investigating and isolating the root cause, Tom. Since we've currently disabled scenario tests in the LVM job, I'll bump the priority down to "High" and keep tabs on the kernel fix at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1886668.

Changed in manila:
importance: Critical → High
assignee: nobody → Tom Barron (tpb)
milestone: none → victoria-2
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to manila-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/741384

Changed in manila:
assignee: Tom Barron (tpb) → Goutham Pacha Ravi (gouthamr)
status: New → In Progress
Changed in manila:
assignee: Goutham Pacha Ravi (gouthamr) → Tom Barron (tpb)
OpenStack Infra (hudson-openstack) wrote : Fix merged to manila-tempest-plugin (master)

Reviewed: https://review.opendev.org/741384
Committed: https://git.openstack.org/cgit/openstack/manila-tempest-plugin/commit/?id=170cc45947459384b62a7f899f8c65fa7081bb3e
Submitter: Zuul
Branch: master

commit 170cc45947459384b62a7f899f8c65fa7081bb3e
Author: Goutham Pacha Ravi <email address hidden>
Date: Wed Jul 15 23:42:05 2020 -0700

    [ci] Re-enable scenario tests in the LVM job

    Bug #1886668 has now been addressed in ubuntu, and
    the new kernel no longer suffers from the problem
    that led to mid-test reboots like the previous ones
    did. So we can safely re-enable scenario tests in the
    gating LVM job.

    Change-Id: Iefcacfb83262eb8441fd524b4703491980b6a9d7
    Related-Bug: #1886668
    Closes-Bug: #1886988
    Signed-off-by: Goutham Pacha Ravi <email address hidden>

Changed in manila:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to manila (master)

Reviewed: https://review.opendev.org/740551
Committed: https://git.openstack.org/cgit/openstack/manila/commit/?id=3fc96e3ad87d88f0498f5b47f22488043e46e57a
Submitter: Zuul
Branch: master

commit 3fc96e3ad87d88f0498f5b47f22488043e46e57a
Author: Tom Barron <email address hidden>
Date: Sat Jul 11 08:43:28 2020 -0400

    [ci] Re-enable scenario tests for lvm job

    We temporarily disabled scenario tests in
    the voting LVM job [1] because of a bug
    in which a kernel problem caused a reboot
    in the middle of the job when it ran in a
    Xen virtualization environment on rax nodes.

    The kernel has been updated on those nodes
    and the problem has gone away, as evidenced
    in various checks jobs in this review, so
    let's reenable the scenario tests.
        .
    [1] https://review.opendev.org/#/c/740507/
    [2] https://launchpad.net/bugs/1886988

    Closes-bug: #1886988

    Change-Id: I501d0b6537653613a58ea1fe606f6ef66b8b6d38
