[ci] lvm jobs failing because of a reboot
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Shared File Systems Service (Manila) | Fix Released | High | Tom Barron | | |
Bug Description
Description
===========
The two voting LVM jobs: manila-
rax-dfw
rax-iad
rax-ord
The failures appear to be caused by the devstack node rebooting in the middle of the job.
The observed sequence of operations is:
- A bunch of tests are run, successfully
- A reboot occurs
- manila-share service now cannot connect to the ephemeral vgs that were set up
- further share creation attempts fail in the scheduler since the share service is down
- testr results are compiled, and all tests are marked as failed
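The reboot in the sequence above can be confirmed from the node itself by comparing the kernel boot time with the time the job started. The following is only a minimal sketch, not part of the actual job definitions: the helper names `read_boot_time` and `rebooted_during` are hypothetical, and it assumes a Linux node where `/proc/uptime` is readable and where the job start time was recorded as a Unix timestamp.

```python
import time


def read_boot_time(uptime_path="/proc/uptime"):
    """Return the node's boot time as a Unix timestamp (Linux only).

    The first field of /proc/uptime is the number of seconds since boot.
    """
    with open(uptime_path) as f:
        uptime_seconds = float(f.read().split()[0])
    return time.time() - uptime_seconds


def rebooted_during(job_start, boot_time, tolerance=5.0):
    """Return True if the node booted after the job had already started.

    job_start and boot_time are Unix timestamps; tolerance (seconds)
    absorbs small clock jitter between the two measurements.
    """
    return boot_time > job_start + tolerance


# Hypothetical usage at the end of a job, with job_start saved earlier:
#   if rebooted_during(job_start, read_boot_time()):
#       print("node rebooted while the job was running")
```

A boot time later than the job start would be consistent with the manila-share service losing its ephemeral volume groups partway through the run.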
This issue seems to only occur on rax nodes so far, and hasn't occurred on any of the other cloud providers that Zuul could schedule the job on, which include Vexxhost, OVH, INAP, etc.
Logs and configuration from one of the failures have been attached to this bug report.
Reducing the test job concurrency did not resolve the issue, nor did reducing the LVM driver's backing file size. These attempts have been made here:
https:/
https:/
Changed in manila: | |
importance: | Undecided → Critical |
Changed in manila: | |
assignee: | Tom Barron (tpb) → Goutham Pacha Ravi (gouthamr) |
status: | New → In Progress |
Changed in manila: | |
assignee: | Goutham Pacha Ravi (gouthamr) → Tom Barron (tpb) |
At present, with scenario tests disabled, we're not seeing reboots. It's likely that something in spinning up virtual machines and providing LVM shares to them is affecting the test node.
One difference between RAX and other providers is that RAX uses Xen hypervisors while the others use KVM.
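To correlate failures with the hypervisor type, a job could record which virtualization technology the test node runs under. This is a rough sketch under stated assumptions: it relies on the `systemd-detect-virt` command being installed on the node, and the helper names `detect_hypervisor` and `is_xen` are hypothetical, not part of any existing job.

```python
import subprocess


def detect_hypervisor():
    """Return the virtualization type reported by systemd-detect-virt.

    Returns "unknown" if the command is not installed or prints nothing.
    """
    try:
        result = subprocess.run(
            ["systemd-detect-virt"],
            capture_output=True,
            text=True,
            check=False,  # the command exits non-zero when no virtualization is found
        )
    except FileNotFoundError:
        return "unknown"
    return result.stdout.strip() or "unknown"


def is_xen(virt_name):
    """Return True if the reported virtualization type is a Xen variant."""
    return virt_name.strip().lower().startswith("xen")


# Hypothetical usage: log the result at the start of the job.
#   virt = detect_hypervisor()
#   print(f"hypervisor={virt} xen={is_xen(virt)}")
```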
Might temporarily disable scenario tests to get the gate working again, and bump this issue down to "High":
https://review.opendev.org/740507/ (Legacy Job)
https://review.opendev.org/740109/ (New Zuulv3 style job)