test_shelve_volume_backed_instance intermittently fails guest ssh with dhcp lease fail
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tempest | Fix Released | Undecided | Unassigned |
Bug Description
Seen here, but I've seen this a lot in the last few weeks:
https:/
We have bug 1848078 for the generic guest dhcp lease failure causing ssh to fail, but test_shelve_
2019-11-21 04:20:16.499344 | controller | {2} tempest.
2019-11-21 04:20:16.499446 | controller |
2019-11-21 04:20:16.499509 | controller | Captured traceback:
2019-11-21 04:20:16.499542 | controller | ~~~~~~~~~~~~~~~~~~~
2019-11-21 04:20:16.499588 | controller | b'Traceback (most recent call last):'
2019-11-21 04:20:16.499678 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.499712 | controller | b' sock=proxy_chan)'
2019-11-21 04:20:16.499835 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.499889 | controller | b' raise NoValidConnecti
2019-11-21 04:20:16.499994 | controller | b'paramiko.
2019-11-21 04:20:16.500016 | controller | b''
2019-11-21 04:20:16.500090 | controller | b'During handling of the above exception, another exception occurred:'
2019-11-21 04:20:16.500108 | controller | b''
2019-11-21 04:20:16.500153 | controller | b'Traceback (most recent call last):'
2019-11-21 04:20:16.500237 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.500292 | controller | b' return f(*func_args, **func_kwargs)'
2019-11-21 04:20:16.500407 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.500484 | controller | b' self._create_
2019-11-21 04:20:16.500622 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.500666 | controller | b' server=server)'
2019-11-21 04:20:16.500753 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.500785 | controller | b' server=server)'
2019-11-21 04:20:16.500874 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.500936 | controller | b' linux_client.
2019-11-21 04:20:16.501034 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.501086 | controller | b' six.reraise(
2019-11-21 04:20:16.501184 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.501214 | controller | b' raise value'
2019-11-21 04:20:16.501310 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.501407 | controller | b' return function(self, *args, **kwargs)'
2019-11-21 04:20:16.501531 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.501583 | controller | b' self.ssh_
2019-11-21 04:20:16.501673 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.501726 | controller | b' connection = self._get_
2019-11-21 04:20:16.501819 | controller | b' File "/opt/stack/
2019-11-21 04:20:16.501858 | controller | b' password=
2019-11-21 04:20:16.501944 | controller | b'tempest.
2019-11-21 04:20:16.501985 | controller | b'User: cirros, Password: None'
2019-11-21 04:20:16.502002 | controller | b'
From the guest console output:
2019-11-21 04:20:16.674184 | controller | b'Starting network...'
2019-11-21 04:20:16.674234 | controller | b'udhcpc (v1.23.2) started'
2019-11-21 04:20:16.674279 | controller | b'Sending discover...'
2019-11-21 04:20:16.674326 | controller | b'Sending discover...'
2019-11-21 04:20:16.674369 | controller | b'Sending discover...'
2019-11-21 04:20:16.674431 | controller | b'Usage: /sbin/cirros-dhcpc <up|down>'
2019-11-21 04:20:16.674489 | controller | b'No lease, failing'
2019-11-21 04:20:16.674557 | controller | b'WARN: /etc/rc3.
This is where the test fails:
So that's after the server is shelved and then unshelved. I wonder if the guest is somehow dirty when the snapshot is created. The create_timestamp method does run sync on the guest, though:
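For reference, the flow being discussed is roughly: ssh into the guest, write a timestamp file, then run sync so the data is on disk before the snapshot. Below is a minimal sketch of that flow; FakeSSHClient and the exact command strings are illustrative stand-ins, not tempest's actual RemoteClient API:

```python
class FakeSSHClient:
    """Stand-in for tempest's linux RemoteClient; just records commands."""

    def __init__(self):
        self.commands = []

    def exec_command(self, cmd):
        self.commands.append(cmd)
        return ""


def create_timestamp(ssh_client, target_file="/tmp/timestamp"):
    # Write the current epoch time to a file on the guest...
    ssh_client.exec_command("sudo sh -c 'date +%%s > %s'" % target_file)
    # ...then flush dirty pages so the write survives the snapshot.
    ssh_client.exec_command("sync")


client = FakeSSHClient()
create_timestamp(client)
```

The point of the sync is that without it, the timestamp could still be sitting in the guest page cache when the snapshot is taken, which is exactly the kind of dirtiness speculated about above.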
Other things to think about: given this shows up mostly in multinode jobs, does the server get unshelved on a different compute node from the one it was originally running on, and would that make a difference? It shouldn't. Or does it have something to do with slow nodes when this fails?
14 hits in 7 days on two multinode jobs.
I also wonder if https:/
Also note that in this test we're using a volume-backed server so we're writing to the root volume. I wonder if that is getting flushed properly when we disconnect and reconnect the volume? And I wonder if the test would be more stable if we wrote the timestamp to a secondary attached blank volume rather than the root volume that contains the OS image.
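The secondary-volume variant suggested above would look something like the following. This is only a sketch of the guest-side command sequence; the device name, mount point, and filesystem are assumptions, not values from the actual test:

```python
def timestamp_commands(device="/dev/vdb", mount_point="/mnt/ts"):
    """Guest commands to format, mount, and stamp a blank attached volume.

    Device and mount point are hypothetical; a real test would take the
    device name from the volume attachment it just created.
    """
    return [
        "sudo mkfs.ext4 %s" % device,            # blank volume needs a fs
        "sudo mkdir -p %s" % mount_point,
        "sudo mount %s %s" % (device, mount_point),
        "sudo sh -c 'date +%%s > %s/timestamp'" % mount_point,
        "sync",                                   # flush before detach/shelve
    ]


cmds = timestamp_commands()
```

The idea is that the timestamp then lives on a volume that never holds the OS image, so any flush weirdness on the root volume across shelve/unshelve is taken out of the picture.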
Confirmed that in the case above the instance is created and then shelved on the primary node:
https://zuul.opendev.org/t/openstack/build/1a340abeb89841409fd9940f5af31348/log/controller/logs/screen-n-cpu.txt.gz#28281
Nov 21 04:16:25.005754 ubuntu-bionic-ovh-bhs1-0012956850 nova-compute[26306]: INFO nova.compute.manager [None req-8f552e87-036a-4cce-bd51-eedf4b103beb tempest-TestShelveInstance-1466639456 tempest-TestShelveInstance-1466639456] [instance: 5ffada24-45ad-44ee-87c7-d2b69ce3d0ba] Shelve offloading
And unshelved on the subnode:
https://zuul.opendev.org/t/openstack/build/1a340abeb89841409fd9940f5af31348/log/compute1/logs/screen-n-cpu.txt.gz#27280
Nov 21 04:16:30.333184 ubuntu-bionic-ovh-bhs1-0012956851 nova-compute[20130]: INFO nova.compute.manager [None req-f799e533-11c4-42a0-bb60-72af77f31062 tempest-TestShelveInstance-1466639456 tempest-TestShelveInstance-1466639456] [instance: 5ffada24-45ad-44ee-87c7-d2b69ce3d0ba] Unshelving