test_shelve_volume_backed_instance intermittently fails guest ssh with dhcp lease fail

Bug #1853453 reported by Matt Riedemann
Affects: tempest
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Seen here, but I've seen this a lot in the last few weeks:

https://zuul.opendev.org/t/openstack/build/1a340abeb89841409fd9940f5af31348/log/job-output.txt#79242

We have bug 1848078 for the generic guest dhcp lease failure causing ssh to fail, but test_shelve_volume_backed_instance seems particularly prone to hitting it, so I'm opening this separate bug to see if there is something about that test that could be triggering this more often.

2019-11-21 04:20:16.499344 | controller | {2} tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance [306.264621s] ... FAILED
2019-11-21 04:20:16.499446 | controller |
2019-11-21 04:20:16.499509 | controller | Captured traceback:
2019-11-21 04:20:16.499542 | controller | ~~~~~~~~~~~~~~~~~~~
2019-11-21 04:20:16.499588 | controller | b'Traceback (most recent call last):'
2019-11-21 04:20:16.499678 | controller | b' File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection'
2019-11-21 04:20:16.499712 | controller | b' sock=proxy_chan)'
2019-11-21 04:20:16.499835 | controller | b' File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/paramiko/client.py", line 368, in connect'
2019-11-21 04:20:16.499889 | controller | b' raise NoValidConnectionsError(errors)'
2019-11-21 04:20:16.499994 | controller | b'paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.24.5.106'
2019-11-21 04:20:16.500016 | controller | b''
2019-11-21 04:20:16.500090 | controller | b'During handling of the above exception, another exception occurred:'
2019-11-21 04:20:16.500108 | controller | b''
2019-11-21 04:20:16.500153 | controller | b'Traceback (most recent call last):'
2019-11-21 04:20:16.500237 | controller | b' File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 89, in wrapper'
2019-11-21 04:20:16.500292 | controller | b' return f(*func_args, **func_kwargs)'
2019-11-21 04:20:16.500407 | controller | b' File "/opt/stack/tempest/tempest/scenario/test_shelve_instance.py", line 93, in test_shelve_volume_backed_instance'
2019-11-21 04:20:16.500484 | controller | b' self._create_server_then_shelve_and_unshelve(boot_from_volume=True)'
2019-11-21 04:20:16.500622 | controller | b' File "/opt/stack/tempest/tempest/scenario/test_shelve_instance.py", line 76, in _create_server_then_shelve_and_unshelve'
2019-11-21 04:20:16.500666 | controller | b' server=server)'
2019-11-21 04:20:16.500753 | controller | b' File "/opt/stack/tempest/tempest/scenario/manager.py", line 771, in get_timestamp'
2019-11-21 04:20:16.500785 | controller | b' server=server)'
2019-11-21 04:20:16.500874 | controller | b' File "/opt/stack/tempest/tempest/scenario/manager.py", line 501, in get_remote_client'
2019-11-21 04:20:16.500936 | controller | b' linux_client.validate_authentication()'
2019-11-21 04:20:16.501034 | controller | b' File "/opt/stack/tempest/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper'
2019-11-21 04:20:16.501086 | controller | b' six.reraise(*original_exception)'
2019-11-21 04:20:16.501184 | controller | b' File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/six.py", line 696, in reraise'
2019-11-21 04:20:16.501214 | controller | b' raise value'
2019-11-21 04:20:16.501310 | controller | b' File "/opt/stack/tempest/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper'
2019-11-21 04:20:16.501407 | controller | b' return function(self, *args, **kwargs)'
2019-11-21 04:20:16.501531 | controller | b' File "/opt/stack/tempest/tempest/lib/common/utils/linux/remote_client.py", line 116, in validate_authentication'
2019-11-21 04:20:16.501583 | controller | b' self.ssh_client.test_connection_auth()'
2019-11-21 04:20:16.501673 | controller | b' File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 207, in test_connection_auth'
2019-11-21 04:20:16.501726 | controller | b' connection = self._get_ssh_connection()'
2019-11-21 04:20:16.501819 | controller | b' File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection'
2019-11-21 04:20:16.501858 | controller | b' password=self.password)'
2019-11-21 04:20:16.501944 | controller | b'tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.106 via SSH timed out.'
2019-11-21 04:20:16.501985 | controller | b'User: cirros, Password: None'
2019-11-21 04:20:16.502002 | controller | b'

From the guest console output:

2019-11-21 04:20:16.674184 | controller | b'Starting network...'
2019-11-21 04:20:16.674234 | controller | b'udhcpc (v1.23.2) started'
2019-11-21 04:20:16.674279 | controller | b'Sending discover...'
2019-11-21 04:20:16.674326 | controller | b'Sending discover...'
2019-11-21 04:20:16.674369 | controller | b'Sending discover...'
2019-11-21 04:20:16.674431 | controller | b'Usage: /sbin/cirros-dhcpc <up|down>'
2019-11-21 04:20:16.674489 | controller | b'No lease, failing'
2019-11-21 04:20:16.674557 | controller | b'WARN: /etc/rc3.d/S40-network failed'

This is where the test fails:

https://github.com/openstack/tempest/blob/25572c3b5512aa3ecc949b926af1e1fd64fddfd9/tempest/scenario/test_shelve_instance.py#L74

So that's after the server is shelved and then unshelved. I wonder if the guest is dirty somehow when the snapshot is created. The create_timestamp method does run sync on the guest, though:

https://github.com/openstack/tempest/blob/25572c3b5512aa3ecc949b926af1e1fd64fddfd9/tempest/scenario/manager.py#L657
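
For reference, a rough paraphrase of what that helper does (a sketch only, not an exact copy of manager.py; the helper and SSH client method names are recalled from the tempest scenario manager and may differ in detail):

    # Sketch of the create_timestamp flow (paraphrased, not verbatim tempest code).
    def create_timestamp(self, ip_address, dev_name=None, mount_path='/mnt',
                         private_key=None, server=None):
        ssh_client = self.get_remote_client(ip_address,
                                            private_key=private_key,
                                            server=server)
        if dev_name is not None:
            # Optionally format and mount a secondary device instead of
            # writing to the root filesystem.
            ssh_client.make_fs(dev_name)
            ssh_client.exec_command('sudo mount /dev/%s %s'
                                    % (dev_name, mount_path))
        # The "sync" here is what should flush the timestamp to the volume
        # before we shelve the server.
        ssh_client.exec_command('sudo sh -c "date > %s/timestamp; sync"'
                                % mount_path)
        timestamp = ssh_client.exec_command('sudo cat %s/timestamp'
                                            % mount_path)
        if dev_name is not None:
            ssh_client.exec_command('sudo umount %s' % mount_path)
        return timestamp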

Other things to think about: given that this shows up mostly in multinode jobs, does the server get unshelved on a different compute node from the one it was originally running on, and would that make a difference? It shouldn't. Or is it something to do with slow nodes when this fails?

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22TestShelveInstance%3Atest_shelve_volume_backed_instance.%20Timeout%20trying%20to%20ssh%20to%20server%5C%22%20AND%20tags%3A%5C%22console%5C%22%20AND%20voting%3A1&from=7d

14 hits in 7 days on two multinode jobs.

I also wonder if https://review.opendev.org/#/c/670591/ might somehow cause issues with any test that creates a snapshot, like the shelve tests, where the guest is stopped, snapshotted, and then booted from that snapshot. Though I'd be surprised if a CirrOS guest responds to a graceful shutdown request.

Also note that in this test we're using a volume-backed server, so we're writing to the root volume. I wonder if that is getting flushed properly when we disconnect and reconnect the volume? And I wonder if the test would be more stable if we wrote the timestamp to a secondary attached blank volume rather than the root volume that contains the OS image.
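
If we went the secondary-volume route, the test change would look roughly like this (hypothetical sketch; it assumes the scenario manager's create_volume/nova_volume_attach helpers and a dev_name parameter on create_timestamp/get_timestamp, so treat the names and the device-name handling as assumptions, not a tested patch):

    # Hypothetical: write/read the timestamp on a blank attached volume
    # instead of the root volume.
    volume = self.create_volume()
    attachment = self.nova_volume_attach(server, volume)
    # e.g. 'vdb', derived from the attachment's device path (assumed format).
    dev_name = attachment['device'].split('/')[-1]

    timestamp = self.create_timestamp(instance_ip,
                                      dev_name=dev_name,
                                      private_key=keypair['private_key'],
                                      server=server)
    # ... shelve and unshelve the server here ...
    timestamp2 = self.get_timestamp(instance_ip,
                                    dev_name=dev_name,
                                    private_key=keypair['private_key'],
                                    server=server)
    self.assertEqual(timestamp, timestamp2)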

Revision history for this message
Matt Riedemann (mriedem) wrote :

Confirmed that in the case above the instance is created and then shelved on the primary node:

https://zuul.opendev.org/t/openstack/build/1a340abeb89841409fd9940f5af31348/log/controller/logs/screen-n-cpu.txt.gz#28281

Nov 21 04:16:25.005754 ubuntu-bionic-ovh-bhs1-0012956850 nova-compute[26306]: INFO nova.compute.manager [None req-8f552e87-036a-4cce-bd51-eedf4b103beb tempest-TestShelveInstance-1466639456 tempest-TestShelveInstance-1466639456] [instance: 5ffada24-45ad-44ee-87c7-d2b69ce3d0ba] Shelve offloading

And unshelved on the subnode:

https://zuul.opendev.org/t/openstack/build/1a340abeb89841409fd9940f5af31348/log/compute1/logs/screen-n-cpu.txt.gz#27280

Nov 21 04:16:30.333184 ubuntu-bionic-ovh-bhs1-0012956851 nova-compute[20130]: INFO nova.compute.manager [None req-f799e533-11c4-42a0-bb60-72af77f31062 tempest-TestShelveInstance-1466639456 tempest-TestShelveInstance-1466639456] [instance: 5ffada24-45ad-44ee-87c7-d2b69ce3d0ba] Unshelving

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh right, since this is a volume-backed server there is no snapshot created when we shelve offload the server.

Revision history for this message
Matt Riedemann (mriedem) wrote :

When the server is unshelved on the subnode, os-brick does connect the root volume to sda:

Nov 21 04:16:36.454669 ubuntu-bionic-ovh-bhs1-0012956851 nova-compute[20130]: DEBUG os_brick.initiator.connectors.iscsi [None req-f799e533-11c4-42a0-bb60-72af77f31062 tempest-TestShelveInstance-1466639456 tempest-TestShelveInstance-1466639456] Connected to sda using {'target_discovered': False, 'target_portal': '158.69.75.166:3260', 'target_iqn': 'iqn.2010-10.org.openstack:volume-460e5933-41f1-49d5-af98-e370f17c9b2c', 'target_lun': 1, 'volume_id': '460e5933-41f1-49d5-af98-e370f17c9b2c', 'auth_method': 'CHAP', 'auth_username': 'aHxieGHRaV2A4sobCcX3', 'auth_password': '***', 'encrypted': False, 'qos_specs': None, 'access_mode': 'rw'} {{(pid=20130) _connect_vol /usr/local/lib/python3.6/dist-packages/os_brick/initiator/connectors/iscsi.py:667}}

Revision history for this message
Matt Riedemann (mriedem) wrote :

I see that a config drive isn't being used in the nova-next job:

Nov 21 03:42:29.864665 ubuntu-bionic-ovh-bhs1-0012956850 nova-compute[26306]: DEBUG oslo_service.service [None req-9684e5dd-2bc1-438c-9435-911dba583a9c None None] force_config_drive = False {{(pid=26306) log_opt_values /usr/local/lib/python3.6/dist-packages/oslo_config/cfg.py:2577}}

That might have something to do with the guest setup failures, i.e. the guest's requests to the metadata service might be timing out?

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.opendev.org/695431 is a nova change to make the nova-next job force the use of config drive to take the metadata API server out of the mix in case that is causing issues between the subnode and the primary during guest creation.
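
For reference, what that boils down to on the nova side is flipping this option (shown here as it would appear in nova.conf; the actual change wires it through the job configuration):

    [DEFAULT]
    # Build a config drive for every instance so guest setup does not depend
    # on reaching the metadata API over the network.
    force_config_drive = True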

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Martin Kopec (mkopec) wrote :
Revision history for this message
Paras Babbar (pbabbar) wrote :

Martin, yeah, I would say we should close this bug as fixed.

Revision history for this message
Martin Kopec (mkopec) wrote :

Based on the previous comments, moving the state to Fix Released.

Changed in tempest:
status: New → Fix Released