SSH to guest sometimes fails publickey authentication: AuthenticationException: Authentication failed.

Bug #1911574 reported by melanie witt
This bug affects 1 person
Affects: OpenStack-Gate | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

Seen in the gate today in the tempest-slow job [1]:

2021-01-13 05:10:39.881333 | controller | 2021-01-13 05:10:23,697 8834 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.152 after 17 attempts
2021-01-13 05:10:39.881352 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2021-01-13 05:10:39.881370 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh File "tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2021-01-13 05:10:39.881392 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh sock=proxy_chan)
2021-01-13 05:10:39.881411 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh File "/opt/stack/tempest/.tox/tempest/local/lib/python2.7/site-packages/paramiko/client.py", line 424, in connect
2021-01-13 05:10:39.881429 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh passphrase,
2021-01-13 05:10:39.881447 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh File "/opt/stack/tempest/.tox/tempest/local/lib/python2.7/site-packages/paramiko/client.py", line 714, in _auth
2021-01-13 05:10:39.881465 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh raise saved_exception
2021-01-13 05:10:39.881483 | controller | 2021-01-13 05:10:23.697 8834 ERROR tempest.lib.common.ssh AuthenticationException: Authentication failed.
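
For context, the tempest helper wraps paramiko's connect call in a retry loop and only raises the saved exception after exhausting its attempts (17 here). A minimal sketch of that pattern, not the actual tempest code (the callable, attempt count, and delay below are illustrative):

```python
import time


class AuthenticationException(Exception):
    """Stand-in for paramiko.ssh_exception.AuthenticationException."""


def get_ssh_connection(connect_fn, max_attempts=17, delay=1.0):
    """Retry an SSH connect callable, re-raising the last error on failure.

    connect_fn is any zero-argument callable that either returns a
    connected client or raises AuthenticationException.
    """
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect_fn()
        except AuthenticationException as exc:
            last_exc = exc
            time.sleep(delay)  # back off before the next attempt
    # Mirrors the "Failed to establish authenticated ssh connection
    # ... after N attempts" log above: re-raise the saved exception
    # once all attempts are spent.
    raise last_exc
```

The point is that the guest had the whole retry window to come up with the right key and never did, so the root cause is inside the guest, not a transient SSH hiccup.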

Logstash query:

44 hits in the last 7 days, but only 3 unique changes. All hits are failures.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AuthenticationException%3A%20Authentication%20failed.%5C%22%20AND%20tags%3Aconsole&from=7d

There are a variety of different messages in the guest console output indicating why ssh auth ended up failing, depending on how far boot got. All of them involve a failure to retrieve data from the metadata service.

Here are some examples. First [2]:

2021-01-07 16:15:28.308383 | controller | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-07 16:15:28.308395 | controller | cirros-ds 'net' up at 11.93
2021-01-07 16:15:28.308405 | controller | checking http://169.254.169.254/2009-04-04/instance-id
2021-01-07 16:15:28.308414 | controller | successful after 1/20 tries: up 12.27. iid=i-00000082
2021-01-07 16:15:28.308423 | controller | failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys
2021-01-07 16:15:28.308432 | controller | warning: no ec2 metadata for public-keys
2021-01-07 16:15:28.308441 | controller | failed to get http://169.254.169.254/2009-04-04/user-data
2021-01-07 16:15:28.308450 | controller | warning: no ec2 metadata for user-data
2021-01-07 16:15:28.308460 | controller | found datasource (ec2, net)
2021-01-07 16:15:28.308469 | controller | Top of dropbear init script
2021-01-07 16:15:28.308478 | controller | Starting dropbear sshd: WARN: generating key of type ecdsa failed!

This shows the guest successfully getting the instance id from the metadata service, but then failing to get the public keys (which I think is what causes the ssh failure).

Second [3]:

2021-01-12 17:47:52.961860 | primary | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-12 17:47:52.961875 | primary | cirros-ds 'net' up at 9.04
2021-01-12 17:47:52.961890 | primary | checking http://169.254.169.254/2009-04-04/instance-id
2021-01-12 17:47:52.961904 | primary | failed 1/20: up 9.32. request failed
2021-01-12 17:47:52.961919 | primary | failed 2/20: up 11.83. request failed
2021-01-12 17:47:52.961933 | primary | failed 3/20: up 14.15. request failed
2021-01-12 17:47:52.961948 | primary | failed 4/20: up 16.43. request failed
2021-01-12 17:47:52.961963 | primary | failed 5/20: up 18.84. request failed
2021-01-12 17:47:52.961977 | primary | failed 6/20: up 21.30. request failed
2021-01-12 17:47:52.961992 | primary | failed 7/20: up 23.73. request failed
2021-01-12 17:47:52.962007 | primary | failed 8/20: up 26.06. request failed
2021-01-12 17:47:52.962021 | primary | failed 9/20: up 28.41. request failed
2021-01-12 17:47:52.962048 | primary | failed 10/20: up 30.77. request failed
2021-01-12 17:47:52.962065 | primary | failed 11/20: up 32.98. request failed
2021-01-12 17:47:52.962080 | primary | failed 12/20: up 35.33. request failed
2021-01-12 17:47:52.962095 | primary | failed 13/20: up 37.77. request failed
2021-01-12 17:47:52.962109 | primary | failed 14/20: up 40.18. request failed
2021-01-12 17:47:52.962124 | primary | failed 15/20: up 42.42. request failed
2021-01-12 17:47:52.962139 | primary | failed 16/20: up 44.71. request failed
2021-01-12 17:47:52.962153 | primary | failed 17/20: up 47.16. request failed
2021-01-12 17:47:52.962168 | primary | failed 18/20: up 49.63. request failed
2021-01-12 17:47:52.962183 | primary | failed 19/20: up 52.07. request failed
2021-01-12 17:47:52.962197 | primary | failed 20/20: up 54.46. request failed
2021-01-12 17:47:52.962212 | primary | failed to read iid from metadata. tried 20
2021-01-12 17:47:52.962227 | primary | no results found for mode=net. up 56.78. searched: nocloud configdrive ec2
2021-01-12 17:47:52.962241 | primary | failed to get instance-id of datasource
2021-01-12 17:47:52.962256 | primary | Top of dropbear init script
2021-01-12 17:47:52.962278 | primary | Starting dropbear sshd: failed to get instance-id of datasource
2021-01-12 17:47:52.962292 | primary | WARN: generating key of type ecdsa failed!

This shows the guest failing to get the instance id from the metadata service, after which it doesn't try to get the public keys at all.
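
The cirros ec2 datasource gives up after a fixed number of polls (20 above) and then boots with no datasource at all, so no key is ever installed. A rough sketch of that poll loop (the function name and error type here are illustrative, not the cirros init source):

```python
def read_iid(fetch, max_tries=20):
    """Poll the metadata instance-id endpoint, as the console log shows.

    fetch is any zero-argument callable that returns the instance-id
    string, or raises OSError when the HTTP request fails.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except OSError:
            # corresponds to "failed N/20: ... request failed"
            continue
    # corresponds to "failed to read iid from metadata. tried 20";
    # boot then continues with no datasource and no public key.
    return None
```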

Third [1]:

2021-01-13 05:10:39.886171 | controller | WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
2021-01-13 05:10:39.886185 | controller | cirros-ds 'net' up at 6.93
2021-01-13 05:10:39.886195 | controller | checking http://169.254.169.254/2009-04-04/instance-id
2021-01-13 05:10:39.886204 | controller | successful after 1/20 tries: up 7.25. iid=i-0000002c
2021-01-13 05:10:39.886213 | controller | failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key
2021-01-13 05:10:39.886223 | controller | warning: no ec2 metadata for public-keys
2021-01-13 05:10:39.886232 | controller | failed to get http://169.254.169.254/2009-04-04/user-data
2021-01-13 05:10:39.886248 | controller | warning: no ec2 metadata for user-data
2021-01-13 05:10:39.886258 | controller | found datasource (ec2, net)
2021-01-13 05:10:39.886267 | controller | Top of dropbear init script
2021-01-13 05:10:39.886276 | controller | Starting dropbear sshd: WARN: generating key of type ecdsa failed!

This shows the guest failing to get a specific public key from the metadata service.
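
Since all three variants end in the same tempest-side AuthenticationException, triage has to look at the guest console to tell them apart. A small, hypothetical helper for that (the patterns are taken from the logs above; the function itself is not part of any existing tool):

```python
def classify_console(console_text):
    """Map a cirros console log to one of the failure modes seen above."""
    if "failed to read iid from metadata" in console_text:
        # second example: metadata service never answered at all
        return "no-instance-id"
    if "warning: no ec2 metadata for public-keys" in console_text:
        # first/third examples: instance-id ok, public key fetch failed
        return "no-public-key"
    return "unknown"
```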

In all of these jobs, force_config_drive is not set in nova-cpu_conf.txt (it defaults to False), which is why the guest goes to the metadata service for its data.
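
For reference, forcing the config drive would let these guests read the instance id and keys from an attached drive instead of over the network; whether that is desirable for these jobs is a separate question. The relevant nova option (shown here set to the non-default value):

```ini
[DEFAULT]
# When True, nova always attaches a config drive, so the guest's init
# can read instance-id and public keys locally rather than polling
# the metadata service at 169.254.169.254. Defaults to False.
force_config_drive = True
```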

[1] https://zuul.opendev.org/t/openstack/build/9c4c5ca5e99a4becb2f24762972bc8a0/log/job-output.txt#73303
[2] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b18/767107/4/gate/tempest-full/b18c480/job-output.txt
[3] https://ea3a162b13104dc2d966-37d30e8ca4786c547186bac4ea4dd159.ssl.cf5.rackcdn.com/770429/1/check/networking-ovn-dsvm-grenade/f35752c/job-output.txt
