SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

Bug #1349617 reported by Matt Riedemann
This bug affects 3 people
Affects                    Status        Importance  Assigned to        Milestone
OpenStack Compute (nova)   Fix Released  Undecided   Augustina Ragwitz
grenade                    Invalid       Undecided   Unassigned
neutron                    Incomplete    Undecided   Unassigned
tempest                    Fix Released  Critical    Matthew Treinish

Bug Description

Tags: network
Revision history for this message
Matt Riedemann (mriedem) wrote :

Huh, I see this in the n-net logs:

2014-07-27 20:11:47.967 DEBUG nova.network.manager [req-802f7e4b-3989-4343-94d0-849cefdb64aa TestVolumeBootPattern-32554776 TestVolumeBootPattern-422744072] [instance: 5ba6082f-5742-447a-9d56-bb52ae8634fb] Allocated fixed ip None on network 27dd907f-ec5f-4e9e-b369-a5a3b6bd13fa allocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:925

Notice the None, that seems odd...

I do see this later:

2014-07-27 20:12:16.240 DEBUG nova.network.manager [req-94127694-71f3-46d2-a62c-118a4d1556cb TestVolumeBootPattern-32554776 TestVolumeBootPattern-422744072] [instance: 5ba6082f-5742-447a-9d56-bb52ae8634fb] Network deallocation for instance deallocate_for_instance /opt/stack/new/nova/nova/network/manager.py:561
2014-07-27 20:12:16.279 DEBUG nova.network.manager [req-94127694-71f3-46d2-a62c-118a4d1556cb TestVolumeBootPattern-32554776 TestVolumeBootPattern-422744072] [instance: 5ba6082f-5742-447a-9d56-bb52ae8634fb] Deallocate fixed ip 10.1.0.3 deallocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:946

So when was the fixed IP actually allocated, or is that just a logging bug?
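
For illustration only (this is not nova's code): a log line like the one above can print None when the debug statement runs on a code path where the address variable has not been filled in yet, even though the allocation itself succeeds shortly after. A minimal standalone sketch of that kind of logging bug:

-------------
import logging

logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger(__name__)


def _pick_address_from_pool(network):
    # Stand-in for whatever actually chooses the address (e.g. a DB call).
    return "10.1.0.3"


def allocate_fixed_ip(network, address=None):
    # The debug line fires before 'address' is populated on this path, so it
    # prints "Allocated fixed ip None ..." even though the allocation itself
    # completes a moment later.
    LOG.debug("Allocated fixed ip %s on network %s", address, network)
    if address is None:
        address = _pick_address_from_pool(network)
    return address


allocate_fixed_ip("27dd907f-ec5f-4e9e-b369-a5a3b6bd13fa")
-------------

Whether nova's allocate_fixed_ip does something equivalent is exactly what the extra trace logging proposed below was added to find out.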

tags: added: network
Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe bug 1349590 is related; that's a nova-network issue with floating IPs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/110384

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: test_volume_boot_pattern fails in grenade with "SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer"

Related change to nova to add more trace logging to the allocate_fixed_ip method in NetworkManager:

https://review.openstack.org/#/c/110384/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/110384
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a4c580ff03f4abb03970dd6de315ca0ba6849617
Submitter: Jenkins
Branch: master

commit a4c580ff03f4abb03970dd6de315ca0ba6849617
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 29 10:18:13 2014 -0700

    Add trace logging to allocate_fixed_ip

    The address is being logged as None in some cases
    that are failing in grenade jobs so this adds more
    trace logging to the base network manager's
    allocate_fixed_ip method so we can see which paths
    are being taken in the code and what the outputs
    are.

    Change-Id: I37de4b3bbb9e51b57eb4d048e05fc00382eed23d
    Related-Bug: #1349617

Revision history for this message
jiang, yunhong (yunhong-jiang) wrote : Re: test_volume_boot_pattern fails in grenade with "SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer"

I hit a similar issue, but a bit different, in http://logs.openstack.org/53/76053/16/check/check-grenade-dsvm-partial-ncpu/5a53b07/console.html#_2014-08-18_16_36_31_962 . Sometimes it fails to connect, sometimes it fails to read the banner.

2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,400 8863 INFO [tempest.common.ssh] Creating ssh connection to '172.24.4.1' as 'cirros' with public key authentication
2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,412 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,589 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,591 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 1. Retry after 2 seconds.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:06,101 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.962 | 2014-08-18 16:33:06,273 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:06,276 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 2. Retry after 3 seconds.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:09,786 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.962 | 2014-08-18 16:33:09,961 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:09,963 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 3. Retry after 4 seconds.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:14,475 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.963 | 2014-08-18 16:33:14,645 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:14,649 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 4. Retry after 5 seconds.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:20,161 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.963 | 2014-08-18 16:33:20,331 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:20,335 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 5. Retry after 6 seconds.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:26,847 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.963 | 2014-08-18 16:33:27,018 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.964 | 2014-08-18 16:33:27,020 8863 WARNING [tem...

Changed in neutron:
importance: Undecided → High
assignee: nobody → Salvatore Orlando (salvatore-orlando)
importance: High → Critical
milestone: none → juno-3
Changed in neutron:
importance: Critical → High
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I noticed a rather high correlation between console-log messages pertaining to cirros data sources and this kind of failure.

The messages are:
EC2 Metadata: "/run/cirros/datasource/data/user-data was not '#!' or executable" (seen on neutron PG jobs)
Config drive: "no userdata for datasource"

I've tried to nail down the correlation with the query [1]. Despite being horrible, I think it does a good job of showing that when this kind of SSH error occurs, the cirros data source message appears as well. Note that the query also captures grenade-partial-ncpu failures.

This seems to point to a metadata problem. It could be a race, but I have no other evidence so far.
It is possible the metadata server provides invalid user data (SSH keys in this case) to the instance; similarly, during spawn the wrong info would be put in the config drive. However, I've not yet found anything to support this hypothesis.
The clues we have so far point at SSH keys, as the connection is successfully established (and the ping test passes too). However, there are authentication failures in the log [2]. The SSH protocol banner error probably appears only once the cirros instance stops accepting ssh connection requests.

One weird thing to note is that [3] was pushed to revert a change which introduced this exact same issue, and the reverted change, at the end of the day, just added some delay to the tests.

[1] http://logstash.openstack.org/#eyJzZWFyY2giOiIoKG1lc3NhZ2U6XCIvcnVuL2NpcnJvcy9kYXRhc291cmNlL2RhdGEvdXNlci1kYXRhIHdhcyBub3QgJyMhJyBvciBleGVjdXRhYmxlXCIgT1IgbWVzc2FnZTpcIm5vIHVzZXJkYXRhIGZvciBkYXRhc291cmNlXCIpICBBTkQgTk9UIG1lc3NhZ2U6XCJSRVNQIEJPRFlcIiAgQU5EIE5PVCBidWlsZF9uYW1lOmNoZWNrLXRlbXBlc3QtZHN2bS1uZXV0cm9uLWR2ciBBTkQgTk9UIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1pY2Vob3VzZVwiIEFORCBOT1QgYnVpbGRfbmFtZTpcImNoZWNrLXRlbXBlc3QtZHN2bS1uZXV0cm9uLWZ1bGwtaWNlaG91c2VcIiBBTkQgdGFnczpjb25zb2xlKSBPUiAobWVzc2FnZTpcIlRSQUNFXCIgQU5EIG1lc3NhZ2U6XCJTU0hFeGNlcHRpb246IEVycm9yIHJlYWRpbmcgU1NIIHByb3RvY29sIGJhbm5lclwiIEFORCB0YWdzOlwiY29uc29sZVwiKSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwODkyMzMyODM0OSwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
[2] http://logs.openstack.org/08/111008/12/check/check-grenade-dsvm-partial-ncpu/cc59cae/console.html#_2014-08-24_11_20_41_834
[3] https://review.openstack.org/#/c/97245/

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Just noticed similar SSH timeouts with the "check-grenade-dsvm-partial-ncpu" test job [1] from the test 'tempest/scenario/test_snapshot_pattern.py':

-------------
.
.
2014-08-27 08:28:47.776 | 2014-08-27 08:28:41,120 9490 INFO [tempest.common.debug] Host ns list[]
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41,121 9490 ERROR [tempest.scenario.test_snapshot_pattern] Initializing SSH connection failed
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern Traceback (most recent call last):
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/scenario/test_snapshot_pattern.py", line 52, in _ssh_to_server
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern return self.get_remote_client(server_or_ip)
2014-08-27 08:28:47.778 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/scenario/manager.py", line 332, in get_remote_client
2014-08-27 08:28:47.778 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern linux_client.validate_authentication()
2014-08-27 08:28:47.778 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/common/utils/linux/remote_client.py", line 54, in validate_authentication
2014-08-27 08:28:47.779 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern self.ssh_client.test_connection_auth()
2014-08-27 08:28:47.779 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/common/ssh.py", line 151, in test_connection_auth
2014-08-27 08:28:47.779 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern connection = self._get_ssh_connection()
2014-08-27 08:28:47.780 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/common/ssh.py", line 88, in _get_ssh_connection
2014-08-27 08:28:47.780 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern password=self.password)
2014-08-27 08:28:47.780 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern SSHTimeout: Connection to the 172.24.4.1 via SSH timed out.
2014-08-27 08:28:47.781 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern User: cirros, Password: None
2014-08-27 08:28:47.781 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern
.
.
-------------

  [1] http://logs.openstack.org/04/117104/2/check/check-grenade-dsvm-partial-ncpu/d3829fe/console.html

Revision history for this message
melanie witt (melwitt) wrote :

Here is what I’ve gathered so far. I looked through a few failed builds and focused on one [0] that uses the metadata service rather than config drive as it gives more clues.

1. The messages about “userdata” in the guest console don’t seem related to the failure i.e. the guest console only shows up in the logs if the build fails. I think it always says "/run/cirros/datasource/data/user-data was not '#!' or executable" or “no userdata for datasource" if no “userdata” is being used, and none is. The ssh keys are part of the metadata in these tests, not the userdata portion of the metadata.

2. In the metadata service log [1], there are zero calls to e.g. "GET /2009-04-04/meta-data/user-data HTTP/1.1" further supporting no userdata relationship.

3. SSH keys are added to the metadata in nova/api/metadata.py by nova itself, so it appears unlikely there is anything wrong there, or at least I didn't see anything unusual. The key is created by a POST to nova [2] and nova creates the key. The key content then appears several times in the log messages of the metadata service (it seems fine, uncorrupted).

4. The error "Exception: Error reading SSH protocol banner[Errno 104] Connection reset by peer" implies a corruption of some kind (given that communication doesn't otherwise seem to be a problem; there's a route) -- this seems consistent with too low an MTU and data occasionally getting truncated. In the log [3], the attempt to connect begins with connection refused (before sshd starts), then changes to authentication failure (likely before the guest has tried to pull the key from the metadata service), then changes to the ssh protocol banner read error. Which sounds like the key was retrieved but it's corrupted (truncated?).

5. Web search for the same error yielded others having problems with mtu setting in the guest, where they can ping but not ssh with key pair, openstack [4] and cirros [5].

Is it at all possible that there's an issue with the MTU of the guest sometimes? It would explain the randomness and the protocol banner errors, if data is getting truncated sometimes. I'm not sure where to go from here; I didn't think anything like this would show up in the guest kernel logs.

[0] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83
[1] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/logs/screen-q-meta.txt.gz
[2] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/console.html#_2014-08-28_18_39_33_546
[3] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/console.html#_2014-08-28_18_39_33_659
[4] https://ask.openstack.org/en/question/32958/unable-to-ssh-with-key-pair/
[5] https://bugs.launchpad.net/cirros/+bug/1301958

summary: - test_volume_boot_pattern fails in grenade with "SSHException: Error
- reading SSH protocol banner[Errno 104] Connection reset by peer"
+ test_volume_boot_pattern fails with "SSHException: Error reading SSH
+ protocol banner[Errno 104] Connection reset by peer"
Joe Gordon (jogo)
Changed in nova:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote : Re: test_volume_boot_pattern fails with "SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer"

I think we should focus on two aspects:
1) Ping works, otherwise we wouldn't get to the SSH test.
2) SSH connections always show authentication failures before 'SSH protocol banner' errors.

I don't know about the MTU possibility, but I wouldn't expect it to happen on single host tests.

Revision history for this message
melanie witt (melwitt) wrote :

I was thinking maybe the auth failure might happen before the guest reads the public key from metadata; then, after it reads a corrupted key, it keeps sending back truncated or otherwise invalid data in response to the SSH connection request. I read more about the paramiko error "Error reading SSH protocol banner[Errno 104]" and it can also mean the remote host didn't send a banner at all (not responding at all, like Salvatore mentioned in comment #10).

I combed the logs some more and didn't find anything useful, so I'm now going to try to reproduce the issue locally using devstack. I'd like to see the logs inside the guest (sshd logs, etc.) after this happens. Which makes me wonder if we could add something to tempest to mount the guest disk when an ssh failure like this happens and capture some of the guest logs for debugging.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Melanie,

We have been discussing this issue in openstack-qa.
Since we too have been unable to find any evidence of issues with user data, we're going to validate the MTU hypothesis you made.

I'm going to push a patch in the gate to match the MTU to cirros'.
On the other hand, a newly patched cirros build with the fix for the bug you pointed out will be released soon.

Revision history for this message
melanie witt (melwitt) wrote :

Salvatore,

Okay. I agree MTU seems unlikely to be the issue but I'm glad if we can rule it out for sure.

Do you think we could do a verbose ssh in the tempest test (like ssh -vvv) to see the details of the exchange when the failure happens?

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I don't think paramiko allows us to do that, and bypassing paramiko in tempest is too much code churn, I think.

I will try to reproduce in a local environment. It should not be too hard, as I can also intercept this failure on the VMware NSX-CI.

Revision history for this message
melanie witt (melwitt) wrote :

@Salvatore,

It looks like paramiko.util.log_to_file(filename, level=10) [0] could capture the verbose information -- it might be nice if we can have it as a separate log file, e.g. "paramiko.txt", that goes alongside syslog.txt etc. The default log level 10 is debug.

[0] http://www.lag.net/paramiko/docs/paramiko.util-module.html#log_to_file
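
A minimal, standalone sketch of how that hook could be used (this is not the tempest patch; the host, user and key path below are placeholders):

-------------
import paramiko

# Send paramiko's transport-level logging (level 10 = DEBUG) to its own file,
# e.g. a "paramiko.txt" artifact that could sit alongside syslog.txt.
paramiko.util.log_to_file('paramiko.txt', level=10)

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    # The banner exchange and every authentication attempt are now traced in
    # paramiko.txt, which is the detail missing from the gate logs so far.
    client.connect('172.24.4.1', username='cirros',
                   key_filename='/path/to/test_keypair.pem', timeout=10)
finally:
    client.close()
-------------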

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Thanks for the pointer, melanie. I'll see locally first how hard it would be and whether it requires changes on the infrastructure side. This is debugging info worth having (unlike the pedantic namespace info we dump, which I never find useful).

Revision history for this message
melanie witt (melwitt) wrote :

Cool. :) I'm trying some things locally in tempest too to see what happens when I call the log_to_file function. If I get something working in tempest, I'll put up a patch (if you haven't already found a way).

summary: - test_volume_boot_pattern fails with "SSHException: Error reading SSH
- protocol banner[Errno 104] Connection reset by peer"
+ SSHException: Error reading SSH protocol banner[Errno 104] Connection
+ reset by peer
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-3 → juno-rc1
Revision history for this message
melanie witt (melwitt) wrote :

FYI I have just put up this patch to send paramiko logs to file during tempest runs:

https://review.openstack.org/#/c/118946

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Thanks melanie - that's good stuff to have.
I have a few local repro environments where I'm running a tweaked tempest that will not destroy the VM to which the SSH connection failed.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I reproduced the failure and I can confirm there is no authorized_keys file in the failing instance.
To reproduce the failure it is sufficient to start an instance with 4 cores and 8GB of memory, launch devstack with a localrc very similar to that of the full neutron test, and then keep running scenario tests.

A tweak for not removing the instance where ssh fails helps a lot: http://paste.openstack.org/show/105982/

Revision history for this message
melanie witt (melwitt) wrote :

Awesome Salvatore, thanks for sharing that patch.

So it's running the latest Cirros 0.3.2 which I see fixed some bugs related to getting metadata [1]. Do you see anything interesting in /var/log/cloud-init.log in the VM?

[1] https://launchpad.net/cirros/trunk/0.3.2

Changed in tempest:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

So this is what I found out.

Instance log from a failing instance [1]. The important bit there is "cirros-apply-local already run per instance", and not "no userdata for datasource" as initially thought. That was just me being stupid and thinking the public key was part of user data. That was really silly.

"cirros-apply-local already run per instance" seems to appear in the console log for all SSH protocol banner failures [2]. the presence of duplicates makes it difficult to prove correlation with SSH protocol banner failures.
However, they key here is that local testing revealing that when the SSH connection fails there is no authorized_keys file in /home/cirros/.ssh. This obviously explains the authentication failure. Whether the subsequent SSH protocol banner errors are due to the cited MTU problems or else it has to be clarified yet.
What is certain is that cirros processes the data source containing the public SSH key before starting sshd. So the auth failures cannot be due to the init process not being yet complete.

The cirros initialization process executes a set of steps on a per-instance basis. These steps include setting public SSH keys.
"On a per-instance basis" means that these steps are not executed at each boot but once per instance.

cirros-apply local [3] is the step which processes, among other things, SSH public keys.
It is called by the cirros-per script [4], which at the end of its execution writes a marker file [5]. The cirros-per process will terminate if, when executed, the marker file is already present [6].

During the failing test it has been observed the following:

from the console log:
[ 3.696172] rtc_cmos 00:01: setting system clock to 2014-09-04 19:05:27 UTC (1409857527)

from the cirros-apply marker directory:
$ ls -le /var/lib/cirros/sem/
total 3
-rw-r--r-- 1 root root 35 Thu Sep 4 13:06:28 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.check-version
-rw-r--r-- 1 root root 22 Thu Sep 4 13:05:07 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.cirros-apply-local
-rw-r--r-- 1 root root 24 Thu Sep 4 13:06:31 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.userdata

As cirros defaults to MDT (UTC-6), this means the apply-local marker was applied BEFORE this instance boot.
This is consistent with the situation we're seeing, where the failure always occurs after events such as resize or stop.
The SSH public key should be applied during the first boot of the VM. When it's restarted the process is skipped, as the key should already be there. Unfortunately the key isn't there, which is a bit of a mystery, especially since the instance is powered off in a graceful way thanks to [7].

Nevertheless, when an instance receives a shutdown signal it sends a TERM signal to all processes, meaning that the apply-local step spawned by cirros-per at [4] can be killed before it actually writes the key.
However, even though cirros-per retrieves the return code, it writes the marker in any case [5].
This creates the conditions for a situation where the marker can be present without having actually completed the apply-local phase. As a result it is possible to have guests without SSH ...
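
A small Python model of the run-once-per-instance mechanism described above (the real cirros-per and cirros-apply are shell scripts; the paths and names here only mirror the ls output above), showing how the marker can end up present even though the step never finished:

-------------
import os
import subprocess

SEM_DIR = "/var/lib/cirros/sem"  # marker directory from the ls output above


def cirros_per(instance_id, step_name, command):
    """Run 'command' at most once per instance, as cirros-per does."""
    marker = os.path.join(SEM_DIR, "instance.%s.%s" % (instance_id, step_name))
    if os.path.exists(marker):
        # Marker already present: the step is skipped on every later boot.
        return
    rc = subprocess.call(command)
    # The condition described above: the marker is written regardless of rc,
    # and regardless of whether 'command' was killed by the shutdown's TERM
    # signal, so an interrupted cirros-apply-local is never retried and
    # authorized_keys may never be written.
    with open(marker, "w") as f:
        f.write("return code: %d\n" % rc)


# e.g. cirros_per("197ce1ac-...", "cirros-apply-local", ["cirros-apply", "local"])
-------------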


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/119267

Changed in tempest:
assignee: Salvatore Orlando (salvatore-orlando) → Joe Gordon (jogo)
status: New → In Progress
assignee: Joe Gordon (jogo) → Matthew Treinish (treinish)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/119268

Changed in neutron:
status: New → Incomplete
Changed in nova:
status: Confirmed → Incomplete
Changed in grenade:
status: New → Incomplete
Changed in tempest:
assignee: Matthew Treinish (treinish) → Joe Gordon (jogo)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/119267
Reason: duplicate of https://review.openstack.org/#/c/119268

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/119268
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=cd879c5287f4c260b1ec29e593dcad3efcfe5af7
Submitter: Jenkins
Branch: master

commit cd879c5287f4c260b1ec29e593dcad3efcfe5af7
Author: Matthew Treinish <email address hidden>
Date: Thu Sep 4 20:41:48 2014 -0400

    Verify network connectivity before state check

    This commit adds an initial ssh connection after bringing a server up
    in setUp. This should ensure that the image has a chance to initialize
    prior to messing with it's state. The test's here are to verify that
    after performing a nova operation on a running instance network
    connectivity is retained. However, it's is never checked that we can
    connect to the server in the first place. A probable cause for the
    constant ssh failures in these tests is that the server hasn't had a
    finish it's cloud-init (or cirros-init) stage when we're stopping it,
    this should also fix those issues.

    Change-Id: I126fd4943582c4b759b3cc5a67babaa8d062fb4d
    Partial-Bug: #1349617
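
For readers skimming the thread, a hedged sketch of the pattern this change introduces (illustrative only, not the merged diff; the helper names are simplified versions of the ones seen in the tracebacks above):

-------------
def boot_and_verify_ssh(scenario_test):
    """Bring up a server and prove SSH auth works before any state change.

    'scenario_test' stands in for a tempest scenario test instance; the helper
    names mirror get_remote_client / validate_authentication from the
    tracebacks earlier in this bug, but this is a sketch, not the real patch.
    """
    server = scenario_test.create_server(wait_until='ACTIVE')
    fip = scenario_test.create_floating_ip(server)
    # If the guest has not finished its cirros-init/cloud-init stage yet, the
    # failure shows up here, instead of after a stop/resize/snapshot where it
    # used to look like the operation itself broke networking.
    client = scenario_test.get_remote_client(fip['ip'])
    client.validate_authentication()
    return server, fip
-------------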

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

No failures in neutron jobs since the patch merged (11 hours now).
3 failures in grenade-partial-ncpu (in gate).
The patch was not expected to fix the grenade job. If I'm not mistaken this job runs icehouse n-cpu on the 'new' part of grenade, and therefore the failure might occur because the instance is being abruptly shut down and then resumed.

Changed in neutron:
milestone: juno-rc1 → none
assignee: Salvatore Orlando (salvatore-orlando) → nobody
Revision history for this message
Baodong (Robert) Li (baoli) wrote : Default vnic_type, RE: https://bugs.launchpad.net/neutron/+bug/1370077

Hi Irena,

Do you remember why default vnic_type was not set in neutron when you were working on adding vnic_type into the port binding? Is there any reason not to do that? As you know, nova depends on this information to determine if sr-iov port should be allocated. Just want to check with you for the fix to 1370077.

Thanks,
Robert

Revision history for this message
Irena Berezovsky (irenab) wrote :

Hi Robert,
vnic_type was added to neutron to be used with ML2.
You can also see it in the blueprint description: https://blueprints.launchpad.net/neutron/+spec/ml2-request-vnic-type

I second Salvatore's suggestion to default nova to VNIC_NORMAL, if binding:vnic_type is not specified by neutron.
Cheers,
Irena


Revision history for this message
Baodong (Robert) Li (baoli) wrote :

Hi Irena,

I was thinking about doing it from Nova side as well. In that case, I will close 1370077 and create one from Nova side.

—Robert


Joe Gordon (jogo)
Changed in tempest:
assignee: Joe Gordon (jogo) → nobody
status: In Progress → New
Changed in nova:
milestone: none → juno-rc1
Revision history for this message
Joe Gordon (jogo) wrote :

Unclear if this is fixed or not; there was a single hit in the check queue on September 15th. No hits in the gate queue in over a week.

Changed in nova:
importance: Critical → Undecided
Revision history for this message
Joe Gordon (jogo) wrote :

It looks like https://review.openstack.org/#/c/119268/ may have fixed it.

Changed in tempest:
status: New → Confirmed
assignee: nobody → Matthew Treinish (treinish)
status: Confirmed → Fix Committed
Changed in nova:
milestone: juno-rc1 → none
Revision history for this message
Matthew Treinish (treinish) wrote :

affects: tempest
status: fixreleased

Changed in tempest:
importance: Undecided → Critical
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/137096

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/137096
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=1fd223e750048f8f39dea2f1b3fc6c73ff0b27d1
Submitter: Jenkins
Branch: master

commit 1fd223e750048f8f39dea2f1b3fc6c73ff0b27d1
Author: Matt Riedemann <email address hidden>
Date: Tue Nov 25 07:16:09 2014 -0800

    Skip test_volume_boot_pattern until bug 1373513 is fixed

    Between the races to delete a volume and hitting timeouts because things
    are hanging with lvm in Cinder and the various SSH timeouts, this test
    is a constant burden.

    The SSH problems have been around for a long time and don't seem to be
    getting any new attention.

    The Cinder volume delete hangs have also been around for awhile now and
    don't seem to be getting much serious attention, so until the Cinder
    volume delete hangs are fixed (or at least getting some serious
    attention), let's just skip this test scenario.

    Related-Bug: #1373513
    Related-Bug: #1370496
    Related-Bug: #1349617

    Change-Id: Idb50bcdbc9683d322e9292abf50404e885a11a8e

Revision history for this message
Nell Jerram (neil-jerram) wrote :

I'm seeing a problem that appears to map to this bug, and I'm unclear whether that's expected (i.e. because there are parts of this bug for which fixes have not yet propagated everywhere), or if my problem should be reported as new.

Specifically, in the check-tempest-dsvm-docker check for https://review.openstack.org/#/c/146914/, I'm seeing:

2015-01-13 21:38:10.693 | Traceback (most recent call last):
2015-01-13 21:38:10.693 | File "tempest/test.py", line 112, in wrapper
2015-01-13 21:38:10.693 | return f(self, *func_args, **func_kwargs)
2015-01-13 21:38:10.693 | File "tempest/scenario/test_snapshot_pattern.py", line 72, in test_snapshot_pattern
2015-01-13 21:38:10.693 | self._write_timestamp(fip_for_server['ip'])
2015-01-13 21:38:10.693 | File "tempest/scenario/test_snapshot_pattern.py", line 51, in _write_timestamp
2015-01-13 21:38:10.693 | ssh_client = self.get_remote_client(server_or_ip)
2015-01-13 21:38:10.693 | File "tempest/scenario/manager.py", line 317, in get_remote_client
2015-01-13 21:38:10.693 | linux_client.validate_authentication()
2015-01-13 21:38:10.694 | File "tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
2015-01-13 21:38:10.694 | self.ssh_client.test_connection_auth()
2015-01-13 21:38:10.694 | File "tempest/common/ssh.py", line 151, in test_connection_auth
2015-01-13 21:38:10.694 | connection = self._get_ssh_connection()
2015-01-13 21:38:10.694 | File "tempest/common/ssh.py", line 88, in _get_ssh_connection
2015-01-13 21:38:10.694 | password=self.password)
2015-01-13 21:38:10.694 | SSHTimeout: Connection to the 172.24.4.1 via SSH timed out.
2015-01-13 21:38:10.694 | User: cirros, Password: None

A search maps that symptom to https://bugs.launchpad.net/grenade/+bug/1362554, which is a duplicate of this one.

Please can you advise whether this is expected, or something new?

Thanks - Neil

Revision history for this message
Rick Chen (rick-chen) wrote :

I got the same issue in my OpenStack CI. Please advise, thanks.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Looking through the comments, I am unsure whether there really is a bug in cirros involved here, or whether the issue was only triggered by the instance being stopped too quickly during cloud-init.

Changed in cirros:
status: New → Incomplete
Changed in neutron:
importance: High → Undecided
Sean Dague (sdague)
Changed in grenade:
status: Incomplete → Invalid
Revision history for this message
Augustina Ragwitz (auggy) wrote :

Nova: Fix was released for related bug https://bugs.launchpad.net/nova/+bug/1532809

https://review.openstack.org/#/c/273042/

If this issue rears up again, please open a new bug for Nova.

Changed in nova:
status: Incomplete → Fix Released
assignee: nobody → Augustina Ragwitz (auggy)
Scott Moser (smoser)
no longer affects: cirros
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.
