SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

Bug #1349617 reported by Matt Riedemann
This bug affects 3 people
Affects                    Status        Importance  Assigned to        Milestone
OpenStack Compute (nova)   Fix Released  Undecided   Augustina Ragwitz
grenade                    Invalid       Undecided   Unassigned
neutron                    Incomplete    Undecided   Unassigned
tempest                    Fix Released  Critical    Matthew Treinish

Bug Description

Tags: network
Revision history for this message
Matt Riedemann (mriedem) wrote :

Huh, I see this in the n-net logs:

2014-07-27 20:11:47.967 DEBUG nova.network.manager [req-802f7e4b-3989-4343-94d0-849cefdb64aa TestVolumeBootPattern-32554776 TestVolumeBootPattern-422744072] [instance: 5ba6082f-5742-447a-9d56-bb52ae8634fb] Allocated fixed ip None on network 27dd907f-ec5f-4e9e-b369-a5a3b6bd13fa allocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:925

Notice the None, that seems odd...

I do see this later:

2014-07-27 20:12:16.240 DEBUG nova.network.manager [req-94127694-71f3-46d2-a62c-118a4d1556cb TestVolumeBootPattern-32554776 TestVolumeBootPattern-422744072] [instance: 5ba6082f-5742-447a-9d56-bb52ae8634fb] Network deallocation for instance deallocate_for_instance /opt/stack/new/nova/nova/network/manager.py:561
2014-07-27 20:12:16.279 DEBUG nova.network.manager [req-94127694-71f3-46d2-a62c-118a4d1556cb TestVolumeBootPattern-32554776 TestVolumeBootPattern-422744072] [instance: 5ba6082f-5742-447a-9d56-bb52ae8634fb] Deallocate fixed ip 10.1.0.3 deallocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:946

So when was the fixed IP actually allocated, or is that just a logging bug?
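
For illustration only (this is not nova's code): a log line like the one above can print None when the debug statement runs on a code path where the address variable has not been filled in yet, even though the allocation itself succeeds shortly after. A minimal standalone sketch of that kind of logging bug:

-------------
import logging

logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger(__name__)


def _pick_address_from_pool(network):
    # Stand-in for whatever actually chooses the address (e.g. a DB call).
    return "10.1.0.3"


def allocate_fixed_ip(network, address=None):
    # The debug line fires before 'address' is populated on this path, so it
    # prints "Allocated fixed ip None ..." even though the allocation itself
    # completes a moment later.
    LOG.debug("Allocated fixed ip %s on network %s", address, network)
    if address is None:
        address = _pick_address_from_pool(network)
    return address


allocate_fixed_ip("27dd907f-ec5f-4e9e-b369-a5a3b6bd13fa")
-------------

Whether nova's allocate_fixed_ip does something equivalent is exactly what the extra trace logging proposed below was added to find out.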

tags: added: network
Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe bug 1349590 is related; that's a nova-network issue with floating IPs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/110384

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: test_volume_boot_pattern fails in grenade with "SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer"

Related change to nova to add more trace logging to the allocate_fixed_ip method in NetworkManager:

https://review.openstack.org/#/c/110384/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/110384
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a4c580ff03f4abb03970dd6de315ca0ba6849617
Submitter: Jenkins
Branch: master

commit a4c580ff03f4abb03970dd6de315ca0ba6849617
Author: Matt Riedemann <email address hidden>
Date: Tue Jul 29 10:18:13 2014 -0700

    Add trace logging to allocate_fixed_ip

    The address is being logged as None in some cases
    that are failing in grenade jobs so this adds more
    trace logging to the base network manager's
    allocate_fixed_ip method so we can see which paths
    are being taken in the code and what the outputs
    are.

    Change-Id: I37de4b3bbb9e51b57eb4d048e05fc00382eed23d
    Related-Bug: #1349617

Revision history for this message
jiang, yunhong (yunhong-jiang) wrote : Re: test_volume_boot_pattern fails in grenade with "SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer"

I hit a similar issue, but a bit different, in http://logs.openstack.org/53/76053/16/check/check-grenade-dsvm-partial-ncpu/5a53b07/console.html#_2014-08-18_16_36_31_962 . Sometimes it fails to connect, sometimes it fails to read the banner.

2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,400 8863 INFO [tempest.common.ssh] Creating ssh connection to '172.24.4.1' as 'cirros' with public key authentication
2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,412 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,589 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:03,591 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 1. Retry after 2 seconds.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:06,101 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.962 | 2014-08-18 16:33:06,273 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:06,276 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 2. Retry after 3 seconds.
2014-08-18 16:36:31.962 | 2014-08-18 16:33:09,786 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.962 | 2014-08-18 16:33:09,961 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:09,963 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 3. Retry after 4 seconds.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:14,475 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.963 | 2014-08-18 16:33:14,645 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:14,649 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 4. Retry after 5 seconds.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:20,161 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.963 | 2014-08-18 16:33:20,331 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:20,335 8863 WARNING [tempest.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.4.1 (Authentication failed.). Number attempts: 5. Retry after 6 seconds.
2014-08-18 16:36:31.963 | 2014-08-18 16:33:26,847 8863 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_6.6.1p1)
2014-08-18 16:36:31.963 | 2014-08-18 16:33:27,018 8863 INFO [paramiko.transport] Authentication (publickey) failed.
2014-08-18 16:36:31.964 | 2014-08-18 16:33:27,020 8863 WARNING [tem...

Changed in neutron:
importance: Undecided → High
assignee: nobody → Salvatore Orlando (salvatore-orlando)
importance: High → Critical
milestone: none → juno-3
Changed in neutron:
importance: Critical → High
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I noticed a rather high correlation between console-log messages pertaining to cirros data sources and this kind of failure.

The messages are:
EC2 Metadata: "/run/cirros/datasource/data/user-data was not '#!' or executable" (seen on neutron PG jobs)
Config drive: "no userdata for datasource"

I've tried to nail down the correlation with the query [1]. Despite being horrible, I think it does a good job of showing that when this kind of SSH error occurs, the cirros data source message appears as well. Note that the query also captures grenade-partial-ncpu failures.

This seems to point to a metadata problem. It could be a race, but I have no other evidence so far.
It is possible the metadata server provides invalid user data (SSH keys in this case) to the instance; similarly, during spawn the wrong info would be put in the config drive. However, I've not yet found anything to support this hypothesis.
The clues we have so far point at SSH keys, as the connection is successfully established (and the ping test passes too). However, there are authentication failures in the log [2]. The SSH protocol banner error probably appears only once the cirros instance stops accepting ssh connection requests.

One weird thing to note is that [3] was pushed to revert a change which introduced this exact same issue, and the reverted change, at the end of the day, just added some delay to the tests.

[1] http://logstash.openstack.org/#eyJzZWFyY2giOiIoKG1lc3NhZ2U6XCIvcnVuL2NpcnJvcy9kYXRhc291cmNlL2RhdGEvdXNlci1kYXRhIHdhcyBub3QgJyMhJyBvciBleGVjdXRhYmxlXCIgT1IgbWVzc2FnZTpcIm5vIHVzZXJkYXRhIGZvciBkYXRhc291cmNlXCIpICBBTkQgTk9UIG1lc3NhZ2U6XCJSRVNQIEJPRFlcIiAgQU5EIE5PVCBidWlsZF9uYW1lOmNoZWNrLXRlbXBlc3QtZHN2bS1uZXV0cm9uLWR2ciBBTkQgTk9UIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1pY2Vob3VzZVwiIEFORCBOT1QgYnVpbGRfbmFtZTpcImNoZWNrLXRlbXBlc3QtZHN2bS1uZXV0cm9uLWZ1bGwtaWNlaG91c2VcIiBBTkQgdGFnczpjb25zb2xlKSBPUiAobWVzc2FnZTpcIlRSQUNFXCIgQU5EIG1lc3NhZ2U6XCJTU0hFeGNlcHRpb246IEVycm9yIHJlYWRpbmcgU1NIIHByb3RvY29sIGJhbm5lclwiIEFORCB0YWdzOlwiY29uc29sZVwiKSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwODkyMzMyODM0OSwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
[2] http://logs.openstack.org/08/111008/12/check/check-grenade-dsvm-partial-ncpu/cc59cae/console.html#_2014-08-24_11_20_41_834
[3] https://review.openstack.org/#/c/97245/

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Just noticed similar SSH timeouts with the "check-grenade-dsvm-partial-ncpu" test job [1] from the test 'tempest/scenario/test_snapshot_pattern.py':

-------------
.
.
2014-08-27 08:28:47.776 | 2014-08-27 08:28:41,120 9490 INFO [tempest.common.debug] Host ns list[]
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41,121 9490 ERROR [tempest.scenario.test_snapshot_pattern] Initializing SSH connection failed
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern Traceback (most recent call last):
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/scenario/test_snapshot_pattern.py", line 52, in _ssh_to_server
2014-08-27 08:28:47.777 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern return self.get_remote_client(server_or_ip)
2014-08-27 08:28:47.778 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/scenario/manager.py", line 332, in get_remote_client
2014-08-27 08:28:47.778 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern linux_client.validate_authentication()
2014-08-27 08:28:47.778 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/common/utils/linux/remote_client.py", line 54, in validate_authentication
2014-08-27 08:28:47.779 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern self.ssh_client.test_connection_auth()
2014-08-27 08:28:47.779 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/common/ssh.py", line 151, in test_connection_auth
2014-08-27 08:28:47.779 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern connection = self._get_ssh_connection()
2014-08-27 08:28:47.780 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern File "tempest/common/ssh.py", line 88, in _get_ssh_connection
2014-08-27 08:28:47.780 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern password=self.password)
2014-08-27 08:28:47.780 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern SSHTimeout: Connection to the 172.24.4.1 via SSH timed out.
2014-08-27 08:28:47.781 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern User: cirros, Password: None
2014-08-27 08:28:47.781 | 2014-08-27 08:28:41.121 9490 TRACE tempest.scenario.test_snapshot_pattern
.
.
-------------

  [1] http://logs.openstack.org/04/117104/2/check/check-grenade-dsvm-partial-ncpu/d3829fe/console.html

Revision history for this message
melanie witt (melwitt) wrote :

Here is what I’ve gathered so far. I looked through a few failed builds and focused on one [0] that uses the metadata service rather than config drive as it gives more clues.

1. The messages about “userdata” in the guest console don’t seem related to the failure i.e. the guest console only shows up in the logs if the build fails. I think it always says "/run/cirros/datasource/data/user-data was not '#!' or executable" or “no userdata for datasource" if no “userdata” is being used, and none is. The ssh keys are part of the metadata in these tests, not the userdata portion of the metadata.

2. In the metadata service log [1], there are zero calls to e.g. "GET /2009-04-04/meta-data/user-data HTTP/1.1" further supporting no userdata relationship.

3. SSH keys are added to the metadata in nova/api/metadata.py by nova itself, so it appears unlikely there is anything wrong there, or at least I didn't see anything unusual. The key is created by a POST to nova [2] and nova creates the key. The key content then appears several times in the log messages of the metadata service (it seems fine, uncorrupted).

4. The error "Exception: Error reading SSH protocol banner[Errno 104] Connection reset by peer" implies a corruption of some kind (given that communication doesn't otherwise seem to be a problem; there's a route) -- this seems consistent with too low an MTU and data occasionally getting truncated. In the log [3], the attempt to connect begins with connection refused (before sshd starts), then changes to authentication failure (likely before the guest has tried to pull the key from the metadata service), then changes to the ssh protocol banner read error. Which sounds like the key was retrieved but it's corrupted (truncated?).

5. Web search for the same error yielded others having problems with mtu setting in the guest, where they can ping but not ssh with key pair, openstack [4] and cirros [5].

Is it at all possible that there's an issue with the MTU of the guest sometimes? It would explain the randomness and the protocol banner errors, if data is getting truncated sometimes. I'm not sure where to go from here; I didn't think anything like this would show up in the guest kernel logs.

[0] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83
[1] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/logs/screen-q-meta.txt.gz
[2] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/console.html#_2014-08-28_18_39_33_546
[3] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/console.html#_2014-08-28_18_39_33_659
[4] https://ask.openstack.org/en/question/32958/unable-to-ssh-with-key-pair/
[5] https://bugs.launchpad.net/cirros/+bug/1301958

summary: - test_volume_boot_pattern fails in grenade with "SSHException: Error
- reading SSH protocol banner[Errno 104] Connection reset by peer"
+ test_volume_boot_pattern fails with "SSHException: Error reading SSH
+ protocol banner[Errno 104] Connection reset by peer"
Joe Gordon (jogo)
Changed in nova:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote : Re: test_volume_boot_pattern fails with "SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer"

I think we should focus on two aspects:
1) Ping works, otherwise we wouldn't get to the SSH test.
2) SSH connections always show authentication failures before 'SSH protocol banner' errors.

I don't know about the MTU possibility, but I wouldn't expect it to happen on single host tests.

Revision history for this message
melanie witt (melwitt) wrote :

I was thinking maybe the auth failure might happen before the guest reads the public key from metadata; then, after it reads a corrupted key, it keeps sending back truncated or otherwise invalid data in response to the SSH connection request. I read more about the paramiko error "Error reading SSH protocol banner[Errno 104]" and it can also mean the remote host didn't send a banner at all (not responding at all, like Salvatore mentioned in comment #10).

I combed the logs some more and didn't find anything useful, so I'm now going to try to reproduce the issue locally using devstack. I'd like to see the logs inside the guest (sshd logs, etc.) after this happens. Which makes me wonder if we could add something to tempest to mount the guest disk when an ssh failure like this happens and capture some of the guest logs for debugging.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Melanie,

We have been discussing this issue in openstack-qa.
Since we too have been unable to find any evidence of issues with user data, we're going to validate the MTU hypothesis you made.

I'm going to push a patch in the gate to match the MTU to cirros'.
On the other hand, a newly patched cirros build with the fix for the bug you pointed out will be released soon.

Revision history for this message
melanie witt (melwitt) wrote :

Salvatore,

Okay. I agree MTU seems unlikely to be the issue but I'm glad if we can rule it out for sure.

Do you think we could do a verbose ssh in the tempest test (like ssh -vvv) to see the details of the exchange when the failure happens?

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I don't think paramiko allows us to do that, and bypassing paramiko in tempest is too much code churn, I think.

I will try to reproduce in a local environment. It should not be too hard, as I can also intercept this failure on the VMware NSX-CI.

Revision history for this message
melanie witt (melwitt) wrote :

@Salvatore,

It looks like paramiko.util.log_to_file(filename, level=10) [0] could capture the verbose information -- it might be nice if we can have it as a separate log file, e.g. "paramiko.txt", that goes alongside syslog.txt etc. The default log level 10 is debug.

[0] http://www.lag.net/paramiko/docs/paramiko.util-module.html#log_to_file
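
A minimal, standalone sketch of how that hook could be used (this is not the tempest patch; the host, user and key path below are placeholders):

-------------
import paramiko

# Send paramiko's transport-level logging (level 10 = DEBUG) to its own file,
# e.g. a "paramiko.txt" artifact that could sit alongside syslog.txt.
paramiko.util.log_to_file('paramiko.txt', level=10)

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    # The banner exchange and every authentication attempt are now traced in
    # paramiko.txt, which is the detail missing from the gate logs so far.
    client.connect('172.24.4.1', username='cirros',
                   key_filename='/path/to/test_keypair.pem', timeout=10)
finally:
    client.close()
-------------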

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Thanks for the pointer, melanie. I'll see locally first how hard it would be and whether it requires changes on the infrastructure side. This is debugging info worth having (unlike the pedantic namespace info we dump, which I never find useful).

Revision history for this message
melanie witt (melwitt) wrote :

Cool. :) I'm trying some things locally in tempest too to see what happens when I call the log_to_file function. If I get something working in tempest, I'll put up a patch (if you haven't already found a way).

summary: - test_volume_boot_pattern fails with "SSHException: Error reading SSH
- protocol banner[Errno 104] Connection reset by peer"
+ SSHException: Error reading SSH protocol banner[Errno 104] Connection
+ reset by peer
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-3 → juno-rc1
Revision history for this message
melanie witt (melwitt) wrote :

FYI I have just put up this patch to send paramiko logs to file during tempest runs:

https://review.openstack.org/#/c/118946

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Thanks melanie - that's good stuff to have.
I have a few local repro environments where I'm running a tweaked tempest that will not destroy the VM to which the SSH connection failed.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I reproduced the failure and I can confirm there is no authorized_keys file in the failing instance.
To reproduce the failure it is sufficient to start an instance with 4 cores and 8GB of memory, launch devstack with a localrc very similar to that of the full neutron test, and then keep running scenario tests.

A tweak for not removing the instance where ssh fails helps a lot: http://paste.openstack.org/show/105982/

Revision history for this message
melanie witt (melwitt) wrote :

Awesome Salvatore, thanks for sharing that patch.

So it's running the latest Cirros 0.3.2 which I see fixed some bugs related to getting metadata [1]. Do you see anything interesting in /var/log/cloud-init.log in the VM?

[1] https://launchpad.net/cirros/trunk/0.3.2

Changed in tempest:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

So this is what I found out.

Instance log from a failing instance [1]. The important bit there is "cirros-apply-local already run per instance", and not "no userdata for datasource" as initially thought. That was just me being stupid and thinking the public key was part of user data. That was really silly.

"cirros-apply-local already run per instance" seems to appear in the console log for all SSH protocol banner failures [2]. the presence of duplicates makes it difficult to prove correlation with SSH protocol banner failures.
However, they key here is that local testing revealing that when the SSH connection fails there is no authorized_keys file in /home/cirros/.ssh. This obviously explains the authentication failure. Whether the subsequent SSH protocol banner errors are due to the cited MTU problems or else it has to be clarified yet.
What is certain is that cirros processes the data source containing the public SSH key before starting sshd. So the auth failures cannot be due to the init process not being yet complete.

The cirros initialization process executes a set of steps on a per-instance basis. These steps include setting public SSH keys.
"On a per-instance basis" means that these steps are not executed at each boot but once per instance.

cirros-apply local [3] is the step which processes, among other things, SSH public keys.
It is called by the cirros-per script [4], which at the end of its execution writes a marker file [5]. The cirros-per process will terminate if, when executed, the marker file is already present [6].

During the failing test it has been observed the following:

from the console log:
[ 3.696172] rtc_cmos 00:01: setting system clock to 2014-09-04 19:05:27 UTC (1409857527)

from the cirros-apply marker directory:
$ ls -le /var/lib/cirros/sem/
total 3
-rw-r--r-- 1 root root 35 Thu Sep 4 13:06:28 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.check-version
-rw-r--r-- 1 root root 22 Thu Sep 4 13:05:07 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.cirros-apply-local
-rw-r--r-- 1 root root 24 Thu Sep 4 13:06:31 2014 instance.197ce1ac-e2df-4d3a-b392-4803383ddf74.userdata

As cirros defaults to MDT (UTC-6), this means the apply-local marker was applied BEFORE this instance boot.
This is consistent with the situation we're seeing, where the failure always occurs after events such as resize or stop.
The SSH public key should be applied during the first boot of the VM. When it's restarted the process is skipped, as the key should already be there. Unfortunately the key isn't there, which is a bit of a mystery, especially since the instance is powered off in a graceful way thanks to [7].

Nevertheless, when an instance receives a shutdown signal it sends a TERM signal to all processes, meaning that the apply-local step spawned by cirros-per at [4] can be killed before it actually writes the key.
However, even though cirros-per retrieves the return code, it writes the marker in any case [5].
This creates the conditions for a situation where the marker can be present without having actually completed the apply-local phase. As a result it is possible to have guests without SSH ...
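
A small Python model of the run-once-per-instance mechanism described above (the real cirros-per and cirros-apply are shell scripts; the paths and names here only mirror the ls output above), showing how the marker can end up present even though the step never finished:

-------------
import os
import subprocess

SEM_DIR = "/var/lib/cirros/sem"  # marker directory from the ls output above


def cirros_per(instance_id, step_name, command):
    """Run 'command' at most once per instance, as cirros-per does."""
    marker = os.path.join(SEM_DIR, "instance.%s.%s" % (instance_id, step_name))
    if os.path.exists(marker):
        # Marker already present: the step is skipped on every later boot.
        return
    rc = subprocess.call(command)
    # The condition described above: the marker is written regardless of rc,
    # and regardless of whether 'command' was killed by the shutdown's TERM
    # signal, so an interrupted cirros-apply-local is never retried and
    # authorized_keys may never be written.
    with open(marker, "w") as f:
        f.write("return code: %d\n" % rc)


# e.g. cirros_per("197ce1ac-...", "cirros-apply-local", ["cirros-apply", "local"])
-------------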


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/119267

Changed in tempest:
assignee: Salvatore Orlando (salvatore-orlando) → Joe Gordon (jogo)
status: New → In Progress
assignee: Joe Gordon (jogo) → Matthew Treinish (treinish)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/119268

Changed in neutron:
status: New → Incomplete
Changed in nova:
status: Confirmed → Incomplete
Changed in grenade:
status: New → Incomplete
Changed in tempest:
assignee: Matthew Treinish (treinish) → Joe Gordon (jogo)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/119267
Reason: duplicate of https://review.openstack.org/#/c/119268

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/119268
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=cd879c5287f4c260b1ec29e593dcad3efcfe5af7
Submitter: Jenkins
Branch: master

commit cd879c5287f4c260b1ec29e593dcad3efcfe5af7
Author: Matthew Treinish <email address hidden>
Date: Thu Sep 4 20:41:48 2014 -0400

    Verify network connectivity before state check

    This commit adds an initial ssh connection after bringing a server up
    in setUp. This should ensure that the image has a chance to initialize
    prior to messing with it's state. The test's here are to verify that
    after performing a nova operation on a running instance network
    connectivity is retained. However, it's is never checked that we can
    connect to the server in the first place. A probable cause for the
    constant ssh failures in these tests is that the server hasn't had a
    finish it's cloud-init (or cirros-init) stage when we're stopping it,
    this should also fix those issues.

    Change-Id: I126fd4943582c4b759b3cc5a67babaa8d062fb4d
    Partial-Bug: #1349617
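
For readers skimming the thread, a hedged sketch of the pattern this change introduces (illustrative only, not the merged diff; the helper names are simplified versions of the ones seen in the tracebacks above):

-------------
def boot_and_verify_ssh(scenario_test):
    """Bring up a server and prove SSH auth works before any state change.

    'scenario_test' stands in for a tempest scenario test instance; the helper
    names mirror get_remote_client / validate_authentication from the
    tracebacks earlier in this bug, but this is a sketch, not the real patch.
    """
    server = scenario_test.create_server(wait_until='ACTIVE')
    fip = scenario_test.create_floating_ip(server)
    # If the guest has not finished its cirros-init/cloud-init stage yet, the
    # failure shows up here, instead of after a stop/resize/snapshot where it
    # used to look like the operation itself broke networking.
    client = scenario_test.get_remote_client(fip['ip'])
    client.validate_authentication()
    return server, fip
-------------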

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

No failures in neutron jobs since the patch merged (11 hours now).
3 failures in grenade-partial-ncpu (in gate).
The patch was not expected to fix the grenade job. If I'm not mistaken this job runs icehouse n-cpu on the 'new' part of grenade, and therefore the failure might occur because the instance is being abruptly shut down and then resumed.

Changed in neutron:
milestone: juno-rc1 → none
assignee: Salvatore Orlando (salvatore-orlando) → nobody
Revision history for this message
Baodong (Robert) Li (baoli) wrote : Default vnic_type, RE: https://bugs.launchpad.net/neutron/+bug/1370077

Hi Irena,

Do you remember why default vnic_type was not set in neutron when you were working on adding vnic_type into the port binding? Is there any reason not to do that? As you know, nova depends on this information to determine if sr-iov port should be allocated. Just want to check with you for the fix to 1370077.

Thanks,
Robert

Revision history for this message
Irena Berezovsky (irenab) wrote :

Hi Robert,
vnic_type was added to neutron to be used with ML2.
You can also see it in the blueprint description: https://blueprints.launchpad.net/neutron/+spec/ml2-request-vnic-type

I second Salvatore's suggestion to default nova to VNIC_NORMAL, if binding:vnic_type is not specified by neutron.
Cheers,
Irena


Revision history for this message
Baodong (Robert) Li (baoli) wrote :

Hi Irena,

I was thinking about doing it from Nova side as well. In that case, I will close 1370077 and create one from Nova side.

—Robert


Joe Gordon (jogo)
Changed in tempest:
assignee: Joe Gordon (jogo) → nobody
status: In Progress → New
Changed in nova:
milestone: none → juno-rc1
Revision history for this message
Joe Gordon (jogo) wrote :

Unclear if this is fixed or not; there was a single hit in the check queue on September 15th. No hits in the gate queue in over a week.

Changed in nova:
importance: Critical → Undecided
Revision history for this message
Joe Gordon (jogo) wrote :

It looks like https://review.openstack.org/#/c/119268/ may have fixed it.

Changed in tempest:
status: New → Confirmed
assignee: nobody → Matthew Treinish (treinish)
status: Confirmed → Fix Committed
Changed in nova:
milestone: juno-rc1 → none
Revision history for this message
Matthew Treinish (treinish) wrote :

affects: tempest
status: fixreleased

Changed in tempest:
importance: Undecided → Critical
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/137096

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/137096
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=1fd223e750048f8f39dea2f1b3fc6c73ff0b27d1
Submitter: Jenkins
Branch: master

commit 1fd223e750048f8f39dea2f1b3fc6c73ff0b27d1
Author: Matt Riedemann <email address hidden>
Date: Tue Nov 25 07:16:09 2014 -0800

    Skip test_volume_boot_pattern until bug 1373513 is fixed

    Between the races to delete a volume and hitting timeouts because things
    are hanging with lvm in Cinder and the various SSH timeouts, this test
    is a constant burden.

    The SSH problems have been around for a long time and don't seem to be
    getting any new attention.

    The Cinder volume delete hangs have also been around for awhile now and
    don't seem to be getting much serious attention, so until the Cinder
    volume delete hangs are fixed (or at least getting some serious
    attention), let's just skip this test scenario.

    Related-Bug: #1373513
    Related-Bug: #1370496
    Related-Bug: #1349617

    Change-Id: Idb50bcdbc9683d322e9292abf50404e885a11a8e

Revision history for this message
Nell Jerram (neil-jerram) wrote :

I'm seeing a problem that appears to map to this bug, and I'm unclear whether that's expected (i.e. because there are parts of this bug for which fixes have not yet propagated everywhere), or if my problem should be reported as new.

Specifically, in the check-tempest-dsvm-docker check for https://review.openstack.org/#/c/146914/, I'm seeing:

2015-01-13 21:38:10.693 | Traceback (most recent call last):
2015-01-13 21:38:10.693 | File "tempest/test.py", line 112, in wrapper
2015-01-13 21:38:10.693 | return f(self, *func_args, **func_kwargs)
2015-01-13 21:38:10.693 | File "tempest/scenario/test_snapshot_pattern.py", line 72, in test_snapshot_pattern
2015-01-13 21:38:10.693 | self._write_timestamp(fip_for_server['ip'])
2015-01-13 21:38:10.693 | File "tempest/scenario/test_snapshot_pattern.py", line 51, in _write_timestamp
2015-01-13 21:38:10.693 | ssh_client = self.get_remote_client(server_or_ip)
2015-01-13 21:38:10.693 | File "tempest/scenario/manager.py", line 317, in get_remote_client
2015-01-13 21:38:10.693 | linux_client.validate_authentication()
2015-01-13 21:38:10.694 | File "tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
2015-01-13 21:38:10.694 | self.ssh_client.test_connection_auth()
2015-01-13 21:38:10.694 | File "tempest/common/ssh.py", line 151, in test_connection_auth
2015-01-13 21:38:10.694 | connection = self._get_ssh_connection()
2015-01-13 21:38:10.694 | File "tempest/common/ssh.py", line 88, in _get_ssh_connection
2015-01-13 21:38:10.694 | password=self.password)
2015-01-13 21:38:10.694 | SSHTimeout: Connection to the 172.24.4.1 via SSH timed out.
2015-01-13 21:38:10.694 | User: cirros, Password: None

A search maps that symptom to https://bugs.launchpad.net/grenade/+bug/1362554, which is a duplicate of this one.

Please can you advise whether this is expected, or something new?

Thanks - Neil

Revision history for this message
Rick Chen (rick-chen) wrote :

I got the same issue in my OpenStack CI. Please advise, thanks.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Looking through the comments, I am unsure whether there really is a bug in cirros involved here, or whether the issue was only triggered by the instance being stopped too quickly during cloud-init.

Changed in cirros:
status: New → Incomplete
Changed in neutron:
importance: High → Undecided
Sean Dague (sdague)
Changed in grenade:
status: Incomplete → Invalid
Revision history for this message
Augustina Ragwitz (auggy) wrote :

Nova: Fix was released for related bug https://bugs.launchpad.net/nova/+bug/1532809

https://review.openstack.org/#/c/273042/

If this issue rears up again, please open a new bug for Nova.

Changed in nova:
status: Incomplete → Fix Released
assignee: nobody → Augustina Ragwitz (auggy)
Scott Moser (smoser)
no longer affects: cirros
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.
