Intermittent failure running new instances

Bug #1012595 reported by Thierry Carrez
This bug affects 1 person
Affects: OpenStack Core Infrastructure
Status: Fix Released
Importance: Critical
Assigned to: James E. Blair
Milestone:

Bug Description

There is an intermittent failure in devstack-gate that prevents tests from passing, and it probably points to a real non-deterministic bug.

I narrowed it down to a failure running new instances, with the following error in do_run_instances:

libvir: QEMU error : Domain not found: no domain with matching name 'instance-00000001'

Logs can be seen in various devstack-gate job failures at:
https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm

Including:
https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm/5078/
https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm/5044/

I'll try to pinpoint when this first started.

Thierry Carrez (ttx) wrote :

First occurrence seems to be:

https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm/5038/
June 12, 19:01 UTC

However, since the failure is intermittent and "devstack-gate tests have passed recently without actually running any tests", this is only a fuzzy match.

Thierry Carrez (ttx) wrote :

All failures are on hpcloud machines and all passes are on rackspace machines, so this may actually be something linked to the test setup...

Last hpcloud OK run is 5037 (June 12, 18:57 UTC).
First hpcloud fail is 5038 (June 12, 19:01 UTC). Since then, every hpcloud run has failed.

Moving to openstack-ci until we can trace this to something in Nova.
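
The per-provider comparison above can be sketched in a few lines. This is only an illustration, assuming each run is available as a (build number, provider, passed) record; the function name `summarize_by_provider` and the record layout are made up for this example, and the rackspace build number is hypothetical.

```python
from collections import defaultdict


def summarize_by_provider(runs):
    """Group (build, provider, passed) records by provider.

    Returns, per provider, the highest passing build number and the
    lowest failing build number (None when there is no such build).
    """
    by_provider = defaultdict(lambda: {"passed": [], "failed": []})
    for build, provider, passed in runs:
        by_provider[provider]["passed" if passed else "failed"].append(build)
    return {
        provider: {
            "last_pass": max(results["passed"], default=None),
            "first_fail": min(results["failed"], default=None),
        }
        for provider, results in by_provider.items()
    }


runs = [
    (5037, "hpcloud", True),     # last hpcloud OK run, from this comment
    (5038, "hpcloud", False),    # first hpcloud fail, from this comment
    (5039, "rackspace", True),   # hypothetical record, for illustration only
]
summary = summarize_by_provider(runs)
```

A provider whose first failure immediately follows its last pass, while another provider keeps passing, hints at the test environment rather than the code under test.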

affects: nova → openstack-ci
Changed in openstack-ci:
milestone: folsom-2 → none
Thierry Carrez (ttx) wrote :

It is as if virtualization support had been removed from hpcloud-precise machines around June 12, 19:00 UTC.
I suggest we disable hpcloud machines while we investigate, to remove the false negatives.

James E. Blair (corvus) wrote :

They have been removed from the pool.

Before June 12, a flaw in hpcloud machines was causing tests to pass without being run. That's when we worked around that issue. So the failures could have been happening before then, and possibly since the time we switched to precise images.

Since devstack is configuring bare QEMU virtualization, it's not clear to me what kind of support could be missing to cause that.

Changed in openstack-ci:
assignee: nobody → James E. Blair (corvus)
James E. Blair (corvus) wrote :

It looks like this was due to hpcloud not adding the hostname to /etc/hosts on their precise images. The devstack gate script now does that.
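
For illustration, a minimal sketch of that kind of workaround follows. It is not the actual devstack gate script: the function name `ensure_host_entry` is hypothetical, and the address 127.0.1.1 is an assumption based on the usual Debian/Ubuntu convention for the local hostname. It works on the text of a hosts file rather than on /etc/hosts directly, so it can be tried without root.

```python
def ensure_host_entry(hosts_text, hostname, address="127.0.1.1"):
    """Return hosts_text with a mapping for hostname, appended only if missing.

    The default address is an assumption (Debian/Ubuntu convention),
    not something taken from the gate script.
    """
    for line in hosts_text.splitlines():
        # Ignore comments, then split into the address and its names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            return hosts_text  # already resolvable without DNS
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{address} {hostname}\n"


before = "127.0.0.1 localhost\n"
after = ensure_host_entry(before, "devstack-node")  # hypothetical hostname
```

Running it a second time on its own output leaves the text unchanged, so it is safe to call at the start of every job.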

Changed in openstack-ci:
status: Confirmed → Fix Released
milestone: none → folsom
Soren Hansen (soren) wrote :

Just to record it for posterity:

cloud-init in Oneiric (and only in Oneiric; see bug 890501 and bug 871966 for context) would add an entry to /etc/hosts, which is why the problem didn't exist there. In Precise, there's no entry for $fqdn in /etc/hosts, so we rely on DNS to look it up. However, the DNS lookup coincided with the network reconfiguration that happens when the first instance is run on a nova-compute node.

The original analysis was thus wrong. Seeing things like:

libvir: QEMU error : Domain not found: no domain with matching name 'instance-00000001'

is perfectly normal and expected behaviour prior to the first launch of an instance.
