OpenStack Core Infrastructure

Intermittent failure running new instances

Bug #1012595 reported by Thierry Carrez on 2012-06-13

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
OpenStack Core Infrastructure	Fix Released	Critical	James E. Blair	OpenStack Core Infrastructure folsom

Bug Description

There is an intermittent failure in devstack-gate that prevents tests from passing, but it probably points to a real non-deterministic bug.

I narrowed it down to a failure running new instances, with the following error in do_run_instances:

libvir: QEMU error : Domain not found: no domain with matching name 'instance-00000001'

Logs can be seen in various devstack-gate job failures at:
https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm

Including:
https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm/5078/
https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm/5044/

I'll try to pinpoint when this first started.

Thierry Carrez (ttx) wrote on 2012-06-13:

#1

First occurrence seems to be:

https://jenkins.openstack.org/job/gate-integration-tests-devstack-vm/5038/
June 12, 19:01 UTC

However, since the failure is intermittent, and "devstack-gate tests have passed recently without actually running any tests", this is a fuzzy match.

Thierry Carrez (ttx) wrote on 2012-06-13:

#2

All failures are on hpcloud machines and all passes are on rackspace machines, so this may actually be something linked to the test setup...

Last hpcloud OK run is 5037 (June 12, 18:57 UTC).
First hpcloud fail is 5038 (June 12, 19:01 UTC). Since then, all hpcloud runs have failed.

Moving to openstack-ci until we can pin this on something in Nova.

affects:	nova → openstack-ci
Changed in openstack-ci:
milestone:	folsom-2 → none

Thierry Carrez (ttx) wrote on 2012-06-13:

#3

It looks as if virtualization support was removed from hpcloud-precise machines around June 12, 19:00 UTC.
I suggest we disable hpcloud machines while we investigate, to remove the false negatives.

James E. Blair (corvus) wrote on 2012-06-13:

#4

They have been removed from the pool.

Before June 12, a flaw in hpcloud machines was causing tests to pass without being run. June 12 is when we worked around that issue. So the failures could have been happening before then, possibly ever since we switched to precise images.

Since devstack is configuring bare QEMU virtualization, it's not clear to me what kind of support could be missing to cause that.

Changed in openstack-ci:
assignee:	nobody → James E. Blair (corvus)

James E. Blair (corvus) wrote on 2012-06-13:

#5

It looks like this was due to hpcloud not adding the hostname to /etc/hosts on their precise images. The devstack gate script now does that.
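For illustration only, the kind of fix described above can be sketched in Python. This is not the actual devstack gate script, which is not shown in this report; the function name `ensure_host_entry`, the default address `127.0.0.1`, and the sample hostname are all assumptions made for the example.

```python
def ensure_host_entry(hosts_text: str, hostname: str,
                      address: str = "127.0.0.1") -> str:
    """Return hosts_text with a line mapping `address` to `hostname`.

    Hypothetical helper: the text is returned unchanged when the
    hostname already appears as a name on any non-comment line.
    """
    for line in hosts_text.splitlines():
        # Strip a trailing comment, then split into address + names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            return hosts_text  # already resolvable from the file
    # Make sure the new entry starts on its own line.
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{address} {hostname}\n"
```

Applied to the contents of /etc/hosts with the machine's own hostname, the result could be written back once at node setup time. The function is idempotent, so running it on every job would not pile up duplicate entries.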

Changed in openstack-ci:
status:	Confirmed → Fix Released
milestone:	none → folsom

Soren Hansen (soren) wrote on 2012-06-14:

#6

Just to record it for posterity:

cloud-init in Oneiric (and only there; see bug 890501 and bug 871966 for context) would add an entry to /etc/hosts, which is why the problem didn't exist there. In Precise, there's no entry for $fqdn in /etc/hosts, so we rely on DNS to look that up. However, the DNS lookup coincided with the reconfiguration of the network that happens when the first instance is run on a nova-compute node.

The original analysis was thus wrong. Seeing things like:

libvir: QEMU error : Domain not found: no domain with matching name 'instance-00000001'

is perfectly normal and expected behaviour prior to the first launch of an instance.
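A rough way to check whether a machine is exposed to this, assuming the resolver consults /etc/hosts before DNS (the usual `files dns` ordering), is to test whether the FQDN is listed in the hosts file at all. The sketch below is illustrative; the helper names `names_in_hosts`, `depends_on_dns` and `local_fqdn_depends_on_dns` are hypothetical.

```python
import socket


def names_in_hosts(hosts_text: str) -> set:
    """Collect every hostname and alias listed in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        # Ignore comments; the first field is the address, the rest are names.
        fields = line.split("#", 1)[0].split()
        names.update(name.lower() for name in fields[1:])
    return names


def depends_on_dns(fqdn: str, hosts_text: str) -> bool:
    """True when `fqdn` is absent from the hosts text, so a lookup
    would have to fall through to DNS (under the assumption above)."""
    return fqdn.lower() not in names_in_hosts(hosts_text)


def local_fqdn_depends_on_dns(hosts_path: str = "/etc/hosts") -> bool:
    """Run the check for this machine's own FQDN."""
    with open(hosts_path) as handle:
        return depends_on_dns(socket.getfqdn(), handle.read())
```

If `local_fqdn_depends_on_dns()` returns `True` on a compute node, a lookup of the node's own name could fail while the network is being reconfigured, which matches the behaviour described in this comment.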
