Race condition: boot a server on a network created by Heat

Bug #1546201 reported by Jordan Pittier
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Heat
Fix Released
High
Crag Wolfe

Bug Description

Hi,

From time to time the tempest test "tempest.api.orchestration.stacks.test_neutron_resources.NeutronResourcesTestJSON" fails with the exception

tempest.exceptions.StackBuildErrorException: Stack xxx is in CREATE_FAILED status due to 'Resource CREATE failed: BadRequest: resources.Server: Network xxx requires a subnet in order to boot instances on.

Example of such a failure in [1]

The Heat stack for this test is [2]

A logstack query for this is "message:"requires a subnet in order to boot instances on." AND filename:"logs/screen-h-eng.txt"

[1]: http://logs.openstack.org/06/278406/3/check/gate-tempest-dsvm-layer4/40648e5/console.html.gz
[2]: https://github.com/openstack/tempest/blob/22ed3002f2e658cbccf08e02ff36f8994838b699/tempest/api/orchestration/stacks/templates/neutron_basic.yaml

Tags: gate-failure
Rico Lin (rico-lin)
description: updated
Angus Salkeld (asalkeld)
tags: added: gate-failure
Changed in heat:
importance: Undecided → High
Revision history for this message
Dmitriy (duvarenkov) wrote :

Could not reproduce this bug. Maybe its already fixed?

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

I doubt it. The test is now skipped in the Gate [1] so it's hard to gather data.

It would be great, if He

[1]: https://github.com/openstack/tempest/commit/e16d08edff7c4a37e359fd0a3cadfcc005db745d

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

It would be great, if Heat could take back ownership of these tempest tests (using a tempest plugin). Now Tempest can''t enable this flaky test because if will hurt a lot of projects.

Zane Bitter (zaneb)
Changed in heat:
status: New → Triaged
Revision history for this message
Zane Bitter (zaneb) wrote :

We have some code to work around Neutron's horrible API and add an implicit dependency on any Subnets on the same network:

http://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resources/openstack/nova/server.py?h=stable%2Fmitaka#n1045

What I suspect has happened is that the new translation rules to get rid of the separate 'network' and 'network_id' properties in OS::Neutron::Subnet mean that:

                subnet_net = (res.properties.get(subnet.Subnet.NETWORK_ID)
                              or res.properties.get(subnet.Subnet.NETWORK))

is no longer the correct logic. Note that the equivalent in OS::Neutron::Port has been changed:

http://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resources/openstack/neutron/port.py?h=stable%2Fmitaka#n380

The race is probably very hard to reproduce, but it should be easy to detect if the dependency is being added correctly with some judiciously-placed logging.

> It would be great, if Heat could take back ownership of these tempest tests (using a tempest plugin).

I agree, but unfortunately the Tempest reviewers amongst others have prevailed upon the TC to take the opposite view: https://review.openstack.org/#/c/312718/

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

An idea to reproduce: try to revert https://github.com/openstack/tempest/commit/e16d08edff7c4a37e359fd0a3cadfcc005db745d and recheck 10 times to see if that test still fails sometimes. But that's time consuming.

Revision history for this message
Zane Bitter (zaneb) wrote :

I'm not worried about reproducing. Either the dependency is added and there's no race, or it isn't and there's a race. It's trivial to check whether the dependency is there - if it wasn't and we fix it so it is then there's no need to try to run a lot of races and see if we ever lose.

Changed in heat:
milestone: none → newton-2
Revision history for this message
Crag Wolfe (cwolfe) wrote :

The correct choice ideally is the non-deprecated res.properties.get(subnet.Subnet.NETWORK), and s/network_id/network/ in http://git.openstack.org/cgit/openstack/tempest/tree/tempest/api/orchestration/stacks/templates/neutron_basic.yaml#n27 .

However, I'm having trouble seeing how just using .NETWORK could reliably work. res.properties.get(subnet.Subnet.NETWORK) will return the network id, which implies it must have been created in neutron to get a non-None value.

Assume that no networks or subnets already exists for a given template. When we are building dependencies for a OS::Nova::Server that has a subnet, we may not be able to resolve the subnet's network id (i.e., we are adding dependencies before the physical resource for the network has been created), so that dependency is never created in http://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resources/openstack/nova/server.py?h=stable%2Fmitaka#n1059 .

The best we can do here may be to add all networks in the template as a dependency if a server has a subnet where res.properties.get(subnet.Subnet.NETWORK) == None. (That also implies that subnets and networks should never depend on a server to avoid loops, so additional validation is probably a good idea)

Revision history for this message
Zane Bitter (zaneb) wrote :

I think res.properties.get(subnet.Subnet.NETWORK) can return either a name or an ID, can't it?

But you're right, if it's an ID and it's created in the template then it will be None. Meaning we'll probably end up with a dependency on all Subnets (which is not terrible in most cases).

The better placeholder stuff for validation we talked about at the summit should help improve this in future. This whole area is a mess at the moment - see https://review.openstack.org/#/c/289371/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/333062

Changed in heat:
assignee: nobody → Crag Wolfe (cwolfe)
status: Triaged → In Progress
Revision history for this message
Crag Wolfe (cwolfe) wrote :

TL;DR Yes, I still think we need a dependency on (potentially) all the Subnets.

I've dug in with some debug statements for
the existing code (against master, not against my patch) for the
tempest
tempest.api.orchestration.stacks.test_neutron_resources.NeutronResourcesTestJSON
case. My patch with the debug statements: https://github.com/cwolferh/heat-scratch/commit/b1f866884ab6d21868d5f3c0e610b1b3f5d8fb4a
The debug output: http://paste.openstack.org/show/521845/

A couple of observations:

1. The Server resource is not able to determine a subnet's network
before that Server is actually created when it is going through
add_dependencies(). (Look at "os_nova_server: subnet_net" debug
statements)

2. The Server ends up getting created after the Subnet (at least in
this particular case), but that is not because the Server added the
Subnet as dependency.

So, I feel like something is off here in the way we are building the
graph of resources to create -- the server should have added the
subnet dependency before it was created. That is what motivates my
current patch, though maybe it only needs to be as defensive as it is
during INIT (I say maybe because I don't have great faith in
finder='find_resourceid_by_name_or_id' if there is a transient issue
with neutron).

Note that later, *after* the Server and Subnet have been created, we
do see expected values in the "os_nova_server: subnet_net" debug
statement and the implicit dependency is added, but... we've already
created the resource by then...

Aside: I also tried s/network_id/network/ here:
https://review.openstack.org/gitweb?p=openstack/tempest.git;a=blob;f=tempest/api/orchestration/stacks/templates/neutron_basic.yaml;hb=HEAD#l27
with the same result.

Revision history for this message
Crag Wolfe (cwolfe) wrote :

I take back my INIT comment above -- the existing review should do the right thing in the INIT case when we can't resolve networks, or other cases where we can (or maybe even other cases where should but don't due to a transient neutron issue, for instance).

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

The subnet property has been available since Liberty for OS::Nova::Server, so I think its time we considered deprecating the network property now that Kilo is EOL.

This means we could fix the template test just by adding a subnet reference. I'll propose a change which does that.

Revision history for this message
Crag Wolfe (cwolfe) wrote :

I agree that will fix the template test. It doesn't fix the issue that template test raises though if a user happens to create a template similar to the current one. If we want that kind of template to fail validation, maybe that should be another bug. I'm not totally sure it should be considered an invalid case though -- the user may not really care which subnet the server gets associated with if there are multiple subnets, and the user may want to create the network/subnets/server all in one template.

Revision history for this message
Steve Baker (steve-stevebaker) wrote : Re: [Bug 1546201] Re: Race condition: boot a server on a network created by Heat

On 28/06/16 04:06, Crag Wolfe wrote:
> I agree that will fix the template test. It doesn't fix the issue that
> template test raises though if a user happens to create a template
> similar to the current one. If we want that kind of template to fail
> validation, maybe that should be another bug. I'm not totally sure it
> should be considered an invalid case though -- the user may not really
> care which subnet the server gets associated with if there are multiple
> subnets, and the user may want to create the network/subnets/server all
> in one template.
>
I still think we should fix the bug, but I also think we should signal
to the user to avoid the network property by deprecating it.

And I think the tempest test should do what we recommend, which is
explicitly specify what subnet they want to connect to.

Thomas Herve (therve)
Changed in heat:
milestone: newton-2 → newton-3
Revision history for this message
Steve Baker (steve-stevebaker) wrote :
Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/333062
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=668ceb6cbfe96624a20f138e53cc1555440f3a2b
Submitter: Jenkins
Branch: master

commit 668ceb6cbfe96624a20f138e53cc1555440f3a2b
Author: Crag Wolfe <email address hidden>
Date: Wed Jun 22 18:58:10 2016 -0400

    Careful with OS::Nova::Server dependencies on subnets

    A server within a template may depend on networks and, implicitly,
    subnets, created within the same template. When we are calculating
    dependencies for the server, play it safe and add any subnets as
    dependencies whose network is not yet known. Note, this defensiveness
    only takes places when subnets specify their network by reference,
    rather by network id or name.

    Change-Id: Ifda0a70b40e6f255b9ee462e8af7f11a0b2d5d6c
    Closes-Bug: #1546201

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 7.0.0.0rc1

This issue was fixed in the openstack/heat 7.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.