baremetal driver needs a state between "building" and "deploying"

Bug #1184470 reported by aeva black
This bug affects 1 person
Affects                    Status         Importance   Assigned to      Milestone
Ironic                     Fix Released   Medium       aeva black
OpenStack Compute (nova)   Fix Released   Medium       Sahid Orentino

Bug Description

It is not possible to tell from the baremetal node status that a deployment has failed because a machine's BIOS hung or was improperly configured. This would be discernible with an additional state change between BUILDING and DEPLOYING.

Details
=====

During a baremetal deployment, the state is tracked in the nova_bm.bm_nodes table. The state is set to BUILDING when the driver's spawn() method in nova/virt/baremetal/driver.py acquires the node and begins preparing the deployment. After the power_driver's activate_node() method is called, the PXE driver enters a wait loop until the deployment finishes. The state is changed to DEPLOYING when baremetal-deploy-helper receives a connection from the deployment ramdisk, and is then set to either DEPLOYDONE or DEPLOYFAIL, accordingly.

There is a middle step which is not currently represented. If the baremetal node powers on but never connects to the deploy-helper, it is impossible to tell from the database whether the deploy environment was not created or whether the machine is dead.
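The ambiguity can be illustrated with a minimal sketch. The state names are taken from this report; the string values and the `observed_state()` helper are assumptions made for illustration and are not nova code.

```python
# State names from this report; the string values are illustrative only.
BUILDING = "building"
DEPLOYING = "deploying"


def observed_state(environment_prepared, powered_on, ramdisk_connected):
    """Hypothetical helper: the state the database shows when a deploy stalls.

    Without an intermediate state, the node stays in BUILDING until the
    deployment ramdisk connects to the deploy helper.
    """
    if environment_prepared and powered_on and ramdisk_connected:
        return DEPLOYING
    return BUILDING


# Two very different failures leave the same trace in the database.
env_failure = observed_state(False, False, False)   # deploy environment never created
dead_machine = observed_state(True, True, False)    # powered on, BIOS hung
print(env_failure == dead_machine)
```

Both scenarios report BUILDING, so an operator reading the table cannot distinguish a failure on the nova side from a failure on the hardware side.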

Proposed fix
==========

Add a PREPARED state to baremetal_states.py, and set the node to this state immediately after calling activate_node().
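A rough sketch of the proposed ordering follows. It assumes a simplified `spawn()` that works on a plain dict and a stand-in `FakePowerDriver` class; both are hypothetical, and only the names PREPARED, BUILDING, and activate_node() come from this report.

```python
BUILDING = "building"
PREPARED = "prepared"  # proposed new state; the string value is an assumption


class FakePowerDriver:
    """Hypothetical stand-in for a power driver; only records the power-on request."""

    def __init__(self):
        self.powered_on = False

    def activate_node(self):
        self.powered_on = True


def spawn(node, power_driver):
    """Simplified, hypothetical spawn flow showing where PREPARED would be set."""
    history = node.setdefault("state_history", [])

    # The node is acquired and the deployment is being prepared.
    node["task_state"] = BUILDING
    history.append(BUILDING)

    # Power on the machine, then record immediately that preparation finished.
    power_driver.activate_node()
    node["task_state"] = PREPARED
    history.append(PREPARED)
    return node


node = spawn({}, FakePowerDriver())
print(node["state_history"])
```

With this ordering, a node that stalls in PREPARED is known to have been powered on with its deploy environment in place, which points at the machine itself rather than at the preparation step.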

Tags: baremetal
aeva black (tenbrae)
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: baremetal
aeva black (tenbrae)
Changed in nova:
milestone: none → havana-2
Changed in nova:
milestone: havana-2 → havana-3
Changed in nova:
milestone: havana-3 → none
Changed in nova:
assignee: nobody → sahid (sahid-ferdjaoui)
Sahid Orentino (sahid-ferdjaoui) wrote :

The code has changed since this was reported, but I think your proposal is still worth adding.

What do you think about setting this state just after the call at:
https://github.com/openstack/nova/blob/master/nova/virt/baremetal/driver.py#L250

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/50348

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: sahid (sahid-ferdjaoui) → nobody
Tom Fifield (fifieldt)
Changed in nova:
status: In Progress → Confirmed
Changed in nova:
assignee: nobody → sahid (sahid-ferdjaoui)
Changed in nova:
status: Confirmed → In Progress
aeva black (tenbrae) wrote :

I took a look at the Ironic PXE driver's handling of this sort of situation, and while I think it's OK and not affected by the precise circumstances described in this bug, I think there may be some similar difficulty in determining why a deploy failed part-way through.

I've tagged the bug as also-affecting and will look into it.

Changed in ironic:
assignee: nobody → Devananda van der Veen (devananda)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/50348
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ce2d580106dc04315e71790c98afee062f87351b
Submitter: Jenkins
Branch: master

commit ce2d580106dc04315e71790c98afee062f87351b
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Tue Oct 8 13:25:49 2013 +0000

    Adds a PREPARED state after baremetal node power on.

    During a baremetal deployment there is a middle step which
    is not currently represented. If the baremetal node powers on
    but never connects to the deploy-helper, it is impossible to tell
    from the database whether the deploy environment was not created
    or whether the machine is dead.

    Change-Id: I6be3d45fee28970cbb02945c518be34b2bc74689
    Closes-Bug: #1184470

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/63037
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=972855e7314c95a07c8483b33138a7a2de8c371c
Submitter: Jenkins
Branch: master

commit 972855e7314c95a07c8483b33138a7a2de8c371c
Author: Devananda van der Veen <email address hidden>
Date: Wed Dec 18 16:57:58 2013 -0800

    Improve error handling in PXE _continue_deploy

    Related to bug 1184470, there was a concern that the PXE driver
    may not be adequately handling errors and informing users when failures
    occur mid-deploy.

    This patch refactors the _continue_deploy() method to handle both errors
    POSTed from the ramdisk and errors that originate within deploy_utils.

    It also fixes an inconsistency in the final provisioning_state:
    ConductorManager.do_node_deploy() will set provisioning_state = ACTIVE,
    however the PXE driver was leaving nodes with state = DEPLOYDONE.

    Change-Id: I29cbff87cbaf85d95687ae094720f8b99f33b65f
    Related-bug: 1184470

Changed in nova:
milestone: none → icehouse-2
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
aeva black (tenbrae) wrote :

Ironic uses a "wait-callback" state as an optional intermediate state if a deploy driver needs to wait for a callback from the node / deploy agent. Closing this bug for Ironic now.

Changed in ironic:
status: New → Fix Committed
importance: Undecided → Medium
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → icehouse-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-2 → 2014.1
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-rc1 → 2014.1