OpenStack Compute (nova)

Instance stuck in reboot on libvirt failure

Bug #1002814 reported by Mandar Vaze on 2012-05-22

This bug affects 1 person

Affects:	OpenStack Compute (nova)
Status:	Fix Released
Importance:	High
Assigned to:	Vish Ishaya
Milestone:	OpenStack Compute (nova) 2012.2 "folsom"

Bug Description

Let's say an instance reboot is in progress. At some point, the libvirt driver is asked to reboot the actual VM. If the VM itself has disappeared by then, the instance stays stuck in the "rebooting" state forever.

(This was discovered accidentally when libvirtd was killed with "sudo kill -9" while a reboot was in progress. Afterwards, "virsh list" did not list any instances.)

Refer to the following code snippet from nova/virt/libvirt/connection.py:

        def _wait_for_reboot():
            """Called at an interval until the VM is running again."""
            try:
                state = self.get_info(instance)['state']
            except exception.NotFound:
                LOG.error(_("During reboot, instance disappeared."),
                          instance=instance)
                raise utils.LoopingCallDone

            if state == power_state.RUNNING:
                LOG.info(_("Instance rebooted successfully."),
                         instance=instance)
                raise utils.LoopingCallDone

Here the exception.NotFound block should NOT raise "utils.LoopingCallDone", which indicates that the operation completed successfully and breaks the infinite loop in utils.LoopingCall. Because of this, nova-compute never learns that the VM has vanished.

Instead it should just "raise" or "raise exception.NotFound". This will (hopefully) propagate the exception so that nova-compute catches it and marks the instance as error.

Instances stuck in "rebooting" can't be deleted. Since the VM has already disappeared, marking the instance as Error (thus allowing delete) seems like the correct solution.
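
To illustrate the difference between the two behaviours, here is a minimal standalone sketch. It does not use nova: NotFound, LoopingCallDone, run_looping_call and make_wait_for_reboot are simplified stand-ins invented for this example, and the real utils.LoopingCall works differently in detail. It only shows that raising LoopingCallDone hides the failure, while a bare "raise" lets the caller see it.

```python
class NotFound(Exception):
    """Stand-in for nova's exception.NotFound (assumption)."""


class LoopingCallDone(Exception):
    """Stand-in for utils.LoopingCallDone: signals a clean stop."""


def run_looping_call(func, max_polls=10):
    """Simplified poller: call func until it raises LoopingCallDone.

    Any other exception propagates to the caller.
    """
    for _ in range(max_polls):
        try:
            func()
        except LoopingCallDone:
            return "done"
    return "timed out"


def make_wait_for_reboot(get_state, swallow_not_found):
    """Build a polling callback with either the old or the proposed handling."""
    def _wait_for_reboot():
        try:
            state = get_state()
        except NotFound:
            if swallow_not_found:
                # Behaviour quoted in the report: looks like a clean finish.
                raise LoopingCallDone
            # Proposed behaviour: re-raise so the caller sees the failure.
            raise
        if state == "running":
            raise LoopingCallDone
    return _wait_for_reboot


def vanished():
    """Simulates a VM that no longer exists."""
    raise NotFound("instance disappeared")


# Old handling: the loop ends as if the reboot had succeeded.
old_result = run_looping_call(make_wait_for_reboot(vanished, True))

# Proposed handling: NotFound reaches the caller, which can react to it.
try:
    run_looping_call(make_wait_for_reboot(vanished, False))
    new_result = "done"
except NotFound:
    new_result = "error"

print(old_result, new_result)
```

With the old handling the caller cannot tell a vanished VM from a successful reboot; with the re-raise it gets an exception it can turn into an error state.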

There may be similar problems in _wait_for_boot(), _wait_for_running(), etc.

Dan Smith (danms) wrote on 2012-09-07:

I'm not sure that changing the exception that is raised is really the fix, but I think there is probably some state cleanup that needs to be done after the loopingcall if it fails.

Targeting for folsom-rc1 since this is a state corruption bug that has the potential to block normal users, requiring them to manually poke their database.

Changed in nova:
importance:	Undecided → Medium
milestone:	none → folsom-rc1
status:	New → Triaged

Vish Ishaya (vishvananda) on 2012-09-11

Changed in nova:
importance:	Medium → High
assignee:	nobody → Yun Mao (yunmao)
assignee:	Yun Mao (yunmao) → Vish Ishaya (vishvananda)

OpenStack Infra (hudson-openstack) on 2012-09-11

Changed in nova:
status:	Triaged → In Progress

Mark McLoughlin (markmc) wrote on 2012-09-11:

Review is https://review.openstack.org/12819

OpenStack Infra (hudson-openstack) wrote on 2012-09-18: Fix merged to nova (master)

Reviewed: https://review.openstack.org/12819
Committed: http://github.com/openstack/nova/commit/e6e5123cceb874a7ca6dcb16bc401f530439d07a
Submitter: Jenkins
Branch: master

commit e6e5123cceb874a7ca6dcb16bc401f530439d07a
Author: Vishvananda Ishaya <email address hidden>
Date: Tue Sep 11 12:09:38 2012 -0700

Allows waiting timers in libvirt to raise NotFound

    There are cases where an operation will fail when communicating with
    libvirt. We were eating the exception even though the operation
    failed, which has the potential to put the instance into an
    unrecoverable state.

    This patch allows NotFound exceptions to propagate up so that they
    are caught by the state handling code and the task state can be
    set to error.

Fixes bug 1002814

Change-Id: Iddc319b24aee0b7132155f50b9d3b0eee9bb3fa8
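
For readers unfamiliar with the "state handling code" the commit message refers to, the idea can be pictured roughly as below. This is a hypothetical sketch, not nova's actual code: reboot_with_state_handling, the dictionary-based instance, and the field names vm_state and task_state are assumptions made for illustration, and the real handler may re-raise rather than return a flag.

```python
class NotFound(Exception):
    """Stand-in for nova's exception.NotFound (assumption)."""


def reboot_with_state_handling(instance, driver_reboot):
    """Hypothetical wrapper around a driver reboot call.

    Clears the task state on success and marks the instance as error
    when the driver reports that the VM no longer exists.
    """
    instance["task_state"] = "rebooting"
    try:
        driver_reboot(instance)
    except NotFound:
        # The VM is gone: record the failure so the instance can be deleted.
        instance["vm_state"] = "error"
        instance["task_state"] = None
        return False
    instance["vm_state"] = "active"
    instance["task_state"] = None
    return True


def failing_reboot(instance):
    """Simulates a driver whose polling callback propagated NotFound."""
    raise NotFound("domain is gone")


instance = {"name": "vm-1", "vm_state": "active", "task_state": None}
ok = reboot_with_state_handling(instance, failing_reboot)
print(ok, instance["vm_state"], instance["task_state"])
```

The point of the sketch is only that the handler can act once the exception reaches it; if the polling callback swallows NotFound, this code path is never taken and the task state stays at "rebooting".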