Instance stuck in reboot on libvirt failure

Bug #1002814 reported by Mandar Vaze
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Vish Ishaya
Milestone: 2012.2

Bug Description

Suppose an instance reboot is in progress. At some point the libvirt driver is asked to reboot the actual VM. If the VM itself has disappeared by then, the instance stays stuck in the rebooting state forever.

(This was discovered accidentally when libvirtd was killed with "sudo kill -9" while a reboot was in progress. "virsh list" also did not list any instances.)

Refer to the following code snippet from nova/virt/libvirt/connection.py:

        def _wait_for_reboot():
            """Called at an interval until the VM is running again."""
            try:
                state = self.get_info(instance)['state']
            except exception.NotFound:
                LOG.error(_("During reboot, instance disappeared."),
                          instance=instance)
                raise utils.LoopingCallDone

            if state == power_state.RUNNING:
                LOG.info(_("Instance rebooted successfully."),
                         instance=instance)
                raise utils.LoopingCallDone

Here the exception.NotFound block should NOT raise "utils.LoopingCallDone", which indicates that the operation completed successfully and breaks the infinite loop in utils.LoopingCall. Because of this, nova-compute never learns that the VM has vanished.

Instead it should just "raise" or "raise exception.NotFound". This will (hopefully) propagate the exception so that nova-compute catches it and marks the instance as error.
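
The difference between the two behaviours can be sketched in isolation. This is a minimal illustration and not nova code: NotFound, LoopingCallDone, RUNNING and run_looping_call are simplified stand-ins for exception.NotFound, utils.LoopingCallDone, power_state.RUNNING and utils.LoopingCall, and make_wait_for_reboot and vanished are hypothetical helpers.

```python
class NotFound(Exception):
    """Stand-in for nova's exception.NotFound."""


class LoopingCallDone(Exception):
    """Stand-in for utils.LoopingCallDone: signals a normal end of polling."""


RUNNING = "running"  # stand-in for power_state.RUNNING


def run_looping_call(func, max_polls=10):
    """Simplified, synchronous stand-in for utils.LoopingCall.

    Polls func until it raises LoopingCallDone. Any other exception
    propagates to the caller.
    """
    for _ in range(max_polls):
        try:
            func()
        except LoopingCallDone:
            return
    raise RuntimeError("gave up polling")


def make_wait_for_reboot(get_state, swallow_not_found):
    """Build a poller; swallow_not_found selects the reported behaviour."""
    def _wait_for_reboot():
        try:
            state = get_state()
        except NotFound:
            if swallow_not_found:
                # Reported behaviour: the loop ends as if the reboot succeeded.
                raise LoopingCallDone
            # Proposed behaviour: let the caller see the failure.
            raise
        if state == RUNNING:
            raise LoopingCallDone
    return _wait_for_reboot


def vanished():
    """Hypothetical state lookup for a VM that no longer exists."""
    raise NotFound("instance disappeared")
```

With swallow_not_found=True, run_looping_call returns normally even though the VM is gone, so the caller has nothing to react to. With swallow_not_found=False, NotFound reaches the caller.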

Instances stuck in "rebooting" can't be deleted. Since the VM has already disappeared, marking it as Error (thus allowing delete) seems like the correct solution.

There may be similar problems in _wait_for_boot(), _wait_for_running() etc.

Dan Smith (danms) wrote :

I'm not sure that changing the exception that is raised is really the fix, but I think there is probably some state cleanup that needs to be done after the loopingcall if it fails.
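
One way to picture that kind of cleanup is the sketch below. It is only an illustration under stated assumptions, not the nova implementation: the instance is a plain dict, the "vm_state" and "task_state" keys and their values are chosen for the example, and reboot_with_cleanup and NotFound are hypothetical stand-ins.

```python
class NotFound(Exception):
    """Stand-in for nova's exception.NotFound."""


def reboot_with_cleanup(instance, do_reboot):
    """Run do_reboot(instance) and never leave the task state dangling.

    instance is a plain dict here; the 'vm_state' and 'task_state'
    keys are assumptions made for this illustration.
    Returns True if the reboot succeeded, False if the VM was not found.
    """
    instance["task_state"] = "rebooting"
    try:
        do_reboot(instance)
    except NotFound:
        # The VM is gone: record the failure so the instance can be deleted.
        instance["vm_state"] = "error"
        return False
    else:
        instance["vm_state"] = "active"
        return True
    finally:
        # Runs on every path, so the instance never stays in "rebooting".
        instance["task_state"] = None
```

The finally clause is what prevents the stuck state: whatever happens inside do_reboot, the task state is cleared before control returns to the caller.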

Targeting for folsom-rc1 since this is a state corruption bug that has the potential to block normal users, requiring them to manually poke their database.

Changed in nova:
importance: Undecided → Medium
milestone: none → folsom-rc1
status: New → Triaged
Changed in nova:
importance: Medium → High
assignee: nobody → Yun Mao (yunmao)
assignee: Yun Mao (yunmao) → Vish Ishaya (vishvananda)
Changed in nova:
status: Triaged → In Progress
Mark McLoughlin (markmc) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/12819
Committed: http://github.com/openstack/nova/commit/e6e5123cceb874a7ca6dcb16bc401f530439d07a
Submitter: Jenkins
Branch: master

commit e6e5123cceb874a7ca6dcb16bc401f530439d07a
Author: Vishvananda Ishaya <email address hidden>
Date: Tue Sep 11 12:09:38 2012 -0700

    Allows waiting timers in libvirt to raise NotFound

    There are cases where an operation will fail when communicating with
    libvirt. We were eating the exception even though the operation
    failed, which has the potential to put the instance into an
    unrecoverable state.

    This patch allows NotFound exceptions to propagate up so that they
    are caught by the state handling code and the task state can be
    set to error.

    Fixes bug 1002814

    Change-Id: Iddc319b24aee0b7132155f50b9d3b0eee9bb3fa8

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-rc1 → 2012.2