libvirt: Failure to recover from failed detach

Bug #1836212 reported by Matthew Booth
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Invalid
Undecided
Unassigned

Bug Description

1020162 ERROR root [req-46fbc6c8-de2c-4afb-9f24-9d75947c9a3c 9ccddbb72e2d42b6ab1a31ad48ea21fb 86bea4eb057b412a98402a1b7e1d9222 - - -] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 390, in _try_detach_device\n self.detach_device(conf, persistent=persistent, live=live)\n', ' File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 467, in detach_device\n self._domain.detachDeviceFlags(device_xml, flags=flags)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit\n result = proxy_call(self._autowrap, f, *args, **kwargs)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call\n rv = execute(f, *args, **kwargs)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute\n six.reraise(c, e, tb)\n', ' File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker\n rv = meth(*args, **kwargs)\n', ' File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1194, in detachDeviceFlags\n if ret == -1: raise libvirtError (\'virDomainDetachDeviceFlags() failed\', dom=self)\n', 'libvirtError: invalid argument: no target device vdb\n']

This appears to happen because when we call detach_device_with_retry(live=True) we ultimately call detachDeviceFlags(flags=VIR_DOMAIN_AFFECT_CONFIG | VIR_DOMAIN_AFFECT_LIVE). 'no target device' is the error generated when libvirt failed to remove the device from CONFIG (persistent). This can happen because detachDeviceFlags(flags=VIR_DOMAIN_AFFECT_CONFIG | VIR_DOMAIN_AFFECT_LIVE) will succeed and remove the device from the CONFIG domain as long as the LIVE domain removal was queued, even though this is an asynchronous operation. Consequently, a subsequent check for the device may return the device because it hasn't yet been (and may never be) removed from the LIVE domain, but it has been removed from the CONFIG domain. This will prevent libvirt from attempting to remove the device from the LIVE domain, and so the detach will never succeed.

Revision history for this message
Matthew Booth (mbooth-9) wrote :
tags: added: libvirt
Revision history for this message
Matthew Booth (mbooth-9) wrote :

The above error was addressed in change I09230fc47b0950aa5a3db839a070613c9c817576 . However, the above error was generated by a nova compute containing this patch, suggesting the fix was incomplete.

The fix does contain a race which could still result in the above error being raised, though: in _do_wait_and_retry_detach get_device_conf_func() might return the present device, which is then deleted before the subsequent call to detachDeviceFlags. I've written a failing unit test to demonstrate this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/671023

Revision history for this message
melanie witt (melwitt) wrote :

To summarize my private comment from the BZ linked earlier, I think the logged traceback is a red herring and that there is not actually a bug in the detach_device_with_retry code.

In our code, we can see that any detach attempt that results in a "not found" exception is caught and DeviceNotFound is raised instead (which is then ignored by the driver and compute manager code):

https://github.com/openstack/nova/blob/15e082a/nova/virt/libvirt/guest.py#L405-L422

So, in the log excerpt in the bug report, we are looking at a case where we *did* correctly handle a "not found" condition and raised DeviceNotFound. We are seeing the traceback because the oslo_utils excutils.save_and_reraise_exception context manager logs the original exception when the original exception is *not* re-raised and instead a new exception is raised:

https://github.com/openstack/oslo.utils/blob/1b8bafb391b82665b99b3da8205402ec2cd495d1/oslo_utils/excutils.py#L211-L217

So that traceback is not at all related to the failure described in the BZ.

An improvement I think we could do is to suppress the logging of the dropped exception traceback by oslo_utils, to prevent being misled in the future. We could use a technique like this:

https://github.com/openstack/nova/blob/15e082a2a87c87d2d30d13936bbda06489b4edcc/nova/compute/manager.py#L8459-L8473

to conditionally tell the context manager whether to re-raise the original exception. That way it will not log a traceback when we decide to raise a new exception.

Revision history for this message
melanie witt (melwitt) wrote :

Here's a neutron bug where they were seeing similar "Original exception being dropped" messages in their logs in cases where the excutils.save_and_reraise_exception context manager was not being allowed to re-raise its original exception (because a new exception was being raised instead):

https://bugs.launchpad.net/neutron/+bug/1669244

Revision history for this message
Matthew Booth (mbooth-9) wrote :

Yep. The actual error thrown was "Unable to detach from guest transient domain.", which is now "Unable to detach the device from the live config." in master. That RetryDecorator makes this function a whole lot harder to read, but with your explanation it seems that the detach was actually timing out, which is consistent with the underlying problem we eventually discovered.

Thanks! I'll close this out.

Changed in nova:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matthew Booth (<email address hidden>) on branch: master
Review: https://review.opendev.org/671023
Reason: As discussed in the bug, this error is already correctly handled elsewhere.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/671640

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/671640
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=738774b942dab21d8c03c70b4e74b6a487515ff5
Submitter: Zuul
Branch: master

commit 738774b942dab21d8c03c70b4e74b6a487515ff5
Author: melanie witt <email address hidden>
Date: Thu Jul 18 23:59:16 2019 +0000

    Avoid logging traceback when detach device not found

    We use the oslo.utils save_and_reraise_exception context manager in our
    detach device code and catch specific exceptions that mean 'not found'
    and raise DeviceNotFound instead. When we do that, the
    save_and_reraise_exception context manager logs an ERROR traceback of
    the original exception, for informational purposes. This is misleading
    when trying to debug other issues, as it makes it look like the caught
    exception caused a problem.

    This passes the reraise=False keyword arg to the context manager and
    sets the 'reraise' attribute to True only if we are not going to raise
    a different exception.

    Related-Bug: #1836212

    Change-Id: Icce1e31fe3ebcbf9e4897bbfa57b7f3d1fba67a3

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/672833

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/672833
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d83342226470e1957525978d7bec7d98db9bb695
Submitter: Zuul
Branch: stable/stein

commit d83342226470e1957525978d7bec7d98db9bb695
Author: melanie witt <email address hidden>
Date: Thu Jul 18 23:59:16 2019 +0000

    Avoid logging traceback when detach device not found

    We use the oslo.utils save_and_reraise_exception context manager in our
    detach device code and catch specific exceptions that mean 'not found'
    and raise DeviceNotFound instead. When we do that, the
    save_and_reraise_exception context manager logs an ERROR traceback of
    the original exception, for informational purposes. This is misleading
    when trying to debug other issues, as it makes it look like the caught
    exception caused a problem.

    This passes the reraise=False keyword arg to the context manager and
    sets the 'reraise' attribute to True only if we are not going to raise
    a different exception.

    Related-Bug: #1836212

    Change-Id: Icce1e31fe3ebcbf9e4897bbfa57b7f3d1fba67a3
    (cherry picked from commit 738774b942dab21d8c03c70b4e74b6a487515ff5)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/674068

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/674068
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3a1be7afd0f11a2123c59b0ddea63f08bb4436ce
Submitter: Zuul
Branch: stable/rocky

commit 3a1be7afd0f11a2123c59b0ddea63f08bb4436ce
Author: melanie witt <email address hidden>
Date: Thu Jul 18 23:59:16 2019 +0000

    Avoid logging traceback when detach device not found

    We use the oslo.utils save_and_reraise_exception context manager in our
    detach device code and catch specific exceptions that mean 'not found'
    and raise DeviceNotFound instead. When we do that, the
    save_and_reraise_exception context manager logs an ERROR traceback of
    the original exception, for informational purposes. This is misleading
    when trying to debug other issues, as it makes it look like the caught
    exception caused a problem.

    This passes the reraise=False keyword arg to the context manager and
    sets the 'reraise' attribute to True only if we are not going to raise
    a different exception.

    Related-Bug: #1836212

    Change-Id: Icce1e31fe3ebcbf9e4897bbfa57b7f3d1fba67a3
    (cherry picked from commit 738774b942dab21d8c03c70b4e74b6a487515ff5)
    (cherry picked from commit d83342226470e1957525978d7bec7d98db9bb695)

tags: added: in-stable-rocky
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.