Rescue fails with 'Failed to terminate process: Device or resource busy' in the n-cpu log

Bug #1353939 reported by Attila Fazekas
This bug affects 17 people
Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released  High        Matt Riedemann
  Juno                    Won't Fix     Undecided   Unassigned
  Kilo                    Fix Released  Undecided   Unassigned
Ubuntu Cloud Archive      Invalid       Undecided   Unassigned
  Juno                    Fix Released  Undecided   Unassigned
nova (Ubuntu)             Invalid       Undecided   Unassigned

Bug Description

[Impact]

 * Users may sometimes fail to shut down an instance if the associated qemu
   process is in uninterruptible sleep (typically waiting on IO).

[Test Case]

 * 1. create some IO load in a VM
   2. look at the associated qemu process and confirm it shows state D
      (uninterruptible sleep) in the ps output (see the sketch after this
      list)
   3. shut down the instance
   4. with the patch in place, nova will retry the libvirt call to shut down
      the instance up to 3 times, waiting for the signal to be delivered to
      the qemu process.
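
The state check in step 2 can also be scripted. Below is a minimal sketch (not part of nova) that reads /proc/<pid>/stat to report whether a given qemu PID is in uninterruptible sleep; the helper and its command-line usage are illustrative only.

import sys

def process_state(pid):
    """Return the one-letter state code ('D', 'R', 'S', ...) for a PID."""
    with open('/proc/%d/stat' % pid) as f:
        data = f.read()
    # /proc/<pid>/stat is "pid (comm) state ..."; the comm field may contain
    # spaces, so split after the closing parenthesis to reach the state field.
    return data.rsplit(')', 1)[1].split()[0]

if __name__ == '__main__':
    pid = int(sys.argv[1])
    state = process_state(pid)
    print('qemu pid %d is in state %s%s' % (
        pid, state, ' (uninterruptible sleep)' if state == 'D' else ''))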

[Regression Potential]

 * None

message: "Failed to terminate process" AND message:'InstanceNotRescuable' AND message: 'Exception during message handling' AND tags:"screen-n-cpu.txt"

The above logstash query reports back only the failed jobs; the 'Failed to terminate process' error also causes other rescue tests to fail,
but tempest does not always report them as an error at the end.

message: "Failed to terminate process" AND tags:"screen-n-cpu.txt"

Usual console log:
Details: (ServerRescueTestJSON:test_rescue_unrescue_instance) Server 0573094d-53da-40a5-948a-747d181462f5 failed to reach RESCUE status and task state "None" within the required time (196 s). Current status: SHUTOFF. Current task state: None.

http://logs.openstack.org/82/107982/2/gate/gate-tempest-dsvm-postgres-full/90726cb/console.html#_2014-08-07_03_50_26_520

Usual n-cpu exception:
http://logs.openstack.org/82/107982/2/gate/gate-tempest-dsvm-postgres-full/90726cb/logs/screen-n-cpu.txt.gz#_2014-08-07_03_32_02_855

2014-08-07 03:32:02.855 ERROR oslo.messaging.rpc.dispatcher [req-39ce7a3d-5ceb-41f5-8f9f-face7e608bd1 ServerRescueTestJSON-2035684545 ServerRescueTestJSON-1017508309] Exception during message handling: Instance 0573094d-53da-40a5-948a-747d181462f5 cannot be rescued: Driver Error: Failed to terminate process 26425 with SIGKILL: Device or resource busy
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 408, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 88, in wrapped
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher payload)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/exception.py", line 71, in wrapped
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 292, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher pass
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 278, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 342, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 320, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 308, in decorated_function
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/compute/manager.py", line 3094, in rescue_instance
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher reason=_("Driver Error: %s") % unicode(e))
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher InstanceNotRescuable: Instance 0573094d-53da-40a5-948a-747d181462f5 cannot be rescued: Driver Error: Failed to terminate process 26425 with SIGKILL: Device or resource busy
2014-08-07 03:32:02.855 22829 TRACE oslo.messaging.rpc.dispatcher

Related branches

Tracy Jones (tjones-i)
tags: added: compute
Revision history for this message
Matt Riedemann (mriedem) wrote :
summary: - Rescue failes with 'Failed to terminate process' in the n-cpu log
+ Rescue fails with 'Failed to terminate process: Device or resource busy'
+ in the n-cpu log
Revision history for this message
Matt Riedemann (mriedem) wrote :

"Error from libvirt during destroy. Code=38 Error=Failed to terminate process 26425 with SIGKILL: Device or resource busy" shows up in a lot of other test failures, not just the rescue tests, but doesn't always result in a failure.

From the logs here:

http://logs.openstack.org/82/107982/2/gate/gate-tempest-dsvm-postgres-full/90726cb/logs/screen-n-cpu.txt.gz?

It looks like there are other snapshot operations taking place at the same time for other concurrently running tests.

Joe Gordon (jogo)
tags: added: testing
Revision history for this message
Joe Gordon (jogo) wrote :

query: message:"SIGKILL\: Device or resource busy" AND tags:"screen-n-cpu.txt"

It has a lot of hits, but some are on successful runs.

tags: added: libvirt
Changed in nova:
status: New → Confirmed
Revision history for this message
Sean Dague (sdague) wrote :

Seen in gate, though only 3 times in 10 days, so marking as high.

Changed in nova:
importance: Undecided → High
Revision history for this message
Daniel Berrange (berrange) wrote :

More useful is the stack trace of the original error, rather than the RPC error

2014-08-07 03:31:46.400 ERROR nova.virt.libvirt.driver [req-39ce7a3d-5ceb-41f5-8f9f-face7e608bd1 ServerRescueTestJSON-2035684545 ServerRescueTestJSON-1017508309] [instance: 0573094d-53da-40a5-948a-747d181462f5] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 26425 with SIGKILL: Device or resource busy
2014-08-07 03:31:46.400 ERROR nova.compute.manager [req-39ce7a3d-5ceb-41f5-8f9f-face7e608bd1 ServerRescueTestJSON-2035684545 ServerRescueTestJSON-1017508309] [instance: 0573094d-53da-40a5-948a-747d181462f5] Error trying to Rescue Instance
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] Traceback (most recent call last):
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] File "/opt/stack/new/nova/nova/compute/manager.py", line 3088, in rescue_instance
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] rescue_image_meta, admin_password)
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2279, in rescue
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] self._destroy(instance)
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 936, in _destroy
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] instance=instance)
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] File "/opt/stack/new/nova/nova/openstack/common/excutils.py", line 82, in __exit__
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] six.reraise(self.type_, self.value, self.tb)
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 908, in _destroy
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] virt_dom.destroy()
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 179, in doit
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] result = proxy_call(self._autowrap, f, *args, **kwargs)
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 139, in proxy_call
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-747d181462f5] rv = execute(f,*args,**kwargs)
2014-08-07 03:31:46.400 22829 TRACE nova.compute.manager [instance: 0573094d-53da-40a5-948a-74...

Revision history for this message
Daniel Berrange (berrange) wrote :

One possible option would be to simply retry the libvirt destroy() call a few times when we see the exception with err.code = VIR_ERR_SYSTEM_ERROR and err.int1 == EBUSY
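
A rough sketch of that approach (for illustration only), assuming a libvirt virDomain handle (virt_dom) is already available; the retry count and sleep interval here are illustrative, not nova's exact values:

import errno
import time

import libvirt


def destroy_with_retry(virt_dom, attempts=3, interval=1):
    """Retry virDomainDestroy() when libvirt reports EBUSY on SIGKILL."""
    for attempt in range(attempts):
        try:
            virt_dom.destroy()
            return
        except libvirt.libvirtError as e:
            ebusy = (e.get_error_code() == libvirt.VIR_ERR_SYSTEM_ERROR and
                     e.get_int1() == errno.EBUSY)
            if not ebusy or attempt == attempts - 1:
                raise
            # Give the kernel time to deliver SIGKILL to the qemu process
            # (it may be stuck in uninterruptible IO), then try again.
            time.sleep(interval)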

Revision history for this message
Matt Thompson (mattt416) wrote :

We appear to be hitting this bug in our environment when running stable/juno by doing the following:

- boot instance w/ volume (nova boot --block-device source=image,id=<image id>,dest=volume,size=10,shutdown=remove,bootindex=0 ...)
- while the instance is active and up, rebuild it (nova rebuild ...)

The latter, when issued w/ --poll, returns:

Error rebuilding server
ERROR (InstanceInErrorState): Failed to terminate process 10947 with SIGKILL: Device or resource busy

Interestingly, I tried the above w/ an environment deployed w/ stable/icehouse and was not able to replicate the issue.

I'm happy to provide any further information if required!

Thanks,

--Matt

Revision history for this message
Attila Fazekas (afazekas) wrote :

If "Setting instance vm_state to ERROR" appears in the log http://logs.openstack.org/41/145041/2/gate/gate-tempest-dsvm-neutron-full/b43d7c6/logs/screen-n-cpu.txt.gz#_2015-01-15_17_39_14_119, the issue should be considered triggered even if the gate job SUCCEEDS.

The end user would see the VM in ERROR state.

Revision history for this message
Attila Fazekas (afazekas) wrote :

The issue has a bigger impact with nova-network!

+--------------------------------------+-------+--------+------------+-------------+------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------+--------+------------+-------------+------------------+
| 0c854e99-7ff2-4aeb-9b7b-5e32f42b500b | test | ACTIVE | - | Running | private=10.1.0.2 |
| 3a4c4f39-8283-4b69-9fd4-0a93a4984f40 | test | ERROR | - | Running | private=10.1.0.2 |
| 4b143b19-3175-4313-8d3a-a1eb8520b70d | test3 | ACTIVE | - | Running | private=10.1.0.3 |
| bc41e35e-76ad-4e43-9903-577f909ff1e2 | test3 | ERROR | - | Running | private=10.1.0.3 |
+--------------------------------------+-------+--------+------------+-------------+------------------+

If an exception is raised at this location, the fixed IP is deallocated, but the security group rules/iptables remain in place.
A new instance can then be created with the same fixed IP, and it will use the previous (ERRORed) instance's filtering rules
instead of its own.

Chain nova-compute-local (1 references)
target prot opt source destination
nova-compute-inst-36 all -- anywhere 10.1.0.2
nova-compute-inst-37 all -- anywhere 10.1.0.3
nova-compute-inst-38 all -- anywhere 10.1.0.3

If your environment is not able to reproduce this issue normally, you can trigger it artificially:

diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 5aea859..c037542 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -613,6 +613,9 @@ class LibvirtDriver(driver.ComputeDriver):
         if virt_dom is not None:
             try:
                 old_domid = virt_dom.ID()
+                err = libvirt.libvirtError("Trigger")
+                err.err = [libvirt.VIR_FROM_STREAMS,]
+                raise err
                 virt_dom.destroy()

             except libvirt.libvirtError as e:

After the patch is applied, restart n-cpu and:
$ nova boot test3 --image cirros-0.3.2-x86_64-uec --flavor 42
$ nova delete test3 # goes to ERROR state
$ nova boot test3 --image cirros-0.3.2-x86_64-uec --flavor 42

The tempest 'tox -eall test_rescued_vm_detach_volume' test can lead to the Code=38 / VIR_FROM_STREAMS exceptions:
http://logs.openstack.org/87/139287/14/check/check-tempest-dsvm-f21/5f3d210/logs/screen-n-cpu.txt.gz#_2015-01-16_03_27_05_790

Revision history for this message
Pavel Boldin (pboldin) wrote :

Could you please provide 'dmesg' output after the crash?

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

I have the same issue with OpenStack Juno, Ubuntu 12.04 (3.13.0-40-generic, kvm 2.0.0+dfsg-2ubuntu1.9). I was running a long stress test, booting an instance backed by a volume and then deleting the instance, in a 'while true' loop.

After a while I hit this issue.

nova-compute.log
-----------------------------------
2015-04-09 14:58:20.987 8014 TRACE [..] File "/usr/lib/python2.7/dist-packages/libvirt.py", line 918, in destroy
2015-04-09 14:58:20.987 8014 TRACE [..] if ret == -1: raise libvirtError ('virDomainDestroy() failed', dom=self)
2015-04-09 14:58:20.987 8014 TRACE [..] libvirtError: Failed to terminate process 28634 with SIGKILL: Device or resource busy

libvirtd.log
-------------------------------------
2015-04-09 14:58:19.815: 27669: error : virProcessKillPainfully:320 : Failed to terminate process 28634 with SIGKILL: Device or resource busy
2015-04-09 14:58:37.864: 27673: error : virProcessGetAffinity:433 : cannot get CPU affinity of process 28639: No such process
2015-04-09 14:58:41.433: 27662: error : qemuMonitorIO:656 : internal error: End of file from monitor

dmesg
--------------------------------------
<6>Apr 9 14:57:19 node-16 kernel: [612547.695013] kvm: zapping shadow pages for mmio generation wraparound
<6>Apr 9 14:58:41 node-16 kernel: [612630.160522] qbrb4c910ab-72: port 2(tapb4c910ab-72) entered disabled state
<6>Apr 9 14:58:41 node-16 kernel: [612630.160736] device tapb4c910ab-72 left promiscuous mode
<6>Apr 9 14:58:41 node-16 kernel: [612630.160758] qbrb4c910ab-72: port 2(tapb4c910ab-72) entered disabled state

Revision history for this message
Matt Riedemann (mriedem) wrote :

I see this on instance create sometimes in our internal CI running in nested virt (qemu) on RHEL 7.1.

This is the error from the nova compute logs:

2015-05-09 17:09:35.298 4449 ERROR nova.compute.manager [req-f3ae72f5-6139-4031-b527-e506e321f4ab - - - - -] [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] Instance failed to spawn
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] Traceback (most recent call last):
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2501, in _build_resources
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] yield resources
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2365, in _build_and_run_instance
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] block_device_info=block_device_info)
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2354, in spawn
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] block_device_info=block_device_info)
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4391, in _create_domain_and_network
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] power_on=power_on)
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4322, in _create_domain
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] LOG.error(err)
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] six.reraise(self.type_, self.value, self.tb)
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4312, in _create_domain
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] domain.createWithFlags(launch_flags)
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 183, in doit
2015-05-09 17:09:35.298 4449 TRACE nova.compute.manager [instance: 6e8122a8-0a18-408c-8518-5ed2f7470def] result = proxy_call(self._autowrap, f...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/181781

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
tags: added: kilo-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/181781
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3907867601d1044eaadebff68a590d176abff6cf
Submitter: Jenkins
Branch: master

commit 3907867601d1044eaadebff68a590d176abff6cf
Author: Matt Riedemann <email address hidden>
Date: Sun May 10 18:46:37 2015 -0700

    libvirt: handle code=38 + sigkill (ebusy) in destroy()

    Handle the libvirt error during destroy when the sigkill fails due to an
    EBUSY. This is taken from a comment by danpb in the bug report as a
    potential workaround.

    Co-authored-by: Daniel Berrange (<email address hidden>)

    Closes-Bug: #1353939

    Change-Id: I128bf6b939fbbc85df521fd3fe23c3c6f93b1b2c

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Matt Riedemann (mriedem) wrote :

Conversation in IRC about how ceilometer's compute-agent pollster is hitting libvirt directly, and how that might be contributing to the EBUSY here:

(9:09:07 AM) sdague: there might be some additional libvirt bits not exposed, but we've actually told people that you shouldn't be talking to libvirt directly on a nova worker
(9:09:46 AM) zz_dimtruck is now known as dimtruck
(9:10:00 AM) cdent: we've been over this before. the ceilometer-compute-agent breaks all the rules and has done for a long time and for the time being it needs to until you and I find the free time to get nova to do the notifications that we've discussed wanting to do
(9:10:59 AM) openstackgerrit: Sergey Skripnick proposed openstack-infra/project-config: Remove double curly brackets from cue builder https://review.openstack.org/192693
(9:11:16 AM) mriedem: fungi: we're looking to do the final icehouse release, tag icehouse-eol and drop the stable/icehouse branch - i see with the chef cookbook rename you have several changes to update .gitreview in those repos on stable/icehouse: https://review.openstack.org/#/q/status:open+branch:stable/icehouse,n,z
(9:11:27 AM) mriedem: fungi: do we care about those? i'm also asking in #openstack-chef
(9:11:58 AM) cdent: sdague: those items listed here as pollsters are some of those that are the result of the libvirt (and other VIRT_DRIVER) -based polling: http://docs.openstack.org/admin-guide-cloud/content/section_telemetry-compute-meters.html
(9:12:50 AM) cdent: I agree that this is not how things should be, but they are what they are right now.
(9:13:04 AM) cdent: So: a working devstack for ceilometer at least _wants_ libvirt
(9:13:59 AM) cdent: What I'd like to do is make it acquire that in the least ugly and most flexible way. As far as I can tell from this discussion that means insuring that libvirt package is present before asking to install libvirt-python?
(9:14:56 AM) fungi: mriedem: no, i pushed them with a script for all branches of all repos we renamed. feel free to abandon the ones you don't want and approve the ones you do want
(9:15:06 AM) mriedem: fungi: ok, thanks
(9:16:34 AM) sdague: cdent: so #3 is at least one path. I didn't realize this was polling libvirt directly, and I kind of wonder if that might be exasserbating https://bugs.launchpad.net/nova/+bug/1353939
(9:16:36 AM) openstack: Launchpad bug 1353939 in OpenStack Compute (nova) "Rescue fails with 'Failed to terminate process: Device or resource busy' in the n-cpu log" [High,Fix committed] - Assigned to Matt Riedemann (mriedem)
(9:17:26 AM) sdague: because one of the reasons we tell people not to talk to libvirt, is it's stability level seems to go down with multiple concurrent clients.
(9:18:13 AM) mriedem: sdague: might want to capture that in the bug report b/c we pushed up a change to try and help that bug but it didn't really, at least with libvirt 1.2.2
(9:18:20 AM) openstackgerrit: Merged openstack-infra/project-config: Move akanda functional tests out of experimental https://review.openstack.org/190745
(9:18:21 AM) mriedem: and danpb didn't see any reason why
(9:18:31 AM) cdent: Good questions sdague. ceilo has been doing this kind of polling for well over a...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/193497

Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/193497
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc6af6bf861b510834122aa75750fd784578e197
Submitter: Jenkins
Branch: stable/kilo

commit dc6af6bf861b510834122aa75750fd784578e197
Author: Matt Riedemann <email address hidden>
Date: Sun May 10 18:46:37 2015 -0700

    libvirt: handle code=38 + sigkill (ebusy) in destroy()

    Handle the libvirt error during destroy when the sigkill fails due to an
    EBUSY. This is taken from a comment by danpb in the bug report as a
    potential workaround.

    Co-authored-by: Daniel Berrange (<email address hidden>)

    Closes-Bug: #1353939

    Conflicts:
        nova/tests/unit/virt/libvirt/test_driver.py

        NOTE (kashyapc): 'stable/kilo' branch doesn't have the
        'libvirt_guest' object, so, adjust the below unit tests accordingly:

            test_private_destroy_ebusy_timeout
            test_private_destroy_ebusy_multiple_attempt_ok

    Change-Id: I128bf6b939fbbc85df521fd3fe23c3c6f93b1b2c
    (cherry picked from commit 3907867601d1044eaadebff68a590d176abff6cf)

tags: added: in-stable-kilo
Liang Chen (cbjchen)
description: updated
tags: added: patch
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/221529

Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-1 → 12.0.0
Changed in nova (Ubuntu):
status: New → Invalid
Changed in cloud-archive:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/juno)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/221529
Reason: Juno is EOL soon.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

This fix was made available in 1:2014.2.4-0ubuntu1~cloud4 of nova in the Ubuntu Cloud Archive for Juno.

Revision history for this message
Jeffrey Guan (double12gzh) wrote :

I also hit this issue:

Error: Resource DELETE failed: Error: Resource DELETE failed: Error: Server vm001 delete failed: (500) Failed to terminate process 17320 with SIGKILL: Device or resource busy

Revision history for this message
Jeffrey Guan (double12gzh) wrote :

Delete the in-progress VM on the dashboard, and then it works.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/639091

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/639091
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=10d50ca4e210039aeae84cb9bd5d18895948af54
Submitter: Zuul
Branch: master

commit 10d50ca4e210039aeae84cb9bd5d18895948af54
Author: Kashyap Chamarthy <email address hidden>
Date: Mon Feb 25 13:26:24 2019 +0100

    libvirt: Rework 'EBUSY' (SIGKILL) error handling code path

    Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
    _destroy()) handled the case where a QEMU process "refuses to die" within
    a given timeout period set by libvirt.

    Originally, libvirt sent SIGTERM (allowing the process to clean-up
    resources), then waited 10 seconds, if the guest didn't go away. Then
    it sent, the more lethal, SIGKILL and waited another 5 seconds for it to
    take effect.

    From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
    for a guest hard shutdown to complete. It now waits for 30 seconds for
    SIGKILL to work (instead of 5). Also, additional wait time is added if
    there are assigned PCI devices, as some of those tend to slow things
    down.

    In this change:

      - Increment the counter to retry the _destroy() call from 3 to 6, thus
        increasing the total time from 15 to 30 seconds, before SIGKILL
        takes effect. And it matches the (more graceful) behaviour of
        libvirt v4.7.0. This also gives breathing room for Nova instances
        running in environments with large compute nodes with high instance
        creation or delete churn, where the current timout may not be
        sufficient.

      - Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
        than 4.7.0.

    [1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
        (process: wait longer 5->30s on hard shutdown)
    [2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
        wait longer on kill per assigned Hostdev")

    Related-bug: #1353939

    Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
    Signed-off-by: Kashyap Chamarthy <email address hidden>
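
For illustration only, a hedged sketch of the version-gated retry described in the commit above; the function and constant names are hypothetical, not nova's actual code:

import libvirt

# 4.7.0 encoded the way libvirt encodes versions (major*1000000 + minor*1000).
LIBVIRT_LONG_SIGKILL_WAIT = 4007000


def destroy_retry_count(conn):
    """Pick how many destroy() attempts to make when SIGKILL returns EBUSY.

    libvirt >= 4.7.0 already waits up to 30 seconds for SIGKILL to take
    effect, so a single attempt is enough; older libvirt only waits about
    5 seconds, so retry up to 6 times for roughly the same total wait.
    """
    if conn.getLibVersion() >= LIBVIRT_LONG_SIGKILL_WAIT:
        return 1
    return 6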

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/667389

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/667436

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/667389
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=118cf0c5920b5ff41333fd304a934dd7e5f4e1a8
Submitter: Zuul
Branch: stable/stein

commit 118cf0c5920b5ff41333fd304a934dd7e5f4e1a8
Author: Kashyap Chamarthy <email address hidden>
Date: Mon Feb 25 13:26:24 2019 +0100

    libvirt: Rework 'EBUSY' (SIGKILL) error handling code path

    Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
    _destroy()) handled the case where a QEMU process "refuses to die" within
    a given timeout period set by libvirt.

    Originally, libvirt sent SIGTERM (allowing the process to clean-up
    resources), then waited 10 seconds, if the guest didn't go away. Then
    it sent, the more lethal, SIGKILL and waited another 5 seconds for it to
    take effect.

    From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
    for a guest hard shutdown to complete. It now waits for 30 seconds for
    SIGKILL to work (instead of 5). Also, additional wait time is added if
    there are assigned PCI devices, as some of those tend to slow things
    down.

    In this change:

      - Increment the counter to retry the _destroy() call from 3 to 6, thus
        increasing the total time from 15 to 30 seconds, before SIGKILL
        takes effect. And it matches the (more graceful) behaviour of
        libvirt v4.7.0. This also gives breathing room for Nova instances
        running in environments with large compute nodes with high instance
        creation or delete churn, where the current timout may not be
        sufficient.

      - Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
        than 4.7.0.

    [1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
        (process: wait longer 5->30s on hard shutdown)
    [2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
        wait longer on kill per assigned Hostdev")

    Related-bug: #1353939

    Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
    Signed-off-by: Kashyap Chamarthy <email address hidden>
    (cherry picked from commit 10d50ca4e210039aeae84cb9bd5d18895948af54)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/668111

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/667436
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=75985e25bc147369efb90d4fa9f046631766c14c
Submitter: Zuul
Branch: stable/rocky

commit 75985e25bc147369efb90d4fa9f046631766c14c
Author: Kashyap Chamarthy <email address hidden>
Date: Mon Feb 25 13:26:24 2019 +0100

    libvirt: Rework 'EBUSY' (SIGKILL) error handling code path

    Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
    _destroy()) handled the case where a QEMU process "refuses to die" within
    a given timeout period set by libvirt.

    Originally, libvirt sent SIGTERM (allowing the process to clean-up
    resources), then waited 10 seconds, if the guest didn't go away. Then
    it sent, the more lethal, SIGKILL and waited another 5 seconds for it to
    take effect.

    From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
    for a guest hard shutdown to complete. It now waits for 30 seconds for
    SIGKILL to work (instead of 5). Also, additional wait time is added if
    there are assigned PCI devices, as some of those tend to slow things
    down.

    In this change:

      - Increment the counter to retry the _destroy() call from 3 to 6, thus
        increasing the total time from 15 to 30 seconds, before SIGKILL
        takes effect. And it matches the (more graceful) behaviour of
        libvirt v4.7.0. This also gives breathing room for Nova instances
        running in environments with large compute nodes with high instance
        creation or delete churn, where the current timout may not be
        sufficient.

      - Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
        than 4.7.0.

    [1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
        (process: wait longer 5->30s on hard shutdown)
    [2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
        wait longer on kill per assigned Hostdev")

    Conflicts:
        nova/virt/libvirt/driver.py

        (Trivial conflict: Rocky didn't have the QEMU-native TLS feature
        yet.)

    Related-bug: #1353939

    Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
    Signed-off-by: Kashyap Chamarthy <email address hidden>
    (cherry picked from commit 10d50ca4e210039aeae84cb9bd5d18895948af54)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/668111
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=16b0fb01afac7b81094aa891588f4bd9017ee235
Submitter: Zuul
Branch: stable/queens

commit 16b0fb01afac7b81094aa891588f4bd9017ee235
Author: Kashyap Chamarthy <email address hidden>
Date: Mon Feb 25 13:26:24 2019 +0100

    libvirt: Rework 'EBUSY' (SIGKILL) error handling code path

    Change ID I128bf6b939 (libvirt: handle code=38 + sigkill (ebusy) in
    _destroy()) handled the case where a QEMU process "refuses to die" within
    a given timeout period set by libvirt.

    Originally, libvirt sent SIGTERM (allowing the process to clean-up
    resources), then waited 10 seconds, if the guest didn't go away. Then
    it sent, the more lethal, SIGKILL and waited another 5 seconds for it to
    take effect.

    From libvirt v4.7.0 onwards, libvirt increased[1][2] the time it waits
    for a guest hard shutdown to complete. It now waits for 30 seconds for
    SIGKILL to work (instead of 5). Also, additional wait time is added if
    there are assigned PCI devices, as some of those tend to slow things
    down.

    In this change:

      - Increment the counter to retry the _destroy() call from 3 to 6, thus
        increasing the total time from 15 to 30 seconds, before SIGKILL
        takes effect. And it matches the (more graceful) behaviour of
        libvirt v4.7.0. This also gives breathing room for Nova instances
        running in environments with large compute nodes with high instance
        creation or delete churn, where the current timout may not be
        sufficient.

      - Retry the _destroy() API call _only_ if MIN_LIBVIRT_VERSION is lower
        than 4.7.0.

    [1] https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=9a4e4b9
        (process: wait longer 5->30s on hard shutdown)
    [2] https://libvirt.org/git/?p=libvirt.git;a=commit;h=be2ca04 ("process:
        wait longer on kill per assigned Hostdev")

    Conflicts:
        nova/virt/libvirt/driver.py

        (Trivial conflict: Rocky didn't have the QEMU-native TLS feature
        yet.)

    Conflicts (stable/queens):
        nova/tests/unit/virt/libvirt/test_driver.py

    Related-bug: #1353939

    Change-Id: If2035cac931c42c440d61ba97ebc7e9e92141a28
    Signed-off-by: Kashyap Chamarthy <email address hidden>
    (cherry picked from commit 10d50ca4e210039aeae84cb9bd5d18895948af54)
    (cherry picked from commit 75985e25bc147369efb90d4fa9f046631766c14c)

tags: added: in-stable-queens