Race in resource tracker causes 500 response on deleting during verify_resize state

Bug #1371677 reported by Matthew Treinish
This bug affects 5 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Critical
Assigned to: Dan Smith

Bug Description

During a tempest run, the
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_verify_resize_state
test occasionally fails when it attempts to delete a server in the verify_resize state. The failure is caused by a 500 response being returned from nova. Looking at the nova-api log, this is caused by an RPC call never receiving a response:

http://logs.openstack.org/10/110110/40/check/check-tempest-dsvm-postgres-full/4cd8a81/logs/screen-n-api.txt.gz#_2014-09-19_10_24_07_221

Looking at the n-cpu log for the handling of that RPC call yields:

http://logs.openstack.org/10/110110/40/check/check-tempest-dsvm-postgres-full/4cd8a81/logs/screen-n-cpu.txt.gz#_2014-09-19_10_24_31_404

This looks like it is coming from the resource tracker update triggered by the server deletion. However, according to the tempest log, the volume involved in that failure belongs to a different test class, ServerRescueNegativeTestJSON. It appears the tearDownClass for that class runs concurrently with the failing test, causing a race in the resource tracker: the volume it expects to be there disappears, so when it goes to get the volume's size it fails.

Full logs for an example run that tripped this are here:
http://logs.openstack.org/10/110110/40/check/check-tempest-dsvm-postgres-full/4cd8a81

Revision history for this message
Matt Riedemann (mriedem) wrote :

Yeah I was just looking at this coincidentally.

Changed in nova:
status: New → Confirmed
tags: added: compute
Changed in nova:
importance: Undecided → High
tags: added: volumes
Revision history for this message
Matt Riedemann (mriedem) wrote :

So the volume is deleted within milliseconds of when the update_available_resource periodic task runs. We run that task every minute. We hit the failure and dump a stack trace in the logs.

As a partial bug fix we could catch the ProcessExecutionError, check the stderr for "No such device or address", and just log that as an info message. That would reduce the thousands of traces in normal gate runs for what is an expected teardown issue.
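
A minimal sketch of that partial fix, for illustration only: the wrapper and exception names below are invented (the patch that eventually landed raises VolumeBDMPathNotFound instead), and the processutils import is the juno-era module path shown in the logs above.

    import functools
    import logging

    from nova.openstack.common import processutils

    LOG = logging.getLogger(__name__)


    class VolumeDeviceMissing(Exception):
        """Stand-in for the exception the real patch raises."""


    def handle_missing_device(fn):
        """Translate 'No such device or address' failures into a specific
        exception instead of letting the raw trace hit the logs."""
        @functools.wraps(fn)
        def wrapper(path, *args, **kwargs):
            try:
                return fn(path, *args, **kwargs)
            except processutils.ProcessExecutionError as exc:
                if 'No such device or address' in (exc.stderr or ''):
                    LOG.info('Device %s is gone; it was most likely detached '
                             'or deleted concurrently.', path)
                    raise VolumeDeviceMissing(path)
                raise
        return wrapper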

Revision history for this message
Matt Riedemann (mriedem) wrote :

Some investigation in IRC:

(2:26:59 PM) mriedem: here is when update_available_resource starts:
(2:26:59 PM) mriedem: 2014-09-19 10:23:00.539 21095 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /opt/stack/new/nova/nova/openstack/common/periodic_task.py:193
(2:27:05 PM) mriedem: 2014-09-19 10:23:00.540 21095 DEBUG nova.openstack.common.lockutils [-] Acquired semaphore "compute_resources" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
(2:27:05 PM) alaski: I'm looking about three lines up
(2:27:26 PM) mriedem: then we fail and release that lock
(2:27:26 PM) alaski: 2014-09-19 10:23:07.991 DEBUG nova.openstack.common.lockutils [req-5a467708-5da4-446b-b729-c44d005f84c3 DeleteServersTestJSON-14297441 DeleteServersTestJSON-1574463718] Using existing semaphore "compute_resources" internal_lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:202
(2:27:26 PM) mriedem: 2014-09-19 10:24:31.404 21095 DEBUG nova.openstack.common.lockutils [-] Semaphore / lock released "update_available_resource" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:275
(2:27:56 PM) mriedem: 10:23:00 to 10:24:31
(2:27:56 PM) alaski: 2014-09-19 10:24:31.623 DEBUG nova.openstack.common.lockutils [req-5a467708-5da4-446b-b729-c44d005f84c3 DeleteServersTestJSON-14297441 DeleteServersTestJSON-1574463718] Acquired semaphore "compute_resources" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
(2:27:56 PM) alaski: req-5a467708-5da4-446b-b729-c44d005f84c3 is what's timing out
(2:28:07 PM) mriedem: yeah so that's your 90 seconds
(2:28:13 PM) mriedem: where we have a 60 second messaging timeout in between
(2:29:16 PM) alaski: yep
(2:30:40 PM) alaski: it seems that update_available_resources needs to be reworked so that it doesn't hold the lock too long
(2:30:49 PM) mriedem: yeah

Changed in nova:
status: Confirmed → Triaged
Matt Riedemann (mriedem)
tags: added: libvirt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/122873

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/122874

Revision history for this message
Sean Dague (sdague) wrote :

Moving up to critical, as this is bouncing us quite a bit in the gate now

Changed in nova:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/122873
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9a570111d953f7b2a5bd2b4f47dda353f2aba78c
Submitter: Jenkins
Branch: master

commit 9a570111d953f7b2a5bd2b4f47dda353f2aba78c
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 19 12:49:31 2014 -0700

    Handle volume bdm not found in lvm.get_volume_size

    We're stack tracing in successful gate runs because of a race when
    deleting an instance and when the update_available_resource periodic
    task is running.

    When we hit the race, the volume that's backing the instance is deleted
    and the periodic task (virt driver) is trying to get the current
    information about the volume, which no longer exists and we fail and
    short-circuit the update_available_resource task.

    This change adds a decorator which checks if a ProcessExecutionError is
    due to the device no longer existing and if so, raises
    VolumeBDMPathNotFound, and applies that decorator to the get_volume_size
    method.

    Related-Bug: #1371677

    Change-Id: Ic452df82c61087efda8d742e030076421c0c813b

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/122874
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=76bc4ac549b2c8b70b67458717e793defb048d2c
Submitter: Jenkins
Branch: master

commit 76bc4ac549b2c8b70b67458717e793defb048d2c
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 19 13:47:42 2014 -0700

    Handle VolumeBDMPathNotFound in _get_disk_over_committed_size_total

    This handles the VolumeBDMPathNotFound error bubbling up from
    _get_instance_disk_info during the update_available_resource periodic
    task.

    Related-Bug: #1371677

    Change-Id: I10c20f3f5f96003fe69fcdc9133c661e6dc009b4
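
Roughly what this second change amounts to, sketched here with invented names (the real driver catches VolumeBDMPathNotFound around its per-instance disk accounting):

    import logging

    LOG = logging.getLogger(__name__)


    class VolumeDeviceMissing(Exception):
        """Stand-in for nova's VolumeBDMPathNotFound."""


    def disk_over_committed_total(disk_infos, over_committed_size):
        """Sum per-disk over-commit, skipping disks whose backing device
        vanished between enumeration and sizing."""
        total = 0
        for info in disk_infos:
            try:
                total += over_committed_size(info)
            except VolumeDeviceMissing:
                # A concurrent detach/delete removed the device; don't let
                # one missing disk abort update_available_resource.
                LOG.info("Skipping disk %s: backing device is gone.",
                         info.get("path", "<unknown>"))
        return total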

Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

I added some timing instrumentation for the update_available_resource periodic task:

http://logs.openstack.org/38/123438/1/check/check-tempest-dsvm-neutron-full/95c1ef2/logs/screen-n-cpu.txt.gz?

Just filtering on ERROR, you can see that it sometimes takes a very long time to run:

2014-09-23 16:51:51.494 32200 ERROR nova.compute.manager [-] update_available_resource took 0.296182155609 secs
2014-09-23 16:52:40.761 32200 ERROR nova.compute.manager [-] update_available_resource took 0.17028093338 secs
2014-09-23 16:53:40.887 32200 ERROR nova.compute.manager [-] update_available_resource took 0.109967947006 secs
2014-09-23 16:54:41.639 32200 ERROR nova.compute.manager [-] update_available_resource took 0.105166912079 secs
2014-09-23 16:55:43.663 32200 ERROR nova.compute.manager [-] update_available_resource took 0.128235816956 secs
2014-09-23 16:56:45.131 32200 ERROR nova.compute.manager [-] update_available_resource took 1.59690380096 secs
2014-09-23 16:57:46.012 32200 ERROR nova.compute.manager [-] update_available_resource took 1.47475099564 secs
2014-09-23 16:58:46.088 32200 ERROR nova.compute.manager [-] update_available_resource took 1.55297207832 secs
2014-09-23 16:59:47.132 32200 ERROR nova.compute.manager [-] update_available_resource took 2.59809398651 secs
2014-09-23 17:00:46.144 32200 ERROR nova.compute.manager [-] update_available_resource took 0.605221986771 secs
2014-09-23 17:00:52.047 32200 ERROR nova.compute.manager [-] [instance: 56a02698-d951-436c-82f4-c0ec4dafa199] An error occurred while refreshing the network cache.
2014-09-23 17:01:47.637 32200 ERROR nova.compute.manager [-] update_available_resource took 1.10360908508 secs
2014-09-23 17:02:50.673 32200 ERROR nova.compute.manager [-] update_available_resource took 2.13903999329 secs
2014-09-23 17:03:50.816 32200 ERROR nova.compute.manager [-] update_available_resource took 1.28159403801 secs
2014-09-23 17:04:51.604 32200 ERROR nova.compute.manager [-] update_available_resource took 1.06992197037 secs
2014-09-23 17:05:51.215 32200 ERROR nova.compute.manager [-] update_available_resource took 0.680285930634 secs
2014-09-23 17:08:04.722 32200 ERROR nova.compute.manager [-] update_available_resource took 73.1881198883 secs
2014-09-23 17:08:08.014 32200 ERROR nova.compute.manager [-] update_available_resource took 0.926635026932 secs
2014-09-23 17:08:54.234 32200 ERROR nova.compute.manager [-] update_available_resource took 0.698902130127 secs
2014-09-23 17:09:54.044 32200 ERROR nova.compute.manager [-] update_available_resource took 0.510345935822 secs
2014-09-23 17:10:56.615 32200 ERROR nova.compute.manager [-] update_available_resource took 1.00940585136 secs
2014-09-23 17:11:56.171 32200 ERROR nova.compute.manager [-] update_available_resource took 0.573462963104 secs
2014-09-23 17:12:59.232 32200 ERROR nova.compute.manager [-] update_available_resource took 1.69808387756 secs
2014-09-23 17:13:59.958 32200 ERROR nova.compute.manager [-] update_available_resource took 0.424647808075 secs
2014-09-23 17:15:01.642 32200 ERROR nova.compute.manager [-] update_available_resource took 1.10785794258 secs
2014-09-23 17:16:03.711 32200 ERROR nova.compute.manager [-] update_available_resource took 0...


Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

Looks like when the periodic task is slow, all the time is consumed in _get_disk_over_committed_size_total:

http://logs.openstack.org/38/123438/2/check/check-tempest-dsvm-neutron-full/bb1dbb6/logs/screen-n-cpu.txt.gz?#_2014-09-24_03_09_56_913

Sean Dague (sdague)
Changed in nova:
milestone: none → juno-rc1
Revision history for this message
Sean Dague (sdague) wrote :

Yeah, this seems to be the command timing out:

2014-09-24 03:07:59.449 32123 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-16ab2d39-59d6-4110-88fa-ab60332adf19-lun-1 execute /opt/stack/new/nova/nova/openstack/common/processutils.py:161
2014-09-24 03:09:56.856 32123 DEBUG nova.openstack.common.processutils [-] Result was 1 execute /opt/stack/new/nova/nova/openstack/common/processutils.py:195

grep processutils screen-n-cpu.txt.gz | grep 32123

To get only the periodic job executions

Revision history for this message
Matt Riedemann (mriedem) wrote :

From IRC, danpb pointed out that blockdev is blocking on purpose, and when the volume is gone that will take a while, so that's what's causing the lock to be held so long:

http://logs.openstack.org/38/123438/2/check/check-tempest-dsvm-neutron-full/bb1dbb6/logs/screen-n-cpu.txt.gz?#_2014-09-24_03_09_56_913

2014-09-24 03:09:56.913 32123 ERROR nova.virt.libvirt.driver [-] _get_disk_over_committed_size_total took 117.755759954 secs

That's taking nearly 2 minutes.

From danpb:

(10:48:04 AM) danpb: sdague: blockdev taking a long time is a sign that the underlying storage is dead
(10:48:20 AM) danpb: eg, could be an iSCSI lun for which the iscsi server is no longer running
(10:48:20 AM) dansmith: and this is blockdev on an iscsi target, right?
(10:48:36 AM) sdague: yep
(10:48:56 AM) sdague: so is there a non blocking way to poke that?
(10:48:56 AM) danpb: so good to check to see that cinder still has the corresponding iscsi server active when this happens
(10:49:00 AM) mriedem: danpb: is there a quicker way to find out up front if the bdm is gone?
(10:49:17 AM) sdague: because the issue is this goes off into a blocking call for 2 minutes
(10:49:17 AM) danpb: not that i know of, off hand
(10:49:21 AM) sdague: which is holding locks
(10:49:27 AM) danpb: generally this turns into an uninterruptable sleep in the kernel
(10:49:31 AM) sdague: and .... other stuff times out
(10:49:42 AM) dansmith: yeah, it's blocking by design
(10:49:50 AM) sdague: dansmith: yep
(10:49:52 AM) mriedem: ok, could we call off to cinder to find out of the volume is deleted?
(10:49:53 AM) dansmith: because you can't really tell that it's gone until it times out
(10:49:56 AM) mriedem: *if
(10:49:57 AM) danpb: being non-blocking on I/O errors is a good way to get data corruption
(10:50:08 AM) danpb: so the kernel generally avoids that

So we need a patch that checks (without blocking) whether the volume is gone before we call blockdev to get the size of the volume; that should free us up here. When I looked at this on Friday, the cinder logs showed that the volume was deleted right around the time we had the blockdev failure and stack trace, so we can probably go back to cinder, see whether the volume has been deleted just by its state, and then short-circuit our work in _get_instance_disk_info.
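
One way such a pre-check could look, purely as a sketch (the helper name is invented, and this was never the merged fix): os.path.exists returns False once the by-path symlink is removed, although, as noted in the IRC log above, it cannot help if the device node still exists and the iSCSI session is merely hung.

    import os
    import subprocess


    def get_block_device_size(path):
        """Return the size in bytes of a block device, or None if the
        device has already been torn down (e.g. a detached cinder volume)."""
        # Cheap, non-blocking check: once the device is removed, the
        # /dev/disk/by-path symlink disappears.
        if not os.path.exists(path):
            return None
        out = subprocess.check_output(
            ['blockdev', '--getsize64', path], text=True)
        return int(out.strip())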

Revision history for this message
Matt Riedemann (mriedem) wrote :

Pulling this together: the update_available_resource periodic task uses the resource tracker to update host stats, and that resource tracker method locks on the 'compute_resources' lock, which is also what the resize_claim and drop_resize_claim methods in the resource tracker lock on. That's where we hit the timeout in the resize test: it is waiting to make a resize claim, but the blocking blockdev call in the update_available_resource periodic task takes too long, causing the 60-second RPC messaging timeout for the resize claim in the test run.
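
To make the contention concrete, here is a stripped-down model of the pattern (not the actual resource tracker code): both paths serialize on the same in-process lock, so a blockdev call that blocks for ~2 minutes inside update_available_resource stalls resize_claim past the 60-second RPC timeout.

    import threading

    # Stand-in for the "compute_resources" semaphore the resource tracker
    # wraps around both the periodic update and the resize claim.
    COMPUTE_RESOURCES_LOCK = threading.Lock()


    def update_available_resource(get_disk_info):
        with COMPUTE_RESOURCES_LOCK:
            # If this ends up in a blocking blockdev call on a dead iSCSI
            # device, the lock is held for the full ~2 minute timeout.
            return get_disk_info()


    def resize_claim(apply_claim):
        # The RPC caller only waits 60 seconds for this to return, so a
        # slow periodic task above becomes a MessagingTimeout / 500.
        with COMPUTE_RESOURCES_LOCK:
            return apply_claim()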

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/123774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/123774
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=84ff466c8338cc44ca92a0f1673e204dd720c7d9
Submitter: Jenkins
Branch: master

commit 84ff466c8338cc44ca92a0f1673e204dd720c7d9
Author: Dan Smith <email address hidden>
Date: Wed Sep 24 09:05:26 2014 -0700

    Reduce the scope of RT work while holding the big lock

    This makes the RT.update_available_resource() method do what it
    can in the way of prep work before grabbing the lock to update
    the database stats.

    Change-Id: I1132e2ab99427688529566c1a1fffb33eab07f3f
    Related-Bug: #1371677
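
The shape of that change, roughly: do the slow hypervisor queries before taking the lock and keep only the bookkeeping inside it. A toy model with invented names, not the actual resource tracker code:

    import threading


    class ResourceTracker(object):
        """Do slow hypervisor queries lock-free; serialize only the
        reconciliation and publication of the numbers."""

        def __init__(self, driver):
            self.driver = driver
            self._lock = threading.Lock()
            self.stats = {}

        def update_available_resource(self):
            # Potentially slow (libvirt, blockdev) work, done outside the lock.
            resources = self.driver.get_available_resource()

            # Short critical section: update the published stats.
            with self._lock:
                self.stats = dict(resources)
                return self.stats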

Revision history for this message
Matt Riedemann (mriedem) wrote :

Since https://review.openstack.org/#/c/123774/ merged there has been a significant drop-off in the race, so it looks like that stemmed the bleeding. This is probably not a critical rc1 issue now. We should still put something in the libvirt driver to check whether the volume bdm exists before calling the blocking blockdev command to get the volume size, but that could be a follow-on improvement.

Michael Still (mikal)
Changed in nova:
milestone: juno-rc1 → none
Revision history for this message
Sean Dague (sdague) wrote :

We've landed a related fix for this. It's going to take 24 hours to figure out whether this is under control; if not, we will reopen.

I think this is still an rc1 bug; it's something we've been focusing a ton of time on lately.

Changed in nova:
milestone: none → juno-rc1
status: Triaged → Fix Committed
Revision history for this message
Daniel Berrange (berrange) wrote :

I've been looking at the following log file:

http://logs.openstack.org/23/119523/9/check/check-tempest-dsvm-neutron-full/41505f7/logs/screen-n-cpu.txt.gz?level=TRACE#_2014-09-22_19_34_39_385

We can see a hot-unplug operation start for a cinder volume 6f88ab3e-9edd-4fbf-857e-2d6d87f98e39 at 19:32:32

2014-09-22 19:32:32.963 AUDIT nova.compute.manager [req-840df9bc-aee9-48d9-baa3-d58392954b0d ServerRescueNegativeTestJSON-161189933 ServerRescueNegativeTestJSON-667937194] [instance: 7b20e76a-48a0-4dd4-b2b6-bb3b1056b711] Detach volume 6f88ab3e-9edd-4fbf-857e-2d6d87f98e39 from mountpoint /dev/vdb

At 19:32:52 the periodic task starts to update resources

2014-09-22 19:32:52.747 31561 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /opt/stack/new/nova/nova/openstack/common/periodic_task.py:193

The libvirt get_available_resource method calls _get_instance_disk_info on each running VM.

This method queries the libvirt guest XML to get a list of disks present on the VM and queries the database to get a list of attached cinder volumes. It then filters the cinder volumes out of the list of disks from the XML.

Nonetheless, at 19:32:53 we see the code try to run blockdev on the iSCSI LUN with serial number 6f88ab3e-9edd-4fbf-857e-2d6d87f98e39:

2014-09-22 19:32:53.025 31561 DEBUG nova.openstack.common.processutils [-] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f88ab3e-9edd-4fbf-857e-2d6d87f98e39-lun-1 execute /opt/stack/new/nova/nova/openstack/common/processutils.py:161

This path is the Cinder volume we just started to hot-unplug at 19:32:32 (i.e. 20 seconds earlier).

What this says is that from Nova's point of view the block device is no longer attached, but from libvirt's XML point of view it is still attached.

At 19:34:39 the blockdev command times out

2014-09-22 19:34:39.385 31561 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f88ab3e-9edd-4fbf-857e-2d6d87f98e39-lun-1
Exit code: 1
Stdout: u''
Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-6f88ab3e-9edd-4fbf-857e-2d6d87f98e39-lun-1: No such device or address\n'

This indicates that the iSCSI volume has now been removed, presumably because the earlier cinder detach operation has succeeded.

It is conceivable that if the 'blockdev' command tries to access the iSCSI lun at precisely the wrong moment it might hang as the iSCSI connection is torn down, rather than getting the 'No such device or address' error.

So to me the root-cause problem here is the _get_instance_disk_info method. The way it is written, querying the libvirt XML and filtering against known cinder volumes, is clearly racy. If we can rewrite the method to avoid that race, then we'd in turn avoid running the blockdev command on cinder volumes & thus av...
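
A condensed model of the racy pattern described above, with invented helper names: the domain XML and the Nova DB are sampled at different moments, so a volume detached in between is neither filtered out nor still present by the time blockdev runs.

    def instance_disk_info(list_domain_disks, list_attached_volume_paths,
                           get_block_size):
        """Racy: the libvirt XML (first snapshot) and the Nova DB (second
        snapshot) can disagree during a detach, so a cinder volume may slip
        through the filter and then be sized after its device is torn down."""
        xml_disks = list_domain_disks()                    # domain XML snapshot
        volume_paths = set(list_attached_volume_paths())   # DB snapshot

        info = []
        for disk in xml_disks:
            if disk['path'] in volume_paths:
                continue  # believed to be a cinder volume -> skip it
            # If a detach completed between the two snapshots, this is still
            # a (now-vanishing) volume and get_block_size() blocks on it.
            info.append({'path': disk['path'],
                         'size': get_block_size(disk['path'])})
        return info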


Revision history for this message
Matt Riedemann (mriedem) wrote :

The race appears to be fixed by Dan Smith's locking change: now that logstash results are back, this hasn't shown up since his change merged.

Changed in nova:
assignee: nobody → Dan Smith (danms)
Revision history for this message
Daniel Berrange (berrange) wrote :

I doubt that Dan's locking change fixed it; it probably just made it less likely to occur. From code inspection alone I can see the potential for the race is there in the libvirt driver code.

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

One more thing:

Looking at the original report (http://logs.openstack.org/10/110110/40/check/check-tempest-dsvm-postgres-full/4cd8a81/logs/screen-n-cpu.txt.gz#_2014-09-19_10_24_31_404) and Matt's fix here, https://review.openstack.org/122874: we should never be hitting that exception in the gate, because this exception should never happen in _get_instance_disk_info UNLESS we are running LVM-backed instances, which as far as I understand never happens in the gate.

So there seems to be another bug here which we should report.

Revision history for this message
Daniel Berrange (berrange) wrote :

Don't be misled by the fact that "get_volume_size" is in the lvm.py module. This method is actually called for *any* block device, not only LVM. So the call is explained by the race with cinder iSCSI volumes I mentioned in comment #23. LVM is a red herring.

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

As Dan nicely points out in comment #23, this is due to the race he described. The real bug is using the libvirt XML to find out about local disks (or anything else, for that matter), and especially mixing that with Nova DB information. I will report a bug for this.

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/174836
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=833357301bc80a27422f7bf081fae2d3da730a24
Submitter: Jenkins
Branch: master

commit 833357301bc80a27422f7bf081fae2d3da730a24
Author: Nikola Dipanov <email address hidden>
Date: Fri Apr 17 12:49:13 2015 +0100

    libvirt: make _get_instance_disk_info conservative

    We want to make sure we never try to get the size of an attached volume
    when doing _get_instance_disk_info (as this can cause issues when
    gathering available resources).

    libvirt's get_available_resources will call upon it to determine the
    available disk size on the compute node, but do so without providing
    information about block devices. This makes _get_instance_disk_info make
    incorrect guesses as to which device is a volume

    This patch makes the _get_instance_disk_info be more conservative about
    it's guesses when it can not reasonably determine if a device is a
    volume or not.

    Change-Id: Ifb20655c32896b640672917e3840add81b136780
    Partial-bug: #1445021
    Partial-Bug: #1371677
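
Roughly what "conservative" means here, again with illustrative names rather than the real driver code: when no block-device info is supplied, only size devices that are positively identifiable as local file-backed disks and skip everything else rather than guess.

    def local_disks_conservative(domain_disks, known_volume_paths=None):
        """Return only devices we can positively identify as local disks;
        when in doubt, leave a device out rather than risk a blocking
        blockdev call on an attached cinder volume."""
        volumes = set(known_volume_paths or [])
        local = []
        for disk in domain_disks:
            if disk['path'] in volumes:
                continue
            if disk.get('type') != 'file':
                # Without BDM info we cannot tell a local block device from
                # an attached volume, so be conservative and skip it.
                continue
            local.append(disk)
        return local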

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/179500

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/kilo)

Change abandoned by Jay Bryant (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/179500
Reason: No one has been begging for this. Abandoning.
