OpenStack Compute (nova)

Bug #1371677
Comment #17

Matt Riedemann (mriedem) wrote on 2014-09-24:

From IRC, danpb pointed out that blockdev blocks on purpose, and when the volume is gone the call takes a long while to return, which is why the lock is held for so long:

http://logs.openstack.org/38/123438/2/check/check-tempest-dsvm-neutron-full/bb1dbb6/logs/screen-n-cpu.txt.gz?#_2014-09-24_03_09_56_913

2014-09-24 03:09:56.913 32123 ERROR nova.virt.libvirt.driver [-] _get_disk_over_committed_size_total took 117.755759954 secs

That's taking nearly 2 minutes.

From the IRC discussion with danpb:

(10:48:04 AM) danpb: sdague: blockdev taking a long time is a sign that the underlying storage is dead
(10:48:20 AM) danpb: eg, could be an iSCSI lun for which the iscsi server is no longer running
(10:48:20 AM) dansmith: and this is blockdev on an iscsi target, right?
(10:48:36 AM) sdague: yep
(10:48:56 AM) sdague: so is there a non blocking way to poke that?
(10:48:56 AM) danpb: so good to check to see that cinder still has the corresponding iscsi server active when this happens
(10:49:00 AM) mriedem: danpb: is there a quicker way to find out up front if the bdm is gone?
(10:49:17 AM) sdague: because the issue is this goes off into a blocking call for 2 minutes
(10:49:17 AM) danpb: not that i know of, off hand
(10:49:21 AM) sdague: which is holding locks
(10:49:27 AM) danpb: generally this turns into an uninterruptable sleep in the kernel
(10:49:31 AM) sdague: and .... other stuff times out
(10:49:42 AM) dansmith: yeah, it's blocking by design
(10:49:50 AM) sdague: dansmith: yep
(10:49:52 AM) mriedem: ok, could we call off to cinder to find out of the volume is deleted?
(10:49:53 AM) dansmith: because you can't really tell that it's gone until it times out
(10:49:56 AM) mriedem: *if
(10:49:57 AM) danpb: being non-blocking on I/O errors is a good way to get data corruption
(10:50:08 AM) danpb: so the kernel generally avoids that

So we need a patch that checks, without blocking, whether the volume is gone before we call blockdev to get its size; that should free us up here. When I looked at this on Friday, the cinder logs showed that the volume was deleted right around the time of the blockdev failure and stacktrace, so we can probably go back to cinder, see from its state whether the volume has been deleted, and then short-circuit our work in _get_instance_disk_info.
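
A rough sketch of that short circuit, to make the idea concrete. This is not the actual nova patch: the function names, the set of "gone" states, and the two injected callables are all hypothetical stand-ins (one for asking cinder about the volume's state, one for the blocking blockdev size call).

```python
# Hypothetical sketch only: none of these names are nova's or cinder's real API.

# Assumed set of volume states that mean "do not touch the block device".
GONE_STATES = {"deleting", "deleted", "error_deleting"}


def volume_is_gone(volume_id, get_volume_status):
    """Return True if the volume is reported as deleted or is unknown.

    get_volume_status is an assumed callable mapping a volume id to a
    status string, or to None when the volume cannot be found.
    """
    status = get_volume_status(volume_id)
    return status is None or status in GONE_STATES


def total_volume_size(devices, get_volume_status, get_blockdev_size):
    """Sum the sizes of attached volumes, skipping ones that are gone.

    devices is a list of (volume_id, device_path) pairs.
    get_blockdev_size stands in for the blocking blockdev size call.
    """
    total = 0
    for volume_id, path in devices:
        if volume_is_gone(volume_id, get_volume_status):
            # Short circuit: never issue the blocking call for a dead volume.
            continue
        total += get_blockdev_size(path)
    return total
```

Note that this only narrows the window: the volume can still be deleted between the state check and the blockdev call, so the blocking case remains possible, just much less likely.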
