attaching a volume takes too long, and if another volume is attached at the same time, creating the bdm times out

Bug #1897424 reported by Jorhson Deng
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Incomplete
Importance: Wishlist
Assigned to: Lee Yarwood
Milestone: (none)

Bug Description

2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi [req-408c9350-fe1e-11ea-b09f-ad32aa0197ce req-1903c5bf-9263-4afe-b72f-9f3e7295c790 4e6abdba9163438dafc7219cb119d222 f34ca0cc0b664331ba7f6220a816bd81 - default default] Unexpected exception in API method: MessagingTimeout: Timed out waiting for a reply to message ID 998b0b462864458f9a24dbb0896f9093
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 801, in wrapped
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 110, in wrapper
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/inspur/api/volumes.py", line 331, in create
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi supports_multiattach=supports_multiattach)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/inspur/compute/api_inspur.py", line 1327, in attach_volume
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi is_root=is_root)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/inspur/compute/api_inspur.py", line 1345, in _attach_volume
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi device_type=device_type, tag=tag)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/api.py", line 3958, in _create_volume_bdm
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi multiattach=volume['multiattach'])
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/rpcapi.py", line 897, in reserve_block_device_name
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi return cctxt.call(ctxt, 'reserve_block_device_name', **kw)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi retry=self.retry)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi retry=retry)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi call_monitor_timeout, retry=retry)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi call_monitor_timeout)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi message = self.waiters.get(msg_id, timeout=timeout)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi 'to message ID %s' % msg_id)
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi MessagingTimeout: Timed out waiting for a reply to message ID 998b0b462864458f9a24dbb0896f9093
2020-09-24 12:26:46.758 29 ERROR nova.api.openstack.wsgi
2020-09-24 12:26:46.763 29 INFO nova.api.openstack.wsgi [req-408c9350-fe1e-11ea-b09f-ad32aa0197ce req-1903c5bf-9263-4afe-b72f-9f3e7295c790 4e6abdba9163438dafc7219cb119d222 f34ca0cc0b664331ba7f6220a816bd81 - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'oslo_messaging.exceptions.MessagingTimeout'>
2020-09-24 12:26:46.789 29 INFO nova.osapi_compute.wsgi.server [req-408c9350-fe1e-11ea-b09f-ad32aa0197ce req-1903c5bf-9263-4afe-b72f-9f3e7295c790 4e6abdba9163438dafc7219cb119d222 f34ca0cc0b664331ba7f6220a816bd81 - default default] Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/wsgi.py", line 521, in handle_one_response
    write(b''.join(towrite))
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/wsgi.py", line 462, in write
    wfile.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/greenio/base.py", line 390, in sendall
    tail = self.send(data, flags)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/greenio/base.py", line 384, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/greenio/base.py", line 371, in _send_loop
    return send_method(data, *args)
error: [Errno 104] Connection reset by peer

Revision history for this message
Lee Yarwood (lyarwood) wrote :

That's by design and suggests another issue on the compute itself within reserve_block_device_name.

Can you trace the request on the actual compute and provide logs?

Changed in nova:
status: New → Incomplete
assignee: nobody → Lee Yarwood (lyarwood)
Revision history for this message
Jorhson Deng (jorhson) wrote :

When attaching volumes back to back, if the current volume attach takes too long in nova-compute and another attach request reaches nova-api in the meantime, nova-api sends an RPC call to nova-compute to create the bdm. If nova-compute does not return the response in time, nova-api times out waiting and raises an exception. However, nova-compute will still create the bdm once the previous volume attach finishes. The log above is from attaching the second volume.
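The sequence can be illustrated with a minimal, self-contained simulation (plain Python threads and queues, not nova or oslo.messaging code; all names are made up for illustration): the caller times out, yet the busy worker still processes the queued request afterwards, so the bdm ends up created even though the API already returned an error.

    import queue
    import threading
    import time

    requests = queue.Queue()

    def compute_worker():
        # Simulates nova-compute: busy with the first (slow) attach before it
        # even looks at the queued request for the second volume.
        time.sleep(5)
        req, reply = requests.get()
        reply.put('bdm created for %s' % req)
        print('compute: finished %s after the caller already gave up' % req)

    def api_call(req, timeout):
        # Simulates n-api: send the request and block waiting for a reply,
        # much like cctxt.call() bounded by rpc_response_timeout.
        reply = queue.Queue()
        requests.put((req, reply))
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError('Timed out waiting for a reply to %s' % req)

    threading.Thread(target=compute_worker, daemon=True).start()
    try:
        api_call('volume-2', timeout=2)
    except TimeoutError as exc:
        print('api: %s' % exc)      # what the user sees as the API error
    time.sleep(4)                   # the worker still finishes the request anyway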

Revision history for this message
Lee Yarwood (lyarwood) wrote :

The logs in c#0 are of a MessagingTimeout raised on n-api.

As I said in c#1 if we have somehow created the bdm in n-cpu and still timed out when replying to n-api then it suggests there's an issue within reserve_block_device_name. Can you please provide logs from the n-cpu process on the other side of this so we can trace why this is timing out in your env?

Revision history for this message
Jorhson Deng (jorhson) wrote :

In my environment, the block storage sometimes takes a long time to connect. So if I attach volumes one after another, the later volume has to wait for the earlier volume to finish attaching, and that is why creating the bdm times out in n-api. The problem can be reproduced by adding a delay in the attach_volume function. We also deal with different volume types, so the timeout can't always be predicted.
I think we should add a task state to avoid this problem.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Again, this timeout has nothing to do with the time taken to connect volumes. The RPCAPI call that's timing out in your environment is reserve_block_device_name:

https://github.com/openstack/nova/blob/2745e685376abbc4c32516837f6074a3de23aa24/nova/compute/manager.py#L6930-L6969

To understand why that's taking so long we really need to see n-cpu logs covering req-408c9350-fe1e-11ea-b09f-ad32aa0197ce. Your task state change will do nothing to avoid this btw.
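For reference, a simplified sketch of the client side that is timing out, paraphrased from the rpcapi.py frame in the traceback above (not a verbatim copy of nova's code; argument names are illustrative): cctxt.call() is a synchronous RPC that blocks until nova-compute replies or oslo.messaging's rpc_response_timeout expires, at which point MessagingTimeout is raised back into the API request shown in the description.

    def reserve_block_device_name(self, ctxt, instance, **kw):
        # prepare() routes the call to the compute host owning the instance
        cctxt = self.client.prepare(server=instance.host)
        # call() blocks until a reply arrives or rpc_response_timeout expires,
        # in which case oslo_messaging.exceptions.MessagingTimeout is raised.
        return cctxt.call(ctxt, 'reserve_block_device_name', **kw)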

Revision history for this message
Jorhson Deng (jorhson) wrote :

Maybe I have not described this clearly in c#4.
The scenario is attaching more than one volume: the time is not spent in reserve_block_device_name itself, but in the attach_volume function of n-cpu. If the first volume's attach takes too long in attach_volume and I attach a second volume at the same time, n-api posts the RPC call to n-cpu, but n-cpu does not handle the second volume's request because it is still busy attaching the first one. So when the first attach takes too long, the second volume's _create_volume_bdm call in n-api times out. n-cpu does not know that n-api has timed out, and it handles the RPC request only after the first volume attach has completed.
There are also no logs from reserve_block_device_name for the second volume in n-cpu; it just handles the RPC request and returns the BDM object. The time is not spent in reserve_block_device_name, but in the attach_volume function in manager.py for the first volume.
If the instance had a task state indicating that a volume attach is in progress, attaching another volume could be forbidden until the current attach completes.
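A rough sketch of what that guard could look like on the n-api side (hypothetical, not existing nova code; the ATTACHING_VOLUME state name and the helper are assumptions made up for illustration):

    from nova import exception

    ATTACHING_VOLUME = 'attaching_volume'   # hypothetical new task state

    def _check_and_set_attach_task_state(instance):
        # Refuse a second attach while any task (including a previous attach)
        # is still in flight for this instance.
        if instance.task_state is not None:
            raise exception.InstanceInvalidState(
                instance_uuid=instance.uuid, attr='task_state',
                state=instance.task_state, method='attach_volume')
        instance.task_state = ATTACHING_VOLUME
        instance.save()
        # n-api would then issue reserve_block_device_name / attach_volume as
        # today, and n-cpu would reset task_state to None once the attach is done.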

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Right, apologies, I forgot that reserve_block_device_name also takes an instance.uuid lock on the compute [1], so we can end up in a situation where an initial request to attach a volume holds the lock during attach_volume [2] while a second request waits at the instance.uuid lock in reserve_block_device_name, allowing the RPC call to time out back on n-api (see the condensed sketch after the links below).

An instance task state does sound like a good idea to avoid this as we can't move the instance.uuid lock over to n-api to cover both the reserve and attach as it would delay the actual API response for too long IMHO.

An alternative would be to drop the RPC call to reserve_block_device_name entirely, as the returned device isn't guaranteed to be the actual device that ends up being presented within the guest OS anyway by most hypervisors (namely libvirt). We could always create the bdm in the API and update the device once attached on the compute.

IMHO either approach is going to require a spec to hammer out the details.

[1] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/compute/manager.py#L6946
[2] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/compute/manager.py#L6978
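Condensed view of the pattern [1] and [2] point at (paraphrased, not verbatim nova code): both RPC handlers serialize on the same per-instance lock, so while the slow do_attach_volume() for the first volume holds it, do_reserve() for the second volume just waits, and the caller's rpc_response_timeout on n-api expires.

    from nova import utils

    class ComputeManager(object):

        def reserve_block_device_name(self, context, instance, device,
                                      volume_id, **kwargs):
            @utils.synchronized(instance.uuid)
            def do_reserve():
                # creates and returns the BlockDeviceMapping for the new volume;
                # cheap on its own, but it can't start until the lock is free
                ...
            return do_reserve()

        def attach_volume(self, context, instance, bdm):
            @utils.synchronized(instance.uuid)
            def do_attach_volume():
                # talks to cinder and the hypervisor; this is the slow part that
                # can hold the lock for longer than rpc_response_timeout
                ...
            return do_attach_volume()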

Revision history for this message
Brin Zhang (zhangbailin) wrote :

An instance task state does sound like a good idea to avoid this as we can't move the instance.uuid lock over to n-api to cover both the reserve and attach as it would delay the actual API response for too long IMHO.
-----
I prefer to fix the current problem in this way, which solves it fundamentally.

yes, we need a spec to hammer out the details.

Changed in nova:
importance: Undecided → Wishlist