Tempest failures on NodeLocked/Conflict errors need workaround

Bug #1316773 reported by Adam Gandelman
This bug affects 4 people
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: David Shrewsbury
Milestone:

Bug Description

Tempest seems to be periodically failing in the tearDown of tempest.api.baremetal.test_nodes. ir-api claims the node is locked and operations cannot complete:

2014-05-06 15:31:31.117 8282 ERROR wsme.api [-] Server-side error: "Node ad0119dc-ae45-422c-9573-2bad141e0f99 is locked by host localhost, please retry after the current operation is complete

http://logs.openstack.org/76/88476/8/gate/gate-tempest-dsvm-ironic/11cb20a/logs/

Revision history for this message
Dmitry Tantsur (divius) wrote :

I can confirm this problem (with the exception that the error is now Conflict). This is something that should be fixed by https://review.openstack.org/#/c/94923/ in the long term; for now it may need a workaround either in Tempest or in the client (just retrying should be enough).
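The short-term workaround described here (just retrying) could look roughly like the sketch below on the client side. The `Conflict` class, the attempt count, and the delay are illustrative stand-ins, not actual Tempest or python-ironicclient code:

```python
import time


class Conflict(Exception):
    """Stand-in for the HTTP 409 error returned while a node is locked."""


def call_with_retry(fn, attempts=5, delay=1.0):
    """Retry a client call that fails with Conflict while a node is locked.

    The attempt count and delay are illustrative values; a real workaround
    would tune them to the expected duration of conductor operations.
    """
    for i in range(attempts):
        try:
            return fn()
        except Conflict:
            if i == attempts - 1:
                raise  # exhausted the retry budget; surface the error
            time.sleep(delay)
```

A Tempest tearDown (or the client's delete path) would wrap the failing call, e.g. `call_with_retry(lambda: client.delete_node(uuid))`, instead of failing on the first Conflict.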

summary: - tempest failures on NodeLocked errors
+ Tempest failures on NodeLocked errors needs workaround
Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
tags: added: low-hanging-fruit
summary: - Tempest failures on NodeLocked errors needs workaround
+ Tempest failures on NodeLocked/Conflict errors need workaround
aeva black (tenbrae)
Changed in ironic:
importance: Medium → High
Changed in ironic:
assignee: nobody → David Shrewsbury (dshrews)
Changed in ironic:
assignee: David Shrewsbury (dshrews) → nobody
aeva black (tenbrae)
Changed in ironic:
milestone: none → juno-3
Revision history for this message
aeva black (tenbrae) wrote :

Addressed by the following patch.

commit 428c24767575419d5b472b52f2b622588a3a7253
Author: David Shrewsbury <email address hidden>
Date: Thu Jul 17 09:41:00 2014 -0400

    Implement retry on NodeLocked exceptions

    This adds the capability to retry node locking when a node is already
    locked. Two conductor configuration variables are added to control this
    behavior, which can be totally disabled by setting the value for the
    node_locked_retry_attempts variable to zero.

    A new unit test is added to check that the retry will work after first
    getting a NodeLocked exception, then succeeding getting the lock on the
    next lock attempt.

    Change-Id: I60b099beea1bc3a954a5ab4699e623aaa71ba6c5
    Blueprint: add-nodelocked-retry
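The behavior the commit message describes (retry lock acquisition, controlled by conductor config, disabled when node_locked_retry_attempts is zero) can be sketched as follows. This is a minimal illustration, not the actual conductor code; the real implementation wires the retry through the `retrying` library, as the traceback later in this bug shows:

```python
import time


class NodeLocked(Exception):
    """Raised when another conductor already holds the node's reservation."""


# Illustrative config values mirroring the options the commit describes.
node_locked_retry_attempts = 3   # extra attempts; 0 disables retrying entirely
node_locked_retry_interval = 0   # seconds between attempts (0 here for demo)


def reserve_node(try_reserve):
    """Attempt to reserve a node, retrying on NodeLocked.

    `try_reserve` stands in for the DB-level reservation call
    (objects.Node.reserve in the traceback below).
    """
    attempts = 1 + node_locked_retry_attempts  # initial try plus retries
    for i in range(attempts):
        try:
            return try_reserve()
        except NodeLocked:
            if i == attempts - 1:
                raise  # retries exhausted; the lock is still held
            time.sleep(node_locked_retry_interval)
```

With node_locked_retry_attempts set to 0 the loop runs exactly once, so the first NodeLocked propagates immediately, which matches the "totally disabled" behavior described above.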

Changed in ironic:
assignee: nobody → David Shrewsbury (dshrews)
status: Triaged → Fix Committed
milestone: juno-3 → juno-2
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-2 → 2014.2
Revision history for this message
Li Shang (li-shang) wrote :

It seems this issue still occurs on Juno 2014.2; my environment is stable/juno. When I try to delete an Ironic node, ironic-api logs the related error, and even after waiting a long time the node still can't be deleted:

$ ironic node-delete 3645aead-1f74-418e-9e51-1cc6c31d3af4
'unicode' object has no attribute 'get'

2014-11-21 18:39:06.637 DEBUG oslo.messaging._drivers.amqp [-] UNIQUE_ID is e838df558ddc49fd9e21c61b8b5c9f36. from (pid=5382) _add_unique_id /usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqp.py:246
2014-11-21 18:39:08.695 WARNING wsme.api [-] Client-side error: Node 3645aead-1f74-418e-9e51-1cc6c31d3af4 is locked by host localhost, please retry after the current operation is completed.
Traceback (most recent call last):

  File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/server.py", line 139, in inner
    return func(*args, **kwargs)

  File "/opt/stack/ironic/ironic/conductor/manager.py", line 1053, in destroy_node
    with task_manager.acquire(context, node_id) as task:

  File "/opt/stack/ironic/ironic/conductor/task_manager.py", line 132, in acquire
    driver_name=driver_name)

  File "/opt/stack/ironic/ironic/conductor/task_manager.py", line 192, in __init__
    self.release_resources()

  File "/usr/local/lib/python2.7/dist-packages/oslo/utils/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)

  File "/opt/stack/ironic/ironic/conductor/task_manager.py", line 184, in __init__
    reserve_node()

  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)

  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 229, in call
    raise attempt.get()

  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])

  File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)

  File "/opt/stack/ironic/ironic/conductor/task_manager.py", line 180, in reserve_node
    self.node = objects.Node.reserve(context, CONF.host, node_id)

  File "/opt/stack/ironic/ironic/objects/base.py", line 109, in wrapper
    result = fn(cls, context, *args, **kwargs)

  File "/opt/stack/ironic/ironic/objects/node.py", line 165, in reserve
    db_node = cls.dbapi.reserve_node(tag, node_id)

  File "/opt/stack/ironic/ironic/db/sqlalchemy/api.py", line 229, in reserve_node
    host=node['reservation'])

NodeLocked: Node 3645aead-1f74-418e-9e51-1cc6c31d3af4 is locked by host localhost, please retry after the current operation is completed.
