boot server request randomly hanging at n-cpu side, and didn't get to Ironic

Bug #1651678 reported by Vasyl Saienko
Affects                    Status         Importance  Assigned to
Ironic                     Invalid        Critical    Unassigned
OpenStack Compute (nova)   Fix Released   High        Jay Pipes
nova (Newton series)       Fix Committed  Medium      Matt Riedemann

Bug Description

Ironic gate jobs have been randomly timing out during the last few weeks:

An example is: http://logs.openstack.org/46/327046/36/check/gate-tempest-dsvm-ironic-ipa-partition-pxe_ipmitool-tinyipa-ubuntu-xenial/48db3ea/console.html

2016-12-20 23:30:24.418214 | Traceback (most recent call last):
2016-12-20 23:30:24.418231 | File "tempest/test.py", line 99, in wrapper
2016-12-20 23:30:24.418248 | return f(self, *func_args, **func_kwargs)
2016-12-20 23:30:24.418296 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/ironic_tempest_plugin/tests/scenario/test_baremetal_basic_ops.py", line 111, in test_baremetal_server_ops
2016-12-20 23:30:24.418316 | self.instance, self.node = self.boot_instance()
2016-12-20 23:30:24.418361 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/ironic_tempest_plugin/tests/scenario/baremetal_manager.py", line 173, in boot_instance
2016-12-20 23:30:24.418375 | self.wait_node(instance['id'])
2016-12-20 23:30:24.418417 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/ironic_tempest_plugin/tests/scenario/baremetal_manager.py", line 117, in wait_node
2016-12-20 23:30:24.418441 | raise lib_exc.TimeoutException(msg)
2016-12-20 23:30:24.418464 | tempest.lib.exceptions.TimeoutException: Request timed out
2016-12-20 23:30:24.418494 | Details: Timed out waiting to get Ironic node by instance id 50e23a00-5b92-49b7-8dd0-5b8715ba7e26

Nova compute seems stuck at "_do_build_and_run_instance /opt/stack/new/nova/nova/compute/manager.py:1754"

2016-12-21 13:24:24.307 21735 DEBUG oslo_messaging._drivers.amqpdriver [-] received message with unique_id: 3b9dab54da604a8cadc6c854588a1a5d __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:196
2016-12-21 13:24:24.312 21735 DEBUG oslo_concurrency.lockutils [req-7b291e0c-c5b3-4a8a-b4db-e7cef3150b03 tempest-BaremetalBasicOps-1775111554 tempest-BaremetalBasicOps-1775111554] Lock "6376a75b-2970-42f5-9f1b-b34db22a23e4" acquired by "nova.compute.manager._locked_do_build_and_run_instance" :: waited 0.000s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270
2016-12-21 13:24:24.313 21735 DEBUG oslo_messaging._drivers.amqpdriver [req-7b291e0c-c5b3-4a8a-b4db-e7cef3150b03 tempest-BaremetalBasicOps-1775111554 tempest-BaremetalBasicOps-1775111554] CALL msg_id: 92cc73436d164feab727c5b7c81ec179 exchange 'nova' topic 'conductor' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:442
2016-12-21 13:24:24.326 21735 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 92cc73436d164feab727c5b7c81ec179 __call__ /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:299
2016-12-21 13:24:24.327 21735 DEBUG nova.compute.manager [req-7b291e0c-c5b3-4a8a-b4db-e7cef3150b03 tempest-BaremetalBasicOps-1775111554 tempest-BaremetalBasicOps-1775111554] [instance: 6376a75b-2970-42f5-9f1b-b34db22a23e4] Starting instance... _do_build_and_run_instance /opt/stack/new/nova/nova/compute/manager.py:1754
2016-12-21 13:24:24.330 21735 DEBUG oslo_messaging._drivers.amqpdriver [req-7b291e0c-c5b3-4a8a-b4db-e7cef3150b03 tempest-BaremetalBasicOps-1775111554 tempest-BaremetalBasicOps-1775111554] CALL msg_id: 15898ce761a143c690ea51c6af5d4f23 exchange 'nova' topic 'conductor' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:442
2016-12-21 13:24:24.367 21735 DEBUG nova.compute.resource_tracker [req-f3cfc8fa-df45-4da4-adf2-83688458fa16 - -] Compute_service record updated for ubuntu-xenial-osic-cloud1-s3500-6327285:039bbc98-5123-470c-8e09-74e8f35a1391 _update_available_resource /opt/stack/new/nova/nova/compute/resource_tracker.py:601
2016-12-21 13:24:24.367 21735 DEBUG oslo_concurrency.lockutils [req-f3cfc8fa-df45-4da4-adf2-83688458fa16 - -] Lock "compute_resources" released by "nova.compute.resource_tracker._update_available_resource" :: held 6.935s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282

full log available: http://logs.openstack.org/39/404239/14/check/gate-tempest-dsvm-ironic-ipa-wholedisk-pxe_snmp-tinyipa-ubuntu-xenial-nv/8f98498/logs/screen-n-cpu.txt.gz#_2016-12-21_13_24_24_307

Tags: compute ironic
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/413499

Revision history for this message
Vasyl Saienko (vsaienko) wrote : Re: Slow nodes at the gate leads to ironic tempest timeouts

@Nova team: please have a look at this issue; the request is stuck at nova-compute and never reaches Ironic.

description: updated
Vasyl Saienko (vsaienko)
description: updated
summary: - Slow nodes at the gate leads to ironic tempest timeouts
+ boot server request randomly hanging at n-cpu side, and didn't get to
+ Ironic
Revision history for this message
Jay Pipes (jaypipes) wrote :

@vasienko: Is there a particular time period that this started happening?

Revision history for this message
Jay Pipes (jaypipes) wrote :

It looks like inventory records are being written for the VCPU resource class with a total of 0. This should never happen, so I'll add some verbose logging to a local Nova and see what's up...

Changed in nova:
importance: Undecided → High
assignee: nobody → Jay Pipes (jaypipes)
Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: compute ironic
Revision history for this message
Ruby Loo (rloo) wrote :

@Jay. From http://ci-watch.tintri.com/project?project=ironic&time=7+days, it looks like it started happening Dec 19, e.g. with this patch: https://review.openstack.org/#/c/250478/25.

Changed in ironic:
importance: Undecided → Critical
Revision history for this message
John L. Villalovos (happycamp) wrote :

With Jay's logging patch applied, I see this:
http://logs.openstack.org/26/414426/1/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/af7b923/logs/screen-n-cpu.txt.gz?level=WARNING#_2016-12-23_06_21_16_401

2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager [req-485b5c7b-3eb4-4e36-8fb0-3234acb45957 - -] Error updating resources for node 4e8e9ae3-f417-49c4-a0ab-4d1300e4ef0e.
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager Traceback (most recent call last):
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/manager.py", line 6536, in update_available_resource_for_node
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager rt.update_available_resource(context)
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 526, in update_available_resource
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager self._verify_resources(resources)
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 962, in _verify_resources
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager uuid = self.compute_node.uuid
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager AttributeError: 'NoneType' object has no attribute 'uuid'
2016-12-23 06:21:16.401 29357 ERROR nova.compute.manager

I wonder if the issue is that self.compute_node is None.
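
The traceback above fails because `self.compute_node.uuid` is dereferenced while `compute_node` is still `None`. A minimal toy sketch (not the actual Nova code; the class and guard here are purely illustrative) shows the failure mode and how a defensive check would surface it more clearly:

```python
# Toy stand-ins for the objects involved in the traceback above; these
# are illustrative only, not the real Nova classes.

class ComputeNode:
    def __init__(self, uuid):
        self.uuid = uuid


class ResourceTracker:
    """Sketch of a resource tracker whose compute_node may not exist yet."""

    def __init__(self, compute_node=None):
        # compute_node stays None until the node record is created.
        self.compute_node = compute_node

    def verify_resources(self):
        # The unguarded original did `uuid = self.compute_node.uuid`,
        # which raises AttributeError when compute_node is None.
        # A guard fails fast with a clearer error instead.
        if self.compute_node is None:
            raise RuntimeError('compute_node record not created yet')
        return self.compute_node.uuid


# A tracker with a node record works; one without raises the clearer error.
print(ResourceTracker(ComputeNode('4e8e9ae3')).verify_resources())
```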

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/414579

Revision history for this message
Vladyslav Drok (vdrok) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/414603

Changed in nova:
status: New → In Progress
Changed in nova:
assignee: Jay Pipes (jaypipes) → John L. Villalovos (happycamp)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Vladyslav Drok (<email address hidden>) on branch: master
Review: https://review.openstack.org/414603

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Vladyslav Drok (<email address hidden>) on branch: master
Review: https://review.openstack.org/414579

Changed in nova:
assignee: John L. Villalovos (happycamp) → Jay Pipes (jaypipes)
Changed in nova:
assignee: Jay Pipes (jaypipes) → John L. Villalovos (happycamp)
Revision history for this message
Alex Xu (xuhj) wrote :
Revision history for this message
Vasyl Saienko (vsaienko) wrote :

@Alex: each Ironic node appears in Nova as a hypervisor, and Nova locks the whole compute service while updating resources for a hypervisor. In the log [0] you can see resources being updated for the different Ironic nodes (Nova hypervisors); a short report extracted from the log is in [2]. When we send 0 values to the placement API, it returns a 400 error, and the operation takes nearly 6 seconds to complete. In [2] I've copied the resource update for a single node (lock acquired at 13:24:44.599, released at 13:24:51.281, held 6.681s). So it does indeed look like lock starvation, caused by sending 0 values to the placement API; those values are invalid anyway and the request is discarded at the API (we get a 400). The proposed fix [3] avoids sending 0 values, since they will never be accepted by the API.

[0] http://logs.openstack.org/39/404239/14/check/gate-tempest-dsvm-ironic-ipa-wholedisk-pxe_snmp-tinyipa-ubuntu-xenial-nv/8f98498/logs/screen-n-cpu.txt.gz#_2016-12-21_13_24_24_327
[1] http://paste.openstack.org/show/593421/
[2] http://paste.openstack.org/show/593422/
[3] https://review.openstack.org/#/c/414214/
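
The fix idea described above can be sketched as follows (illustrative only, not the actual Nova code; the function name and payload shape are assumptions based on the commit message): drop resource classes whose total is 0 before PUT'ing inventory to placement, and send an empty inventory dict for fully unavailable nodes, which clears the provider's records instead of triggering a 400 round-trip inside the lock.

```python
# Illustrative sketch of "placement: Do not save 0-valued inventory".
# Not the real Nova implementation; names and payload shape are assumed.

def build_inventory_payload(node_resources):
    """Return the inventory payload to PUT for one Ironic node.

    node_resources maps a resource class (e.g. 'VCPU', 'MEMORY_MB',
    'DISK_GB') to a total. Unavailable nodes report totals of 0.
    """
    inventories = {
        rc: {'total': total}
        for rc, total in node_resources.items()
        if total > 0  # 0-valued inventory is rejected by placement with 400
    }
    # An empty dict is still a valid payload: PUT'ing it clears any
    # existing inventory records for this resource provider.
    return {'inventories': inventories}


# An available node keeps its non-zero inventory; an unavailable node
# (all zeros) produces an empty payload instead of a 400 round-trip.
print(build_inventory_payload({'VCPU': 8, 'MEMORY_MB': 16384, 'DISK_GB': 0}))
print(build_inventory_payload({'VCPU': 0, 'MEMORY_MB': 0, 'DISK_GB': 0}))
```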

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/415403

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/415403
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=be6f34c3388365f93c84671f07af7a1cffb6efd8
Submitter: Jenkins
Branch: master

commit be6f34c3388365f93c84671f07af7a1cffb6efd8
Author: Vasyl Saienko <email address hidden>
Date: Wed Dec 28 10:39:44 2016 +0200

    Disable placement-api by default

    With placement-api enabled, Ironic jobs regressed due to [0]. The
    fix in Nova is stuck at the decision phase [1]. Until Nova fixes the
    issue permanently, disable placement-api in ironic to unblock CI.

    [0] https://bugs.launchpad.net/ironic/+bug/1651678
    [1] https://review.openstack.org/#/c/414214/

    Change-Id: I658ec963761dd4c5111a98e77298a8974553e857
    Related-Bug: #1651678

Changed in nova:
assignee: John L. Villalovos (happycamp) → Jay Pipes (jaypipes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/415523

Changed in nova:
assignee: Jay Pipes (jaypipes) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Jay Pipes (jaypipes)
Changed in ironic:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/414214
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3c217acb9c55d647ca362320d697e80d7cfa5ceb
Submitter: Jenkins
Branch: master

commit 3c217acb9c55d647ca362320d697e80d7cfa5ceb
Author: Jay Pipes <email address hidden>
Date: Thu Dec 22 11:09:15 2016 -0500

    placement: Do not save 0-valued inventory

    Ironic nodes that are not available or operable have 0 values for vcpus,
    memory_mb, and local_gb in the returned dict from the Ironic virt driver's
    get_available_resource() call. Don't try to save these 0 values in the
    placement API inventory records, since the placement REST API will return an
    error. Instead, attempt to delete any inventory records for that Ironic node
    resource provider by PUT'ing an empty set of inventory records to the placement
    API.

    Closes-bug: #1651678

    Change-Id: I10b22606f704abcb970939fb2cd77f026d4d6322

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/416762

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/415523
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=dc673e34b4243ed704e3a4562c5ca73d6b4f4ccd
Submitter: Jenkins
Branch: master

commit dc673e34b4243ed704e3a4562c5ca73d6b4f4ccd
Author: Matt Riedemann <email address hidden>
Date: Wed Dec 28 18:44:09 2016 +0000

    Revert "Disable placement-api by default"

    This reverts commit be6f34c3388365f93c84671f07af7a1cffb6efd8

    Now that openstack/nova commit
    3c217acb9c55d647ca362320d697e80d7cfa5ceb has landed, the issue is
    fixed and we can revert this change.

    Change-Id: I6729706b13d49a277a2f037ea7ef00211d4a4ca9
    Related-Bug: #1651678

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b3

This issue was fixed in the openstack/nova 15.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/416762
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d9b74f230c9c5233f9ccea74a61b051d618061a3
Submitter: Jenkins
Branch: stable/newton

commit d9b74f230c9c5233f9ccea74a61b051d618061a3
Author: Jay Pipes <email address hidden>
Date: Thu Dec 22 11:09:15 2016 -0500

    placement: Do not save 0-valued inventory

    Ironic nodes that are not available or operable have 0 values for vcpus,
    memory_mb, and local_gb in the returned dict from the Ironic virt driver's
    get_available_resource() call. Don't try to save these 0 values in the
    placement API inventory records, since the placement REST API will return an
    error. Instead, attempt to delete any inventory records for that Ironic node
    resource provider by PUT'ing an empty set of inventory records to the placement
    API.

    Closes-bug: #1651678

    Conflicts:
            nova/scheduler/client/report.py
            nova/tests/unit/scheduler/client/test_report.py

    NOTE(mriedem): The test conflict is due to refactoring change
    1870c75c45a99f879ac7dc0ab5e5b30ad82880cb not being in
    stable/newton. The report.py conflict is due to change
    13ba33a57488f5e553dcca7162cc35b58ff0e7f8 coming before this.

    Change-Id: I10b22606f704abcb970939fb2cd77f026d4d6322
    (cherry picked from commit 3c217acb9c55d647ca362320d697e80d7cfa5ceb)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.5

This issue was fixed in the openstack/nova 14.0.5 release.
