Ironic failing to gracefully handle: ipmi error "insufficient resources for session"

Bug #1431929 reported by Julia Kreger
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
Medium
Julia Kreger

Bug Description

While testing with current ironic master branch, Upon attempting to deploy, utilizing ironic to deploy a number of nodes (ten to thirty five nodes), on a high density hardware chassis (HP Moonshot) that utilizes a dual bridge IPMI bus where the only difference between the nodes is the MAC address and the bridging target number. We observed a number of nodes that failed to deploy due to ipmitool returning "Error in open session response message : insufficient resources for session"

Apparent root cause:

_exec_ipmitool does not have retry logic logic wrapped around the execution of the script and instead the parameters are passed to the command line of ipmitool. In essence, ipmitool and Ironic presently fail hard when the BMC says "give me a minute".
---

How to reproduce:

Have ironic running in a stand-alone environment with no other openstack services.
Attempt to deploy, in-mass, to a number of nodes on that utilize dual ipmi_bridging. This is being orchustrated by python code utilizing the python-ironicclient library setting the node instance_info, and then setting the node to an active state, in series, however not waiting for any of the operations or nodes to reach an active state.

Attempting to tune the ipmi settings, specifically increasing the min_command_interval resulted in the deployments to fail to obtain node locks as tasks were taking longer to complete. Increasing retries did not appear to remedy this issue.

---

Ironic conductor log output for one of the nodes that failed to deploy in this manor:
2015-03-12 19:20:30.699 4307 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ipmitool -I lanplus -H 10.0.1.5 -L ADMINISTRATOR -U administrator -B 0 -T 160 -b 7 -t 114 -R 12 -N 5 -f /tmp/tmpgaadM
r chassis bootdev disk options=persistent execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:199
2015-03-12 19:20:30.766 4307 DEBUG oslo_concurrency.processutils [-] CMD "ipmitool -I lanplus -H 10.0.1.5 -L ADMINISTRATOR -U administrator -B 0 -T 158 -b 7 -t 114 -R 12 -N 5 -f /tmp/tmp3uQDcS chassis bootdev dis
k options=persistent" returned: 0 in 0.568s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:225
2015-03-12 19:20:30.769 4307 DEBUG ironic.common.utils [-] Execution completed, command line is "ipmitool -I lanplus -H 10.0.1.5 -L ADMINISTRATOR -U administrator -B 0 -T 158 -b 7 -t 114 -R 12 -N 5 -f /tmp/tmp3uQ
DcS chassis bootdev disk options=persistent" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:83
2015-03-12 19:20:30.772 4307 DEBUG ironic.common.utils [-] Command stdout is: "Set Boot Device to disk
" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:84
2015-03-12 19:20:30.775 4307 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:85
2015-03-12 19:20:30.885 4307 DEBUG oslo_concurrency.processutils [-] CMD "ipmitool -I lanplus -H 10.0.1.5 -L ADMINISTRATOR -U administrator -B 0 -T 160 -b 7 -t 114 -R 12 -N 5 -f /tmp/tmpgaadMr chassis bootdev dis
k options=persistent" returned: 1 in 0.185s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:225
2015-03-12 19:20:30.888 4307 DEBUG oslo_concurrency.processutils [-] u'ipmitool -I lanplus -H 10.0.1.5 -L ADMINISTRATOR -U administrator -B 0 -T 160 -b 7 -t 114 -R 12 -N 5 -f /tmp/tmpgaadMr chassis bootdev disk o
ptions=persistent' failed. Not Retrying. execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:258
2015-03-12 19:20:30.891 4307 WARNING ironic.drivers.modules.ipmitool [-] IPMI set boot device failed for node a8cb6624-0d9f-c882-affc-046ebb96ec16 when executing "ipmitool chassis bootdev disk options=persistent". Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 10.0.1.5 -L ADMINISTRATOR -U administrator -B 0 -T 160 -b 7 -t 114 -R 12 -N 5 -f /tmp/tmpgaadMr chassis bootdev disk options=persistent
Exit code: 1
Stdout: u''
Stderr: u'Error in open session response message : insufficient resources for session\n\nError: Unable to establish IPMI v2 / RMCP+ session\nError setting Chassis Boot Parameter 0\nError setting Chassis Boot Parameter 4\n'
2015-03-12 19:20:30.897 4307 ERROR ironic.drivers.modules.agent_base_vendor [-] Asynchronous exception for node a8cb6624-0d9f-c882-affc-046ebb96ec16: Node failed to move to active state. exception: IPMI call failed: chassis bootdev disk options=persistent.
ed: chassis bootdev disk options=persistent.
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor Traceback (most recent call last):
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor File "/usr/local/lib/python2.7/dist-packages/ironic/drivers/modules/agent_base_vendor.py", line 169, in heartbeat
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor self.reboot_to_instance(task, **kwargs)
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor File "/usr/local/lib/python2.7/dist-packages/ironic/drivers/modules/agent.py", line 378, in reboot_to_instance
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor manager_utils.node_set_boot_device(task, 'disk', persistent=True)
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor File "/usr/local/lib/python2.7/dist-packages/ironic/conductor/task_manager.py", line 128, in wrapper
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor return f(*args, **kwargs)
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor File "/usr/local/lib/python2.7/dist-packages/ironic/conductor/utils.py", line 44, in node_set_boot_device
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor persistent=persistent)
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor File "/usr/local/lib/python2.7/dist-packages/ironic/conductor/task_manager.py", line 128, in wrapper
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor return f(*args, **kwargs)
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor File "/usr/local/lib/python2.7/dist-packages/ironic/drivers/modules/ipmitool.py", line 745, in set_boot_device
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor raise exception.IPMIFailure(cmd=cmd)
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor IPMIFailure: IPMI call failed: chassis bootdev disk options=persistent.
2015-03-12 19:20:30.897 4307 TRACE ironic.drivers.modules.agent_base_vendor

ironic node-list:
+--------------------------------------+---------------+-------------+--------------------+-------------+
| UUID | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+---------------+-------------+--------------------+-------------+
| a8cb6624-0d9f-c882-affc-046ebb96ec10 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec11 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec12 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec13 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec14 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec15 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec16 | None | power off | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec17 | None | power off | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec18 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec19 | None | power off | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec20 | None | power off | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec21 | None | power on | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec22 | None | power off | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec23 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec24 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec25 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec26 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec27 | None | power off | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec28 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec29 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec30 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec31 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec32 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec33 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec34 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec35 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec36 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec37 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec38 | None | power off | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec39 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec40 | None | power on | deploy failed | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec41 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec42 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec43 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec44 | None | power on | active | False |
| a8cb6624-0d9f-c882-affc-046ebb96ec45 | None | power on | active | False |
+--------------------------------------+---------------+-------------+--------------------+-------------+

ironic node-show nodes 16, 17, 20:
+------------------------+--------------------------------------------------------------------------+
| Property | Value |
+------------------------+--------------------------------------------------------------------------+
| instance_uuid | None |
| target_power_state | None |
| properties | {u'memory_mb': u'8160', u'cpu_arch': u'x86_64', u'local_gb': u'450', |
| | u'cpus': u'3'} |
| maintenance | False |
| driver_info | {u'ipmi_transit_address': u'160', u'ipmi_target_channel': u'7', |
| | u'ipmi_transit_channel': u'0', u'deploy_kernel': |
| | u'http://10.99.1.1:8080/coreos_production_pxe.vmlinuz', |
| | u'ipmi_username': u'administrator', u'ipmi_address': u'10.0.1.5', |
| | u'ipmi_target_address': u'114', u'ipmi_password': u'******', |
| | u'ipmi_bridging': u'dual', u'deploy_ramdisk': u'http://10.99.1.1:8080 |
| | /coreos_production_pxe_image-oem.cpio.gz'} |
| extra | {} |
| last_error | Asynchronous exception for node a8cb6624-0d9f-c882-affc-046ebb96ec16: |
| | Node failed to move to active state. exception: IPMI call failed: |
| | chassis bootdev disk options=persistent. |
| created_at | 2015-03-03T23:17:35+00:00 |
| target_provision_state | active |
| driver | agent_ipmitool |
| updated_at | 2015-03-12T19:54:52+00:00 |
| maintenance_reason | None |
| instance_info | {u'root_gb': 10, u'image_source': |
| | u'http://10.99.1.1:8080/expemental.qcow2', u'image_checksum': |
| | u'7f76ce7cbff05f0254fd78d7bcaf3b45', u'image_url': |
| | u'http://10.99.1.1:8080/expemental.qcow2', u'image_disk_format': u'raw', |
| | u'configdrive': u'http://10.99.1.1:8080/configdrive-a8cb6624-0d9f-c882 |
| | -affc-046ebb96ec16.iso.gz'} |
| driver_internal_info | |
| chassis_uuid | |
| provision_state | deploy failed |
| reservation | None |
| power_state | power off |
| console_enabled | False |
| uuid | a8cb6624-0d9f-c882-affc-046ebb96ec16 |
+------------------------+--------------------------------------------------------------------------+
+------------------------+--------------------------------------------------------------------------+
| Property | Value |
+------------------------+--------------------------------------------------------------------------+
| instance_uuid | None |
| target_power_state | None |
| properties | {u'memory_mb': u'8160', u'cpu_arch': u'x86_64', u'local_gb': u'450', |
| | u'cpus': u'3'} |
| maintenance | False |
| driver_info | {u'ipmi_transit_address': u'162', u'ipmi_target_channel': u'7', |
| | u'ipmi_transit_channel': u'0', u'deploy_kernel': |
| | u'http://10.99.1.1:8080/coreos_production_pxe.vmlinuz', |
| | u'ipmi_username': u'administrator', u'ipmi_address': u'10.0.1.5', |
| | u'ipmi_target_address': u'114', u'ipmi_password': u'******', |
| | u'ipmi_bridging': u'dual', u'deploy_ramdisk': u'http://10.99.1.1:8080 |
| | /coreos_production_pxe_image-oem.cpio.gz'} |
| extra | {} |
| last_error | Failed to deploy. Error: IPMI call failed: raw 0x00 0x08 0x03 0x08. |
| created_at | 2015-03-03T23:17:41+00:00 |
| target_provision_state | active |
| driver | agent_ipmitool |
| updated_at | 2015-03-12T20:07:00+00:00 |
| maintenance_reason | None |
| instance_info | {u'root_gb': 10, u'image_source': |
| | u'http://10.99.1.1:8080/expemental.qcow2', u'image_checksum': |
| | u'7f76ce7cbff05f0254fd78d7bcaf3b45', u'image_url': |
| | u'http://10.99.1.1:8080/expemental.qcow2', u'image_disk_format': u'raw', |
| | u'configdrive': u'http://10.99.1.1:8080/configdrive-a8cb6624-0d9f-c882 |
| | -affc-046ebb96ec17.iso.gz'} |
| driver_internal_info | |
| chassis_uuid | |
| provision_state | deploy failed |
| reservation | None |
| power_state | power off |
| console_enabled | False |
| uuid | a8cb6624-0d9f-c882-affc-046ebb96ec17 |
+------------------------+--------------------------------------------------------------------------+
+------------------------+--------------------------------------------------------------------------+
| Property | Value |
+------------------------+--------------------------------------------------------------------------+
| instance_uuid | None |
| target_power_state | None |
| properties | {u'memory_mb': u'8160', u'cpu_arch': u'x86_64', u'local_gb': u'450', |
| | u'cpus': u'3'} |
| maintenance | False |
| driver_info | {u'ipmi_transit_address': u'168', u'ipmi_target_channel': u'7', |
| | u'ipmi_transit_channel': u'0', u'deploy_kernel': |
| | u'http://10.99.1.1:8080/coreos_production_pxe.vmlinuz', |
| | u'ipmi_username': u'administrator', u'ipmi_address': u'10.0.1.5', |
| | u'ipmi_target_address': u'114', u'ipmi_password': u'******', |
| | u'ipmi_bridging': u'dual', u'deploy_ramdisk': u'http://10.99.1.1:8080 |
| | /coreos_production_pxe_image-oem.cpio.gz'} |
| extra | {} |
| last_error | Asynchronous exception for node a8cb6624-0d9f-c882-affc-046ebb96ec20: |
| | Node failed to move to active state. exception: IPMI call failed: raw |
| | 0x00 0x08 0x03 0x08. |
| created_at | 2015-03-12T19:00:06+00:00 |
| target_provision_state | active |
| driver | agent_ipmitool |
| updated_at | 2015-03-12T20:07:15+00:00 |
| maintenance_reason | None |
| instance_info | {u'root_gb': 10, u'image_source': |
| | u'http://10.99.1.1:8080/expemental.qcow2', u'image_checksum': |
| | u'7f76ce7cbff05f0254fd78d7bcaf3b45', u'image_url': |
| | u'http://10.99.1.1:8080/expemental.qcow2', u'image_disk_format': u'raw', |
| | u'configdrive': u'http://10.99.1.1:8080/configdrive-a8cb6624-0d9f-c882 |
| | -affc-046ebb96ec20.iso.gz'} |
| driver_internal_info | |
| chassis_uuid | |
| provision_state | deploy failed |
| reservation | None |
| power_state | power off |
| console_enabled | False |
| uuid | a8cb6624-0d9f-c882-affc-046ebb96ec20 |
+------------------------+--------------------------------------------------------------------------+

Tags: ipmi
description: updated
tags: added: ipmi
Changed in ironic:
assignee: nobody → Julia Kreger (juliaashleykreger)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/168120

Changed in ironic:
status: New → In Progress
aeva black (tenbrae)
Changed in ironic:
importance: Undecided → Medium
milestone: none → kilo-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/168120
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=59ccf51c179a92a1e9541fa54715ace8b110de91
Submitter: Jenkins
Branch: master

commit 59ccf51c179a92a1e9541fa54715ace8b110de91
Author: Julia Kreger <email address hidden>
Date: Thu Mar 26 15:17:41 2015 -0400

    Add retry logic to _exec_ipmitool

    When using bridged IPMI buses, the possibility exists for the BMC
    to return messages indicating that it is too busy to service the
    request at that moment. This results in ipmitool returning an
    error to the user instead of retrying the operation.

    As a result of the error being returned, the current operation
    being attempted by ironic is failed and is never retried as the
    present retry configuration is purely a pass-through to ipmitool.

    Added logic to allow for the identification of errors returned by
    ipmitool that can be retried coupled with logic to retry the failed
    operation until the configured retry window has passed.

    Included three tests which are geared to exercise the retry logic
    and ensure that the retry logic works as expected, including that
    the retry operations are only performed a specific number of times
    based on the conditions.

    Updated one of the pre-existing tests that tested _exec_ipmitool
    act of returning an exception to ensure that the retry logic is
    executed only once.

    Change-Id: I698a8ff1cb70d7758ed05e59b858d9e40708dd05
    Closes-Bug: 1431929

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: kilo-rc1 → 2015.1.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.