Instances go to error state with RPC timeout

Bug #973609 reported by David Lawson
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Invalid
Undecided
Unassigned

Bug Description

We've been seeing a trend of people being unable to start instances, investigation yields the following tracebacks in the nova-compute.log on the relevant compute nodes:

2012-04-04 14:21:10 ERROR nova.compute.manager [-] [instance: d86678fb-2a29-4b4e-84b4-c1cb2d81a6e2] Instance failed network setup
(nova.compute.manager): TRACE: Traceback (most recent call last):
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 559, in _allocate_network
(nova.compute.manager): TRACE: requested_networks=requested_networks)
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/api.py", line 170, in allocate_for_instance
(nova.compute.manager): TRACE: 'args': args})
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/__init__.py", line 68, in call
(nova.compute.manager): TRACE: return _get_impl().call(context, topic, msg, timeout)
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 674, in call
(nova.compute.manager): TRACE: return rpc_amqp.call(context, topic, msg, timeout, Connection.pool)
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 338, in call
(nova.compute.manager): TRACE: rv = list(rv)
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 299, in __iter__
(nova.compute.manager): TRACE: self._iterator.next()
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 572, in iterconsume
(nova.compute.manager): TRACE: yield self.ensure(_error_callback, _consume)
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 503, in ensure
(nova.compute.manager): TRACE: error_callback(e)
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 553, in _error_callback
(nova.compute.manager): TRACE: raise rpc_common.Timeout()
(nova.compute.manager): TRACE: Timeout: Timeout while waiting on RPC response.
(nova.compute.manager): TRACE:
2012-04-04 14:21:10 ERROR nova.rpc.amqp [-] Exception during message handling
(nova.rpc.amqp): TRACE: Traceback (most recent call last):
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 252, in _process_data
(nova.rpc.amqp): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 114, in wrapped
(nova.rpc.amqp): TRACE: return f(*args, **kw)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 177, in decorated_function
(nova.rpc.amqp): TRACE: sys.exc_info())
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
(nova.rpc.amqp): TRACE: self.gen.next()
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 171, in decorated_function
(nova.rpc.amqp): TRACE: return function(self, context, instance_uuid, *args, **kwargs)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 648, in run_instance
(nova.rpc.amqp): TRACE: self._run_instance(context, instance_uuid, **kwargs)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 451, in _run_instance
(nova.rpc.amqp): TRACE: self._set_instance_error_state(context, instance_uuid)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
(nova.rpc.amqp): TRACE: self.gen.next()
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 424, in _run_instance
(nova.rpc.amqp): TRACE: requested_networks)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 559, in _allocate_network
(nova.rpc.amqp): TRACE: requested_networks=requested_networks)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/api.py", line 170, in allocate_for_instance
(nova.rpc.amqp): TRACE: 'args': args})
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/__init__.py", line 68, in call
(nova.rpc.amqp): TRACE: return _get_impl().call(context, topic, msg, timeout)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 674, in call
(nova.rpc.amqp): TRACE: return rpc_amqp.call(context, topic, msg, timeout, Connection.pool)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 338, in call
(nova.rpc.amqp): TRACE: rv = list(rv)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 299, in __iter__
(nova.rpc.amqp): TRACE: self._iterator.next()
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 572, in iterconsume
(nova.rpc.amqp): TRACE: yield self.ensure(_error_callback, _consume)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 503, in ensure
(nova.rpc.amqp): TRACE: error_callback(e)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 553, in _error_callback
(nova.rpc.amqp): TRACE: raise rpc_common.Timeout()
(nova.rpc.amqp): TRACE: Timeout: Timeout while waiting on RPC response.
(nova.rpc.amqp): TRACE:

Then the instance is destroyed because it couldn't come up cleanly. Restarting nova-network on the network manager, then nova-compute on the compute node seems to fix this for a time, but it recurs after a few hours. Is there further debugging information we can provide? I haven't found log messages that appear related in nova-network or the rabbitmq logs.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

This should only happen if network is blocking for a long time or the network node is not connected for some reason. Are you running only one nova-network? If so it could be getting overloaded with requests. You might try using multi_host mode and running one nova-network on each compute host.
you might try:
rabbitmqctl list_queues

it should show you how many messages are waiting in the queue. If the number is greater than 0 either the worker is blocking doing some work or it can't keep up with the messages.

FYI you can also change the timeout using:
rpc_response_timeout=<timeout_seconds>

so you could set it much higher and see if that helps.

Changed in nova:
status: New → Incomplete
Revision history for this message
Vish Ishaya (vishvananda) wrote :

Also, if you are using one network node, it may be getting overloaded with lease and release requests. You should probably extend the lease time to something more reasonable than 2 minutes via:

dhcp_lease_time=86400 # set the lease time to 24 hours
fixed_ip_disassociate_timeout=172800 # timeout expired leases after 48 hours

Note if you are spawning a lot of instances you will definitely want:

force_dhcp_release=true

or you will have to wait 48 hours to get your ips back and you risk using up all of them.

Revision history for this message
Anthony Young (sleepsonthefloor) wrote :

I have no reproduce steps, but I did bump into this yesterday. There was an issue with an unreleased iptables lock that froze nova-network.

Revision history for this message
David Lawson (deej) wrote :

We are using one network node, I'll see if I can provoke this issue and check the network queue to see if events are stacking up in there. Likewise, we'll tweak some of the lease timing and see if that helps. I'm puzzled by why it appears to require restarting nova-network to fix this though, presumably eventually the queue would drain of events and things would return to normal.

Revision history for this message
Vish Ishaya (vishvananda) wrote : Re: [Bug 973609] Re: Instances go to error state with RPC timeout

Hmm, good point. Perhaps anthony is right and there is a lockfile somewhere. Although in that case I would expect to see a message in the log "attempting to grab lock..." with no further output from that greenthread.

Might help if you could tar up your network log from the failed state and attach it. Is nova-network running by itself? Or are there other workers on the box?

Vish

Revision history for this message
David Lawson (deej) wrote :

I'm not seeing anything stacking up in the network message queue, but I did find this in the nova-network.log:

2012-04-04 19:58:55 DEBUG nova.utils [-] Attempting to grab semaphore "iptables" for method "apply"... from (pid=5853) inner /usr/lib/python2.7/dist-packages/nova/utils.py:891
2012-04-04 19:58:55 DEBUG nova.network.linux_net [-] IPTablesManager.apply completed with success from (pid=5853) apply /usr/lib/python2.7/dist-packages/nova/network/linux_net.py:340
2012-04-04 19:58:55 ERROR nova.rpc.amqp [-] Exception during message handling
(nova.rpc.amqp): TRACE: Traceback (most recent call last):
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 252, in _process_data
(nova.rpc.amqp): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 263, in wrapped
(nova.rpc.amqp): TRACE: return func(self, context, *args, **kwargs)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 469, in associate_floating_ip
(nova.rpc.amqp): TRACE: fixed_address, interface)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 490, in _associate_floating_ip
(nova.rpc.amqp): TRACE: interface)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/l3.py", line 99, in add_floating_ip
(nova.rpc.amqp): TRACE: linux_net.ensure_floating_forward(floating_ip, fixed_ip)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/linux_net.py", line 507, in ensure_floating_forward
(nova.rpc.amqp): TRACE: iptables_manager.apply()
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 907, in inner
(nova.rpc.amqp): TRACE: retval = f(*args, **kwargs)
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/lockfile.py", line 230, in __exit__
(nova.rpc.amqp): TRACE: self.release()
(nova.rpc.amqp): TRACE: File "/usr/lib/python2.7/dist-packages/lockfile.py", line 273, in release
(nova.rpc.amqp): TRACE: raise NotMyLock
(nova.rpc.amqp): TRACE: NotMyLock
(nova.rpc.amqp): TRACE:
2012-04-04 19:58:55 ERROR nova.rpc.amqp [-] Returning exception to caller

I'll check the log to see if anything needs redacting and tar it up for you.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

Are you running off of reasonably current code? Haven't seen NotMyLock since bug 956313 was fixed.

Revision history for this message
David Lawson (deej) wrote :

nova-network 2012.1~rc1~20120309.13261-0ubuntu1

It LOOKS like it should have the fix for bug 956313 in it, but I'm honestly not sure, we'll be upgrading to the latest on Monday I believe, though if you think the latest code will fix this issue we may expedite it. Do you still want a log file?

Revision history for this message
Vish Ishaya (vishvananda) wrote :

Should be pretty easy to determine. Check utils.py and see if it contains the GreenFileLock class.

Revision history for this message
David Lawson (deej) wrote :

I'm assuming you meant /usr/lib/python2.7/dist-packages/nova/utils.py, if so it doesn't have a GreenFileLock class. I assume that means we need to update and this will be resolved.

Revision history for this message
Vish Ishaya (vishvananda) wrote :
Download full text (6.8 KiB)

Yes. Should be resolved with the update then. That is good news.
On Apr 5, 2012 6:35 PM, "David Lawson" <email address hidden> wrote:

> I'm assuming you meant /usr/lib/python2.7/dist-packages/nova/utils.py,
> if so it doesn't have a GreenFileLock class. I assume that means we
> need to update and this will be resolved.
>
> --
> You received this bug notification because you are subscribed to
> OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/973609
>
> Title:
> Instances go to error state with RPC timeout
>
> Status in OpenStack Compute (Nova):
> Incomplete
>
> Bug description:
> We've been seeing a trend of people being unable to start instances,
> investigation yields the following tracebacks in the nova-compute.log
> on the relevant compute nodes:
>
> 2012-04-04 14:21:10 ERROR nova.compute.manager [-] [instance:
> d86678fb-2a29-4b4e-84b4-c1cb2d81a6e2] Instance failed network setup
> (nova.compute.manager): TRACE: Traceback (most recent call last):
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 559, in
> _allocate_network
> (nova.compute.manager): TRACE: requested_networks=requested_networks)
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/network/api.py", line 170, in
> allocate_for_instance
> (nova.compute.manager): TRACE: 'args': args})
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/__init__.py", line 68, in call
> (nova.compute.manager): TRACE: return _get_impl().call(context,
> topic, msg, timeout)
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 674, in call
> (nova.compute.manager): TRACE: return rpc_amqp.call(context, topic,
> msg, timeout, Connection.pool)
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 338, in call
> (nova.compute.manager): TRACE: rv = list(rv)
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 299, in __iter__
> (nova.compute.manager): TRACE: self._iterator.next()
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 572, in
> iterconsume
> (nova.compute.manager): TRACE: yield self.ensure(_error_callback,
> _consume)
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 503, in
> ensure
> (nova.compute.manager): TRACE: error_callback(e)
> (nova.compute.manager): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 553, in
> _error_callback
> (nova.compute.manager): TRACE: raise rpc_common.Timeout()
> (nova.compute.manager): TRACE: Timeout: Timeout while waiting on RPC
> response.
> (nova.compute.manager): TRACE:
> 2012-04-04 14:21:10 ERROR nova.rpc.amqp [-] Exception during message
> handling
> (nova.rpc.amqp): TRACE: Traceback (most recent call last):
> (nova.rpc.amqp): TRACE: File
> "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 252, in
> _process_data
> (nova.rpc.amqp): TRACE: rval = node_func(cont...

Read more...

David Lawson (deej)
Changed in nova:
status: Incomplete → Fix Released
Thierry Carrez (ttx)
Changed in nova:
status: Fix Released → Invalid
Revision history for this message
Israel Calvete (icalvete) wrote :
Download full text (3.4 KiB)

I have same issue in a test deployment with only one node.

root@ubuntu01:~# nova-manage service list
Binary Host Zone Status State Updated_At
nova-scheduler ubuntu01 internal enabled :-) 2013-10-06 16:43:28
nova-conductor ubuntu01 internal enabled :-) 2013-10-06 16:43:28
nova-network ubuntu01 internal enabled :-) 2013-10-06 16:43:24
nova-cert ubuntu01 internal enabled :-) 2013-10-06 16:43:22
nova-consoleauth ubuntu01 internal enabled :-) 2013-10-06 16:43:28
nova-compute ubuntu01 nova enabled :-) 2013-10-06 16:43:29

root@ubuntu01:~# nova network-show dc6cc0b2-bed5-45af-9dbe-6db52dd9579e
+---------------------+--------------------------------------+
| Property | Value |
+---------------------+--------------------------------------+
| bridge | br100 |
| vpn_public_port | None |
| dhcp_start | 192.168.210.2 |
| bridge_interface | eth0 |
| updated_at | 2013-10-06T15:42:47.000000 |
| id | dc6cc0b2-bed5-45af-9dbe-6db52dd9579e |
| cidr_v6 | None |
| deleted_at | None |
| gateway | 192.168.210.1 |
| rxtx_base | None |
| label | private |
| priority | None |
| project_id | None |
| vpn_private_address | None |
| deleted | 0 |
| vlan | None |
| broadcast | 192.168.210.255 |
| netmask | 255.255.255.0 |
| injected | False |
| cidr | 192.168.210.0/24 |
| vpn_public_address | None |
| multi_host | False |
| dns2 | None |
| created_at | 2013-10-06T15:41:01.000000 |
| host | ubuntu01 |
| gateway_v6 | None |
| netmask_v6 | None |
| dns1 | 8.8.4.4 |
+---------------------+--------------------------------------+

root@ubuntu01:~# ifconfig br100
br100 Link encap:Ethernet direcciónHW 00:0c:29:85:46:af
          Direc. inet:192.168.210.1 Difus.:192.168.210.255 Másc:255.255.255.0
          Dirección inet6: fe80::a007:67ff:fe5e:90e3/64 Alcance:Enlace
          ACTIVO ...

Read more...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.