Quantum service does not restart after reboot

Bug #1073999 reported by Gary Kotton
This bug affects 2 people
Affects                    Status         Importance   Assigned to    Milestone
Cinder                     Invalid        Undecided    Unassigned
  Folsom                   Fix Released   Critical     Eric Harney
OpenStack Compute (nova)   Invalid        Undecided    Unassigned
neutron                    Fix Released   Critical     Gary Kotton
  Folsom                   Fix Released   Critical     Gary Kotton
oslo-incubator             Fix Released   Critical     Gary Kotton
  Folsom                   Fix Committed  Undecided    Unassigned
  Grizzly                  Fix Released   Critical     Gary Kotton
quantum (Ubuntu)           Fix Released   Undecided    Unassigned
  Quantal                  Fix Released   Undecided    Unassigned

Bug Description

Hi,
If the Quantum service starts before QPID, it does not listen on port 9696. The reason is that the RPC connection setup hangs in the call self.connection.open() in the reconnect method in impl_qpid.py.
Even after the qpidd service starts, Quantum does not recover; it is still waiting to return from this call.
Thanks
Gary
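
One way to confirm the symptom is to probe the API port directly. The sketch below is a minimal illustration and not part of the original report: is_listening is a hypothetical helper, and the host 127.0.0.1 is an assumption. Port 9696 is the Quantum port named in the description.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


# Example (host is an assumption; 9696 is the port from the report):
#     is_listening("127.0.0.1", 9696)
```

If the call returns False while the quantum-server process is running, that would be consistent with the service being stuck before it binds its API socket.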

Gary Kotton (garyk)
Changed in quantum:
importance: Undecided → Critical
Gary Kotton (garyk)
tags: added: folsom-backport-potential
Gary Kotton (garyk) wrote :

The problem is also reproducible with nova-network.
Simple reproduction:
 - shut down qpidd
 - restart nova-network

Russell Bryant (russellb) wrote :

I just tried to reproduce this using nova. Specifically, I did ...

1) Get everything up and running with devstack using qpid
2) Stop qpidd
3) Stop nova-network
4) Start nova-network (and observe it hang while trying to connect to qpidd)
5) Start qpidd
6) Observe nova-network connect to qpidd within 20 seconds or so.

This behavior was consistently reproducible for me.

Gary Kotton (garyk) wrote :

Hi Russell,
Thanks for taking a look. I did the following:

1. Installed devstack on fedora
2. Stopped QPID. nova-network raised an exception (see [1] below; this is a separate issue, and I would guess that one would want to catch the exception and retry the connection).
3. When QPID restarts, nova-network manages to reconnect (with an exception, and then terminates).

My earlier nova-network example used packages rather than devstack. Any idea why the behaviour differs?

I will take a look at devstack with Quantum service and post my findings.

Thanks
Gary

[1] Trace of the nova-network crash when qpid terminates

2012-11-03 18:54:41 INFO nova.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on localhost:5672
2012-11-03 18:54:41 DEBUG nova.service [-] Creating Consumer connection for Service network from (pid=7002) start /opt/stack/nova/nova/service.py:404
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 97, in wait
    readers.get(fileno, noop).cb(fileno)
  File "/usr/lib/python2.7/site-packages/eventlet/green/select.py", line 48, in on_read
    current.switch(([original], [], []))
  File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 192, in main
    result = function(*args, **kwargs)
  File "/opt/stack/nova/nova/service.py", line 123, in run_server
    server.start()
  File "/opt/stack/nova/nova/service.py", line 412, in start
    self.conn.create_consumer(node_topic, rpc_dispatcher, fanout=False)
  File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 136, in create_consumer
    self.connection.create_consumer(topic, proxy, fanout)
  File "/opt/stack/nova/nova/openstack/common/rpc/impl_qpid.py", line 526, in create_consumer
    consumer = TopicConsumer(self.conf, self.session, topic, proxy_cb)
  File "/opt/stack/nova/nova/openstack/common/rpc/impl_qpid.py", line 187, in __init__
    {}, name or topic, {})
  File "/opt/stack/nova/nova/openstack/common/rpc/impl_qpid.py", line 130, in __init__
    self.reconnect(session)
  File "/opt/stack/nova/nova/openstack/common/rpc/impl_qpid.py", line 135, in reconnect
    self.receiver = session.receiver(self.address)
  File "<string>", line 6, in receiver
  File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 619, in receiver
    raise e
MalformedAddress: unrecognized characters line:1,18: openstack/network.(none) ; {"node": {"x-declare": {"auto-delete": true, "durable": true}, "type": "topic"}, "create": "always", "link": {"x-declare": {"auto-delete": true, "exclusive": false, "durable": false}, "durable": true, "name": "network.(none)"}}
Removing descriptor: 9
2012-11-03 18:54:41 CRITICAL nova [-] unrecognized characters line:1,18: openstack/network.(none) ; {"node": {"x-declare": {"auto-delete": true, "durable": true}, "type": "topic"}, "create": "always", "link": {"x-declare": {"auto-delete": true, "exclusive": false, "durable": false}, "durable": true, "name": "network.(none)"}}
2012-11-03 18:54:41 TRACE nova Traceback (most recent call last):
2012-11-03 18:54:41 TRACE nova File "/opt/stack/nova/bin/nova-network", line 50, in <module>
2012-11-03 18:54:41 TRACE nova ...
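
The position reported in the error (line:1,18) can be checked against the address string from the trace. The sketch below is illustrative only: the allowed-character set is an assumption chosen for this example and is not the actual qpid address grammar, and first_unrecognized is a hypothetical helper.

```python
import string

# Assumption for illustration: treat letters, digits and a few separators
# as acceptable in the address name. This is NOT the qpid grammar.
ALLOWED = set(string.ascii_letters + string.digits + "/._-")


def first_unrecognized(name):
    """Return (index, char) of the first character outside ALLOWED, or None."""
    for index, char in enumerate(name):
        if char not in ALLOWED:
            return index, char
    return None


# Address name taken from the trace above.
print(first_unrecognized("openstack/network.(none)"))  # (18, '(')
```

Offset 18 is the opening parenthesis of "(none)", the suffix that sits where a host name would normally follow the topic, which suggests the host name was not set on this machine.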

Gary Kotton (garyk) wrote :

Hi,
When doing the same test with devstack the quantum service also recovers. I will go back to the packaging version and continue debugging. Any ideas?
Thanks
Gary

Gary Kotton (garyk) wrote :

Hi,
I have done a little extra investigation. In the devstack scenario described above, not all of the services recover when QPID is started.
Thanks
Gary

Gary Kotton (garyk) wrote :

Hi,
When I remove the call to eventlet.monkey_patch(), it works! This leaves us between a rock and a hard place, as I understand that this is required for the RPC. Correct?
Thanks
Gary

Gary Kotton (garyk) wrote :

Hi,
The problem is very easily reproducible. Please use the following script: with the eventlet.monkey_patch() call commented out, the open function will try to reconnect; with the call enabled, the function will hang.
Thanks
Gary

#!/usr/bin/env python

#import eventlet
#eventlet.monkey_patch()

import os
import time

from qpid.messaging import endpoints

print "QPID Test!"

session = None
consumers = {}
consumer_thread = None

default_params = dict(hostname='localhost',
                      port=5672,
                      username='',
                      password='')

params = {}
for key in default_params.keys():
    params.setdefault(key, default_params[key])

broker = params['hostname'] + ":" + str(params['port'])
# Create the connection - this does not open the connection
print "======> broker %s" % broker
connection = endpoints.Connection(broker)

# Check if flags are set and if so set them for the connection
# before we call open
connection.username = params['username']
connection.password = params['password']
connection.sasl_mechanisms = ''
connection.reconnect = True
connection.heartbeat = 60
connection.protocol = 'tcp'
connection.tcp_nodelay = True

while True:
    try:
        connection.open()
    except endpoints.exceptions.ConnectionError, e:
        print 'Unable to connect to AMQP server: %s' % e
        time.sleep(1)
    else:
        break

print 'Connected to AMQP server on %s' % broker
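
Because the hang happens inside open() itself, the retry loop in the script never gets a chance to run. One way to observe such a hang from the outside is to run the blocking call under a deadline. The sketch below is a generic illustration and not the fix that was proposed for this bug: call_with_deadline and fake_open_that_hangs are hypothetical names, and the sleep stands in for a blocked connection.open().

```python
import threading
import time


def call_with_deadline(func, timeout):
    """Run func() in a daemon thread; return True if it finished in time."""
    worker = threading.Thread(target=func)
    worker.daemon = True  # do not keep the process alive if func never returns
    worker.start()
    worker.join(timeout)
    return not worker.is_alive()


def fake_open_that_hangs():
    # Stand-in for connection.open(); sleeps to simulate a blocked call.
    time.sleep(2)


print(call_with_deadline(fake_open_that_hangs, timeout=0.2))  # False
print(call_with_deadline(lambda: None, timeout=0.2))  # True
```

Note that eventlet.monkey_patch() patches the threading machinery as well by default, so this helper is intended for an unpatched process or an external watchdog; its behaviour inside a monkey-patched service may differ.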

Russell Bryant (russellb) wrote :

We certainly cannot remove eventlet.monkey_patch(). All of the nova services are monkey patched, including nova-network which was working fine.

Gary Kotton (garyk) wrote :

Hi,
A proposed fix can be seen at https://review.openstack.org/#/c/15558/
Thanks
Gary

Russell Bryant (russellb) wrote :
Gary Kotton (garyk) wrote :

The fix is currently under review in common - https://review.openstack.org/#/c/15663/

Changed in nova:
assignee: nobody → Gary Kotton (garyk)
Changed in quantum:
assignee: nobody → Gary Kotton (garyk)
milestone: none → grizzly-1
status: New → In Progress
Changed in nova:
assignee: Gary Kotton (garyk) → nobody
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/15943

OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (master)

Reviewed: https://review.openstack.org/15943
Committed: http://github.com/openstack/quantum/commit/8eb832715117b9f79d6d20d2dd17b6ff7efe4473
Submitter: Jenkins
Branch: master

commit 8eb832715117b9f79d6d20d2dd17b6ff7efe4473
Author: Gary Kotton <email address hidden>
Date: Thu Nov 8 21:16:01 2012 +0000

    Update latest openstack-common code

    This fixes bug 1073999 (quantum/openstack/common/rpc/impl_qpid.py)

    In addition to this the common code is updated.

    Change-Id: I41223963baf34772edcd0d6d7ef5686a5fad1035

Changed in quantum:
status: In Progress → Fix Committed
Chuck Short (zulcss)
Changed in nova:
status: New → Invalid
Mark McLoughlin (markmc)
Changed in oslo:
status: New → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (stable/folsom)

Reviewed: https://review.openstack.org/16095
Committed: http://github.com/openstack/quantum/commit/82b1a55cc98519240169c27be0652ee00fd1dffc
Submitter: Jenkins
Branch: stable/folsom

commit 82b1a55cc98519240169c27be0652ee00fd1dffc
Author: Gary Kotton <email address hidden>
Date: Sat Nov 10 06:59:33 2012 +0000

    Update stable with stable oslo (aka common)

    This fixes bug 1073999

    Change-Id: I191af50a7b0ab6c3c19fd24757d7466e67549615

Gary Kotton (garyk)
tags: added: in-stable-folsom
removed: folsom-backport-potential
Thierry Carrez (ttx)
Changed in quantum:
status: Fix Committed → Fix Released
Mark McLoughlin (markmc)
Changed in oslo:
importance: Undecided → Critical
assignee: nobody → Gary Kotton (garyk)
Mark McLoughlin (markmc)
Changed in oslo:
milestone: none → grizzly-1
status: Fix Committed → Fix Released
Changed in quantum (Ubuntu):
status: New → Fix Released
Changed in quantum (Ubuntu Quantal):
status: New → Confirmed
Clint Byrum (clint-fewbar) wrote : Please test proposed package

Hello Gary, or anyone else affected,

Accepted quantum into quantal-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/quantum/2012.2.1-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in quantum (Ubuntu Quantal):
status: Confirmed → Fix Committed
tags: added: verification-needed
Mark McLoughlin (markmc)
tags: removed: in-stable-folsom
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package quantum - 2012.2.1-0ubuntu1

---------------
quantum (2012.2.1-0ubuntu1) quantal-proposed; urgency=low

  * Resynchronize with stable/folsom (1e774867) (LP: #1085255):
    - [aeabb42] There are routing problems when the dnsmasq port does not come
      first in the routing table (LP: #1083238)
    - [04aab72] Quantum linux bridge not optimized with libvirt (LP: #1078210)
    - [ca7fc10] getting quotas from database has severe performance implications
      (LP: #1075369)
    - [66605e8] failed to update an external network into non external network
      (LP: #1083387)
    - [c60051a] Quantum test suite leaks memory like a sieve (LP: #1065276)
    - [3179dfc] clear_db() does incomplete db teardown (LP: #1080988)
    - [c1e19d7] Unauthorized command: cat /proc/None/cmdline (LP: #1077651)
    - [af9e076] At times a instance will not receive an IP address from the DHCP
      agent (LP: #1081664)
    - [e0d1a7d] allow multiple floating-ip on single port if they use different
      fixed ips and/or external nets (LP: #1057844)
    - [8471d79] Delete port fails to gateway ip (LP: #1079980)
    - [aca8b4a] fixed_ip allocation which is not included within
      allocation_pools makes error when delete port or re-create port
      (LP: #1077292)
    - [eacc9d3] Mapping same bridge to different phyiscal networks succeed
      (LP: #1067669)
    - [51b4c82] python-quantum: not region aware (LP: #1080793)
    - [6f0a486] delete floatingip should be in one transaction to delete port
      (LP: #1080516)
    - [db6cda7] Remove qpid configuration variables no longer supported
    - [a112840] Allow NVP plugin to use per-tenant quota extension
    - [82b1a55] Quantum service does not restart after reboot (LP: #1073999)
    - [c01a839] There are some cases that L3 API with an invalid parameter
      returns 500. (LP: #1064765)
    - [26b383f] external network can be plugged also as internal network for one
      router (LP: #1053633)
    - [49f649c] There is a lot of cases that API with an invalid parameter
      returns 500. (LP: #1062046)
    - [4546a18] When create subnet, you con set up the value as cidr (the value
      isn't cidr form). (LP: #1067959)
    - [9ba453a] killfilter should handle updated/deleted executables
      (LP: #1073768)
    - [7c8a55c] a port which is not able to delete is made when floatingip
      create fails. (LP: #1064748)
    - [c9b84cf] Linux bridge port update causes exception (LP: #1072713)
    - [cb57932] I can't add interface to router, if there is another port in
      non-shared network of other tenant (LP: #1057558)
    - [574e278] Ryu plugin does not support Security Groups (LP: #1059393)
    - [607f486] tap device added to integration bridge without tag
      (LP: #1064070)
    - [21a0fdf] L3 agent external network flag (LP: #1056720)
    - [5cbaff4] router create with external_gateway_info fails with 500 always.
      (LP: #1064235)
    - [63b81f6] l3 db operations failed in multiple transactions (LP: #1070335)
    - [bff17fb] Ensure that the SqlSoup import is still supported.
    - [e091a29] l3_nat_agent was renamed to l3_agent
    - [9030969] remove default value of 'local_ip' of 10...


Changed in quantum (Ubuntu Quantal):
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in quantum:
milestone: grizzly-1 → 2013.1
Mark McLoughlin (markmc)
Changed in cinder:
status: New → Incomplete
status: Incomplete → Invalid
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/folsom)

Reviewed: https://review.openstack.org/22244
Committed: http://github.com/openstack/cinder/commit/ce516f6a2367b50ccfd4a1d061acae109cd9b7d2
Submitter: Jenkins
Branch: stable/folsom

commit ce516f6a2367b50ccfd4a1d061acae109cd9b7d2
Author: Eric Harney <email address hidden>
Date: Mon Feb 18 16:15:45 2013 -0500

    Sync rpc changes from oslo stable/folsom

    Contains code from the following oslo commits:

    9f938720 Update common code to support pep 1.3.
    7e36792c kombu's fanout_cast_to_server was calling wrong method
     - Fixes bug 1074113
    c3ec615c LOG.exception() should only be used in exception handler
    17c5188c Use pep8 v1.3.3
    d147d9f2 Fix QPID reconnect issues
     - Fixes bug 1073999

    Change-Id: Ia4504cb4d8f2108b743efc4639450254d6e3fb8e
