ZeroMQ unit tests crash on Fedora 17

Bug #1009793 reported by Russell Bryant
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
oslo-incubator
Fix Released
Low
Unassigned

Bug Description

On Fedora 17:

$ ./run_tests.sh -N nova.tests.rpc.test_zmq
RpcZmqBaseTopicTestCase
    test_call_succeed OK 0.86
    test_call_succeed_despite_multiple_returns_yield ./run_tests.sh: line 76: 25948 Segmentation fault ${wrapper} $NOSETESTS 2> run_tests.log

]$ cat run_tests.log
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/engine/strategies.py:117: SADeprecationWarning: The 'listeners' argument to Pool (and create_engine()) is deprecated. Use event.listen().
  pool = poolclass(creator, **pool_args)

2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/pool.py:159: SADeprecationWarning: Pool.add_listener is deprecated. Use event.listen()
  self.add_listener(l)

2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/engine/reflection.py:47: SAWarning: Did not recognize type 'BIGINT' of column 'bw_in'
  ret = fn(self, con, *args, **kw)

2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/engine/reflection.py:47: SAWarning: Did not recognize type 'BIGINT' of column 'bw_out'
  ret = fn(self, con, *args, **kw)

2012-06-06 21:33:26 INFO nova.tests.rpc.test_zmq [-] Running internal zmq receiver.
2012-06-06 21:33:26 INFO nova.rpc.common [-] Registering reactor
2012-06-06 21:33:26 INFO nova.rpc.common [-] In reactor registered
2012-06-06 21:33:26 INFO nova.rpc.common [-] Create Consumer for topic (test)
2012-06-06 21:33:26 INFO nova.rpc.common [-] Registering reactor
2012-06-06 21:33:26 INFO nova.rpc.common [-] In reactor registered
2012-06-06 21:33:26 INFO nova.rpc.common [-] Consuming socket
2012-06-06 21:33:26 INFO nova.rpc.common [-] Consuming socket
2012-06-06 21:33:26 INFO nova.rpc.common [-] Created topic proxy: test
2012-06-06 21:33:26 INFO nova.tests.rpc.test_zmq [-] Running internal zmq receiver.
2012-06-06 21:33:26 INFO nova.rpc.common [-] Registering reactor

Changed in nova:
status: New → Confirmed
description: updated
Changed in nova:
importance: Undecided → High
Revision history for this message
Erica Windisch (ewindisch) wrote : Re: [Bug 1009793] Re: ZeroMQ unit tests crash on Fedora 17

What is your zeromq version?

Sent from my iPad

On Jun 6, 2012, at 21:41, Russell Bryant <email address hidden> wrote:

> ** Description changed:
>
> On Fedora 17:
> -
>
> $ ./run_tests.sh -N nova.tests.rpc.test_zmq
> RpcZmqBaseTopicTestCase
> - test_call_succeed OK 0.86
> - test_call_succeed_despite_multiple_returns_yield ./run_tests.sh: line 76: 25948 Segmentation fault ${wrapper} $NOSETESTS 2> run_tests.log
> + test_call_succeed OK 0.86
> + test_call_succeed_despite_multiple_returns_yield ./run_tests.sh: line 76: 25948 Segmentation fault ${wrapper} $NOSETESTS 2> run_tests.log
> +
> +
> +
> + ]$ cat run_tests.log
> + nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
> + 2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/engine/strategies.py:117: SADeprecationWarning: The 'listeners' argument to Pool (and create_engine()) is deprecated. Use event.listen().
> + pool = poolclass(creator, **pool_args)
> +
> + 2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/pool.py:159: SADeprecationWarning: Pool.add_listener is deprecated. Use event.listen()
> + self.add_listener(l)
> +
> + 2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/engine/reflection.py:47: SAWarning: Did not recognize type 'BIGINT' of column 'bw_in'
> + ret = fn(self, con, *args, **kw)
> +
> + 2012-06-06 21:33:25 WARNING nova.utils [-] /usr/lib64/python2.7/site-packages/sqlalchemy/engine/reflection.py:47: SAWarning: Did not recognize type 'BIGINT' of column 'bw_out'
> + ret = fn(self, con, *args, **kw)
> +
> + 2012-06-06 21:33:26 INFO nova.tests.rpc.test_zmq [-] Running internal zmq receiver.
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] Registering reactor
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] In reactor registered
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] Create Consumer for topic (test)
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] Registering reactor
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] In reactor registered
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] Consuming socket
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] Consuming socket
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] Created topic proxy: test
> + 2012-06-06 21:33:26 INFO nova.tests.rpc.test_zmq [-] Running internal zmq receiver.
> + 2012-06-06 21:33:26 INFO nova.rpc.common [-] Registering reactor
>
> ** Changed in: nova
> Importance: Undecided => High
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1009793
>
> Title:
> ZeroMQ unit tests crash on Fedora 17
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/nova/+bug/1009793/+subscriptions

Revision history for this message
Russell Bryant (russellb) wrote :

$ rpm -q zeromq python-zmq
zeromq-2.1.11-1.fc17.x86_64
python-zmq-2.1.9-4.fc17.x86_64

affects: nova → openstack-common
David Ripton (dripton)
Changed in openstack-common:
assignee: nobody → David Ripton (dripton)
Revision history for this message
David Ripton (dripton) wrote :

The first test succeeds. The second test fails, when it attempts to bind "tcp://127.0.0.1:9502", with "ZMQError: Address already in use", which then gets wrapped into RPCException("Could not open socket").

So the obvious conclusion is that we're not closing ZMQ sockets after we're done with them.

If I comment out the first test, then the second test succeeds and the third test fails, supporting that conclusion.

But I don't see why the cleanup code fails yet. I see code that does attempt to clean them up (tearDown calls ZmqBaseReactor.close, which iterates over its sockets and calls close() on each.) And it's being called.

Revision history for this message
Erica Windisch (ewindisch) wrote :

David - the 9502 port is bound only in the latest master per the last commit - I'm going to submit a patch to revert. This was for some intended message format changes which we decided wouldn't go into Folsom. I might reapply later, but only after we have versioning (and possibly, more testing?)

Revision history for this message
Erica Windisch (ewindisch) wrote :

David: I have also seen some unconfirmed reports of descriptor leakage indicating unclosed sockets. As you say, the code SEEMS to be closing them correctly in every case. I was very careful to throughly test the try: finally blocks and all uses of 'with'.

These are likely red herrings to this bug report, though.

Revision history for this message
David Ripton (dripton) wrote :

test_zmq.py fails 100% reliably for me, but it fails differently on Ubuntu 12.04 and Fedora 17.

On Ubuntu, I get 18 successes and 2 failures (RpcZmqBaseTopicTestCase test_fanout_success and RpcZmqDirectTopicTestCase test_fanout_success).

On Fedora, 3 successes and 17 failures. With most of the failures of the "could not open socket because it's already open" variety.

My first hypothesis, after seeing that the code was basically sane, was that it's eventlet's fault. (Because eventlet does spooky monkeypatching of core Python libraries that mean you have to question all your assumptions.) Fedora has eventlet 0.9.17, versus Ubuntu's 0.9.16. I used "pip install --upgrade eventlet" on my Ubuntu box to upgrade its eventlet to 0.9.17, and got 17 errors.

So I suspect something changed between eventlet 0.9.16 and 0.9.17 that's bad for this code. I don't know exactly what yet.

Revision history for this message
David Ripton (dripton) wrote :

I tried hg bisect on eventlet to try to find the "bad" commit. Unfortunately, I started getting test failures on every eventlet revision I tested, even the 0.9.16 on Ubuntu configuration that seemed to almost work before. (There were a bunch of zeromq-related changes in that range, so I still think that zeromq version is the main problem; I just haven't proven it yet.)

Revision history for this message
Erica Windisch (ewindisch) wrote :

The original bug was for segmentation faults, which I'm not sure is happening anymore. At some point, the tests did work for me on Fedora 17 and I was unable to reproduce this bug.

Unfortunately, Eventlet's green.zmq had been effectively rewritten in the Folsom timeframe. In fact, I just submitted a pull request to green.zmq to fix one bug resulting from that... I suspect there are more.

Revision history for this message
David Ripton (dripton) wrote :

Eric, I am still seeing segfaults, sometimes. And sometimes just test failures without segfaults. I haven't figured out the difference yet.

Padraig Brady mentioned that failing to open sockets with address already in use could also be SO_REUSEADDR, rather than only a failure to close a socket. I've seen neither, though.

Mark McLoughlin (markmc)
affects: openstack-common → oslo
Revision history for this message
Mark McLoughlin (markmc) wrote :

Looks like David isn't working on this anymore

Changed in oslo:
assignee: David Ripton (dripton) → nobody
importance: High → Low
Revision history for this message
Erica Windisch (ewindisch) wrote :

I believe this bug was caused by the eventlet 0.9.17 regression of zmq.socket.close() no longer accepting the linger argument. This has had a fix commited per https://bugs.launchpad.net/oslo/+bug/1099185

Additionally, the tests now increment the port number as not to fall into a race condition
 per https://bugs.launchpad.net/oslo/+bug/1096223

Changed in oslo:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.