telemetry_notification_api.py test times out waiting for Nova notification

Bug #1336755 reported by Kashyap Chamarthy
This bug affects 6 people
Affects: Ceilometer
Status: Fix Released
Importance: Critical
Assigned to: Vadim Rovachev

Bug Description

'test_telemetry_notification_api.py' seems to time out waiting on a Nova notification. This failure was noticed during the Jenkins test run for 'gate-tempest-dsvm-postgres-full'.

Contextual log from console.html[1]:
--------------------------------
[. . .]
2014-07-02 10:09:54.167 | tempest.api.telemetry.test_telemetry_notification_api.TelemetryNotificationAPITestXML.test_check_nova_notification[gate]
2014-07-02 10:09:54.167 | ------------------------------------------------------------------------------------------------------------------------
2014-07-02 10:09:54.167 |
2014-07-02 10:09:54.167 | Captured traceback:
2014-07-02 10:09:54.167 | ~~~~~~~~~~~~~~~~~~~
2014-07-02 10:09:54.167 | Traceback (most recent call last):
2014-07-02 10:09:54.167 | File "tempest/api/telemetry/test_telemetry_notification_api.py", line 43, in test_check_nova_notification
2014-07-02 10:09:54.167 | self.await_samples(metric, query)
2014-07-02 10:09:54.167 | File "tempest/api/telemetry/base.py", line 95, in await_samples
2014-07-02 10:09:54.167 | CONF.compute.build_timeout))
2014-07-02 10:09:54.167 | TimeoutException: Request timed out
2014-07-02 10:09:54.167 | Details: Sample for metric:memory with query:('resource', 'eq', 'adb024f0-e632-4d24-9f20-2f1e0fc95bba') has not been added to the database within 196 seconds

2014-07-02 10:09:54.168 | Captured pythonlogging:
2014-07-02 10:09:54.168 | ~~~~~~~~~~~~~~~~~~~~~~~
2014-07-02 10:09:54.168 | 2014-07-02 09:49:21,736 24649 INFO [tempest.common.rest_client] Request (TelemetryNotificationAPITestXML:test_check_nova_notification): 200 POST http://127.0.0.1:5000/v2.0/tokens
2014-07-02 10:09:54.168 | 2014-07-02 09:49:22,783 24649 INFO [tempest.common.rest_client] Request (TelemetryNotificationAPITestXML:test_check_nova_notification): 202 POST http://127.0.0.1:8774/v2/1890fb61966b41e6ac772e51f9c9996f/servers 1.046s
2014-07-02 10:09:54.168 | 2014-07-02 09:49:23,927 24649 INFO [tempest.common.rest_client] Request (TelemetryNotificationAPITestXML:test_check_nova_notification): 200 GET http://127.0.0.1:8777/v2/meters/memory?q.op=eq&q.value=adb024f0-e632-4d24-9f20-2f1e0fc95bba&q.field=resource 1.141s
2014-07-02 10:09:54.168 | 2014-07-02 09:49:24,949 24649 INFO [tempest.common.rest_client] Request
[. . .]
--------------------------------

  [1] http://logs.openstack.org/90/102490/4/gate/gate-tempest-dsvm-postgres-full/d931fa7/console.html
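
For context, the timeout above comes from a simple polling loop in Tempest (await_samples in tempest/api/telemetry/base.py, per the traceback): it repeatedly queries the Ceilometer meter endpoint until a matching sample shows up or CONF.compute.build_timeout expires. A minimal sketch of that pattern follows; the get_samples callable, the helper signature, and the 196-second default are assumptions for illustration, not the actual Tempest API:

import time

class TimeoutException(Exception):
    pass

def await_samples(get_samples, metric, query, build_timeout=196, interval=2):
    # Poll e.g. GET /v2/meters/<metric>?q... until a sample matching the
    # query appears, or give up after build_timeout seconds.
    deadline = time.time() + build_timeout
    while time.time() < deadline:
        samples = get_samples(metric, query)
        if samples:
            return samples
        time.sleep(interval)
    raise TimeoutException(
        "Sample for metric:%s with query:%s has not been added to the "
        "database within %d seconds" % (metric, query, build_timeout))

So the failure simply means that no "memory" sample for the newly booted instance reached the metering database within the wait window.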

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

As of this writing, this issue is the top Gate bug -- 54 failures in 24 hours.

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
gordon chung (chungg) wrote :

from compute-agent log

2014-07-03 14:08:05.511 17376 ERROR ceilometer.agent [-] Unable to discover resources: string index out of range
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent Traceback (most recent call last):
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/agent.py", line 166, in discover
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent discovered = discoverer.discover(param)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/compute/discovery.py", line 32, in discover
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent instances = self.nova_cli.instance_get_all_by_host(cfg.CONF.host)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/nova_client.py", line 34, in with_logging
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent return func(*args, **kwargs)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/nova_client.py", line 125, in instance_get_all_by_host
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent search_opts=search_opts))
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/v1_1/servers.py", line 594, in list
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent return self._list("/servers%s%s" % (detail, query_string), "servers")
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/base.py", line 67, in _list
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent _resp, body = self.api.client.get(url)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/client.py", line 482, in get
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent return self._cs_request(url, 'GET', **kwargs)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/client.py", line 448, in _cs_request
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent self.authenticate()
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/client.py", line 558, in authenticate
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent port = magic_tuple.port
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/openstack/common/network_utils.py", line 91, in port
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent host, port = parse_host_port(netloc)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/openstack/common/network_utils.py", line 57, in parse_host_port
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent if address[0] == '[':
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent IndexError: string index out of range
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent
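
The IndexError at the bottom of the trace means parse_host_port was handed an empty address string, i.e. the URL novaclient tried to authenticate against had no netloc. A minimal reproduction of that failure mode, using a simplified stand-in for the helper rather than the exact novaclient code:

def parse_host_port(address, default_port=None):
    # Simplified sketch: the real helper branches on a leading '[' to
    # detect an escaped IPv6 address, so an empty string fails on the
    # very first character access.
    if address[0] == '[':                       # IndexError when address == ''
        host, _, port = address.partition(']:')
        host = host.strip('[]')
    else:
        host, _, port = address.partition(':')
    return host, int(port) if port else default_port

parse_host_port('[::1]:8774')   # ('::1', 8774)
parse_host_port('127.0.0.1')    # ('127.0.0.1', None)
parse_host_port('')             # IndexError: string index out of range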

Changed in ceilometer:
assignee: nobody → gordon chung (chungg)
Revision history for this message
Eoghan Glynn (eglynn) wrote :

Bug to be addressed by reverting:

  https://review.openstack.org/97107

via this proposed revert:

  https://review.openstack.org/104597

Revision history for this message
Attila Fazekas (afazekas) wrote :
Revision history for this message
Jeremy Stanley (fungi) wrote :
Revision history for this message
Jeremy Stanley (fungi) wrote :

In fact, the logstash query in comment #4 indicates it ramped back up several days after the revert.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Actually, it looks like this might be a too-broad logstash query, since I don't find the same agent errors as in the bug description... http://logs.openstack.org/81/105881/3/gate/gate-tempest-dsvm-postgres-full/3d96abe/logs/screen-ceilometer-acompute.txt.gz?level=TRACE

Revision history for this message
Attila Fazekas (afazekas) wrote :

It looks like the issue can happen more frequently with qpid messaging and with neutron networking.

gordon chung (chungg)
Changed in ceilometer:
assignee: gordon chung (chungg) → nobody
Revision history for this message
gordon chung (chungg) wrote :

currently debugging... either the number of messages increased substantially or something has negatively affected collector performance.

there seems to be a ~4min delay between receiving the message in the notification agent:
2014-07-23 00:03:17.634 28684 INFO ceilometer.network.notifications [-] network notification {u'_context_roles': [u'_member_', u'admin'], u'_context_request_id': u'req-1a56fae5-05e8-499f-ba8d-049778517b1e', u'_context_read_deleted': u'no', 'event_type': u'port.create.end', u'_context_user_name': u'neutron', u'_context_project_name': u'service', 'timestamp': u'2014-07-23 00:03:17.522174', u'_context_tenant_name': u'service', u'_context_tenant': u'a30c10acdcdf4a8982749b798df12e63', 'message_id': u'71a89d5e-6c4d-4a29-b27f-27761ba6c7cd', 'priority': 'info', u'_context_is_admin': True, u'_context_timestamp': u'2014-07-23 00:03:16.928028', u'_context_project_id': u'a30c10acdcdf4a8982749b798df12e63', u'_context_tenant_id': u'a30c10acdcdf4a8982749b798df12e63', u'_context_user': u'4dadb69da15c477e8ad5807bb51545ac', u'_context_user_id': u'4dadb69da15c477e8ad5807bb51545ac', 'publisher_id': u'network.devstack-trusty-hpcloud-b4-1037295', 'payload': {u'port': {u'status': u'DOWN', u'binding:host_id': u'devstack-trusty-hpcloud-b4-1037295', u'name': u'', u'allowed_address_pairs': [], u'admin_state_up': True, u'network_id': u'30ace1bf-0fe9-4fe9-9050-c8177eb46943', u'tenant_id': u'33727d88b92e4fccb6f1cee02e928a1f', u'extra_dhcp_opts': [], u'binding:vnic_type': u'normal', u'binding:vif_type': u'ovs', u'device_owner': u'compute:None', u'mac_address': u'fa:16:3e:a9:ef:15', u'id': u'fa55a4e9-c7f9-4e39-b4d3-dfe8179d401a', u'binding:profile': {}, u'fixed_ips': [{u'subnet_id': u'12c84344-654e-4054-92f0-e60ddeb81c8c', u'ip_address': u'10.100.0.2'}], u'binding:vif_details': {u'port_filter': True, u'ovs_hybrid_plug': True}, u'security_groups': [u'ccb7c2d3-6dcc-4221-a27d-36fe43a93bec'], u'device_id': u'35ce924d-27f7-4f79-a358-fdad61b2a41b'}}}

and writing it to db:

2014-07-23 00:07:40.124 28683 DEBUG ceilometer.dispatcher.database [-] metering data instance:m1.nano for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 1 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
2014-07-23 00:07:40.147 28683 DEBUG ceilometer.dispatcher.database [-] metering data vcpus for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 1 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
2014-07-23 00:07:40.212 28683 DEBUG ceilometer.dispatcher.database [-] metering data instance for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 1 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
2014-07-23 00:07:40.233 28683 DEBUG ceilometer.dispatcher.database [-] metering data memory for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 64 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
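
A quick sanity check of the gap between those two excerpts (timestamps taken verbatim from the log lines above):

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
received = datetime.strptime("2014-07-23 00:03:17.634", fmt)  # notification agent log entry
written = datetime.strptime("2014-07-23 00:07:40.124", fmt)   # first dispatcher DB write
print(written - received)  # 0:04:22.490000

so just over four minutes elapse between the notification agent receiving the event and the collector recording the related samples.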

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

This is currently the top cause of failures in the neutron full job.
And the neutron full job is where this error occurs most frequently:
http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiU2FtcGxlIGZvciBtZXRyaWM6bWVtb3J5IHdpdGggcXVlcnk6XCIgQU5EIG1lc3NhZ2U6XCJoYXMgbm90IGJlZW4gYWRkZWQgdG8gdGhlIGRhdGFiYXNlIHdpdGhpblwiIEFORCB0YWdzOmNvbnNvbGUiLCJ0aW1lZnJhbWUiOiI4NjQwMDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwibW9kZSI6InNjb3JlIiwiYW5hbHl6ZV9maWVsZCI6ImJ1aWxkX25hbWUifQ==

The actual frequency is even worse if one considers that the neutron full job runs only for neutron and tempest patches.
So I would support the hypothesis of something affecting collector performance.

Revision history for this message
gordon chung (chungg) wrote :

very strange that it's not consistent. Looking at the history, there are multiple runs where the tests finish in a few seconds; other times either the XML or the JSON test takes >180s while the other finishes in a few seconds, and sometimes both the XML and JSON tests take >150s... it seems like it really varies based on load... not sure if it's MQ load or DB load?

Revision history for this message
Vadim Rovachev (vrovachev) wrote :

Soon I will try to reproduce this bug.

Changed in ceilometer:
assignee: nobody → Vadim Rovachev (vrovachev)
Revision history for this message
Chris Dent (cdent) wrote :

Can we close this yet?

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Chris: If there's been no further occurrence of this bug in the last few weeks of CI runs, then this can probably be closed, but it'd be nice to know if it was ever tracked down to a specific root cause.

Revision history for this message
Chris Dent (cdent) wrote :

I think, unfortunately, that there was a complex of issues causing this particular problem, and it's not clear which fix or fixes addressed this one in particular. I do know that at least two problems were likely relevant:

* screen in the gate has been notoriously flaky about randomly not starting processes and not reporting why. Screen is no longer being used on those runs, so that's probably helping.

* in earlier devstacks the startup ordering of the compute agent and the collector was bad, leading to notifications sometimes being missed.

Both of these things have been changed.

Chris Dent (cdent)
Changed in ceilometer:
status: Triaged → Fix Released
Revision history for this message
Chi Lo (cl566n) wrote :

I still see this error in my Devstack, which is based on Kilo. How do I pick up this change?
