telemetry_notification_api.py test times out waiting for Nova notification

Bug #1336755 reported by Kashyap Chamarthy
This bug affects 6 people
Affects: Ceilometer
Status: Fix Released
Importance: Critical
Assigned to: Vadim Rovachev

Bug Description

'test_telemetry_notification_api.py' seems to time out waiting on a Nova notification. This failure was noticed during the Jenkins test run for 'gate-tempest-dsvm-postgres-full'.

Contextual log from console.html[1]:
--------------------------------
[. . .]
2014-07-02 10:09:54.167 | tempest.api.telemetry.test_telemetry_notification_api.TelemetryNotificationAPITestXML.test_check_nova_notification[gate]
2014-07-02 10:09:54.167 | ------------------------------------------------------------------------------------------------------------------------
2014-07-02 10:09:54.167 |
2014-07-02 10:09:54.167 | Captured traceback:
2014-07-02 10:09:54.167 | ~~~~~~~~~~~~~~~~~~~
2014-07-02 10:09:54.167 | Traceback (most recent call last):
2014-07-02 10:09:54.167 | File "tempest/api/telemetry/test_telemetry_notification_api.py", line 43, in test_check_nova_notification
2014-07-02 10:09:54.167 | self.await_samples(metric, query)
2014-07-02 10:09:54.167 | File "tempest/api/telemetry/base.py", line 95, in await_samples
2014-07-02 10:09:54.167 | CONF.compute.build_timeout))
2014-07-02 10:09:54.167 | TimeoutException: Request timed out
2014-07-02 10:09:54.167 | Details: Sample for metric:memory with query:('resource', 'eq', 'adb024f0-e632-4d24-9f20-2f1e0fc95bba') has not been added to the database within 196 seconds

2014-07-02 10:09:54.168 | Captured pythonlogging:
2014-07-02 10:09:54.168 | ~~~~~~~~~~~~~~~~~~~~~~~
2014-07-02 10:09:54.168 | 2014-07-02 09:49:21,736 24649 INFO [tempest.common.rest_client] Request (TelemetryNotificationAPITestXML:test_check_nova_notification): 200 POST http://127.0.0.1:5000/v2.0/tokens
2014-07-02 10:09:54.168 | 2014-07-02 09:49:22,783 24649 INFO [tempest.common.rest_client] Request (TelemetryNotificationAPITestXML:test_check_nova_notification): 202 POST http://127.0.0.1:8774/v2/1890fb61966b41e6ac772e51f9c9996f/servers 1.046s
2014-07-02 10:09:54.168 | 2014-07-02 09:49:23,927 24649 INFO [tempest.common.rest_client] Request (TelemetryNotificationAPITestXML:test_check_nova_notification): 200 GET http://127.0.0.1:8777/v2/meters/memory?q.op=eq&q.value=adb024f0-e632-4d24-9f20-2f1e0fc95bba&q.field=resource 1.141s
2014-07-02 10:09:54.168 | 2014-07-02 09:49:24,949 24649 INFO [tempest.common.rest_client] Request
[. . .]
--------------------------------

  [1] http://logs.openstack.org/90/102490/4/gate/gate-tempest-dsvm-postgres-full/d931fa7/console.html
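
For context, the timeout above comes from a simple polling loop in Tempest (await_samples in tempest/api/telemetry/base.py, per the traceback): it repeatedly queries the Ceilometer meter endpoint until a matching sample shows up or CONF.compute.build_timeout expires. A minimal sketch of that pattern follows; the get_samples callable, the helper signature, and the 196-second default are assumptions for illustration, not the actual Tempest API:

import time

class TimeoutException(Exception):
    pass

def await_samples(get_samples, metric, query, build_timeout=196, interval=2):
    # Poll e.g. GET /v2/meters/<metric>?q... until a sample matching the
    # query appears, or give up after build_timeout seconds.
    deadline = time.time() + build_timeout
    while time.time() < deadline:
        samples = get_samples(metric, query)
        if samples:
            return samples
        time.sleep(interval)
    raise TimeoutException(
        "Sample for metric:%s with query:%s has not been added to the "
        "database within %d seconds" % (metric, query, build_timeout))

So the failure simply means that no "memory" sample for the newly booted instance reached the metering database within the wait window.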

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

As of this writing, this issue is the top Gate bug -- 54 failures in 24 hours.

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
gordon chung (chungg) wrote :

from compute-agent log

2014-07-03 14:08:05.511 17376 ERROR ceilometer.agent [-] Unable to discover resources: string index out of range
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent Traceback (most recent call last):
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/agent.py", line 166, in discover
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent discovered = discoverer.discover(param)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/compute/discovery.py", line 32, in discover
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent instances = self.nova_cli.instance_get_all_by_host(cfg.CONF.host)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/nova_client.py", line 34, in with_logging
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent return func(*args, **kwargs)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/ceilometer/ceilometer/nova_client.py", line 125, in instance_get_all_by_host
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent search_opts=search_opts))
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/v1_1/servers.py", line 594, in list
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent return self._list("/servers%s%s" % (detail, query_string), "servers")
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/base.py", line 67, in _list
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent _resp, body = self.api.client.get(url)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/client.py", line 482, in get
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent return self._cs_request(url, 'GET', **kwargs)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/client.py", line 448, in _cs_request
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent self.authenticate()
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/client.py", line 558, in authenticate
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent port = magic_tuple.port
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/openstack/common/network_utils.py", line 91, in port
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent host, port = parse_host_port(netloc)
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent File "/opt/stack/new/python-novaclient/novaclient/openstack/common/network_utils.py", line 57, in parse_host_port
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent if address[0] == '[':
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent IndexError: string index out of range
2014-07-03 14:08:05.511 17376 TRACE ceilometer.agent
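
The IndexError at the bottom of the trace means parse_host_port was handed an empty address string, i.e. the URL novaclient tried to authenticate against had no netloc. A minimal reproduction of that failure mode, using a simplified stand-in for the helper rather than the exact novaclient code:

def parse_host_port(address, default_port=None):
    # Simplified sketch: the real helper branches on a leading '[' to
    # detect an escaped IPv6 address, so an empty string fails on the
    # very first character access.
    if address[0] == '[':                       # IndexError when address == ''
        host, _, port = address.partition(']:')
        host = host.strip('[]')
    else:
        host, _, port = address.partition(':')
    return host, int(port) if port else default_port

parse_host_port('[::1]:8774')   # ('::1', 8774)
parse_host_port('127.0.0.1')    # ('127.0.0.1', None)
parse_host_port('')             # IndexError: string index out of range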

Changed in ceilometer:
assignee: nobody → gordon chung (chungg)
Revision history for this message
Eoghan Glynn (eglynn) wrote :

Bug to be addressed by reverting:

  https://review.openstack.org/97107

via this proposed revert:

  https://review.openstack.org/104597

Revision history for this message
Attila Fazekas (afazekas) wrote :
Revision history for this message
Jeremy Stanley (fungi) wrote :
Revision history for this message
Jeremy Stanley (fungi) wrote :

In fact, the logstash query in comment #4 indicates it ramped back up several days after the revert.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Actually, it looks like this might be a too-broad logstash query, since I don't find the same agent errors as in the bug description... http://logs.openstack.org/81/105881/3/gate/gate-tempest-dsvm-postgres-full/3d96abe/logs/screen-ceilometer-acompute.txt.gz?level=TRACE

Revision history for this message
Attila Fazekas (afazekas) wrote :

It looks like the issue can happen more frequently with qpid messaging and with neutron networking.

gordon chung (chungg)
Changed in ceilometer:
assignee: gordon chung (chungg) → nobody
Revision history for this message
gordon chung (chungg) wrote :

currently debugging... either the number of messages increased substantially or something has negatively affected collector performance.

there seems to be a ~4min delay between receiving the message in the notification agent:
2014-07-23 00:03:17.634 28684 INFO ceilometer.network.notifications [-] network notification {u'_context_roles': [u'_member_', u'admin'], u'_context_request_id': u'req-1a56fae5-05e8-499f-ba8d-049778517b1e', u'_context_read_deleted': u'no', 'event_type': u'port.create.end', u'_context_user_name': u'neutron', u'_context_project_name': u'service', 'timestamp': u'2014-07-23 00:03:17.522174', u'_context_tenant_name': u'service', u'_context_tenant': u'a30c10acdcdf4a8982749b798df12e63', 'message_id': u'71a89d5e-6c4d-4a29-b27f-27761ba6c7cd', 'priority': 'info', u'_context_is_admin': True, u'_context_timestamp': u'2014-07-23 00:03:16.928028', u'_context_project_id': u'a30c10acdcdf4a8982749b798df12e63', u'_context_tenant_id': u'a30c10acdcdf4a8982749b798df12e63', u'_context_user': u'4dadb69da15c477e8ad5807bb51545ac', u'_context_user_id': u'4dadb69da15c477e8ad5807bb51545ac', 'publisher_id': u'network.devstack-trusty-hpcloud-b4-1037295', 'payload': {u'port': {u'status': u'DOWN', u'binding:host_id': u'devstack-trusty-hpcloud-b4-1037295', u'name': u'', u'allowed_address_pairs': [], u'admin_state_up': True, u'network_id': u'30ace1bf-0fe9-4fe9-9050-c8177eb46943', u'tenant_id': u'33727d88b92e4fccb6f1cee02e928a1f', u'extra_dhcp_opts': [], u'binding:vnic_type': u'normal', u'binding:vif_type': u'ovs', u'device_owner': u'compute:None', u'mac_address': u'fa:16:3e:a9:ef:15', u'id': u'fa55a4e9-c7f9-4e39-b4d3-dfe8179d401a', u'binding:profile': {}, u'fixed_ips': [{u'subnet_id': u'12c84344-654e-4054-92f0-e60ddeb81c8c', u'ip_address': u'10.100.0.2'}], u'binding:vif_details': {u'port_filter': True, u'ovs_hybrid_plug': True}, u'security_groups': [u'ccb7c2d3-6dcc-4221-a27d-36fe43a93bec'], u'device_id': u'35ce924d-27f7-4f79-a358-fdad61b2a41b'}}}

and writing it to db:

2014-07-23 00:07:40.124 28683 DEBUG ceilometer.dispatcher.database [-] metering data instance:m1.nano for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 1 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
2014-07-23 00:07:40.147 28683 DEBUG ceilometer.dispatcher.database [-] metering data vcpus for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 1 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
2014-07-23 00:07:40.212 28683 DEBUG ceilometer.dispatcher.database [-] metering data instance for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 1 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
2014-07-23 00:07:40.233 28683 DEBUG ceilometer.dispatcher.database [-] metering data memory for 35ce924d-27f7-4f79-a358-fdad61b2a41b @ 2014-07-23 00:03:15.433578: 64 record_metering_data /opt/stack/new/ceilometer/ceilometer/dispatcher/database.py:55
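
A quick sanity check of the gap between those two excerpts (timestamps taken verbatim from the log lines above):

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
received = datetime.strptime("2014-07-23 00:03:17.634", fmt)  # notification agent log entry
written = datetime.strptime("2014-07-23 00:07:40.124", fmt)   # first dispatcher DB write
print(written - received)  # 0:04:22.490000

so just over four minutes elapse between the notification agent receiving the event and the collector recording the related samples.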

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

This is currently the top cause of failures in the neutron full job.
And the neutron full job is where this error occurs most frequently:
http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiU2FtcGxlIGZvciBtZXRyaWM6bWVtb3J5IHdpdGggcXVlcnk6XCIgQU5EIG1lc3NhZ2U6XCJoYXMgbm90IGJlZW4gYWRkZWQgdG8gdGhlIGRhdGFiYXNlIHdpdGhpblwiIEFORCB0YWdzOmNvbnNvbGUiLCJ0aW1lZnJhbWUiOiI4NjQwMDAiLCJncmFwaG1vZGUiOiJjb3VudCIsIm9mZnNldCI6MCwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwibW9kZSI6InNjb3JlIiwiYW5hbHl6ZV9maWVsZCI6ImJ1aWxkX25hbWUifQ==

The actual frequency is even worse if one considers that the neutron full job runs only for neutron and tempest patches.
So I would support the hypothesis of something affecting collector performance.

Revision history for this message
gordon chung (chungg) wrote :

very strange that it's not consistent. Looking at the history, there are multiple runs where the tests finish in a few seconds; other times either the XML or the JSON test takes >180s while the other finishes in a few seconds, and sometimes both the XML and JSON tests take >150s... it seems like it really varies based on load... not sure if it's MQ load or DB load?

Revision history for this message
Vadim Rovachev (vrovachev) wrote :

Soon I will try to reproduce this bug.

Changed in ceilometer:
assignee: nobody → Vadim Rovachev (vrovachev)
Revision history for this message
Chris Dent (cdent) wrote :

Can we close this yet?

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Chris: If there's been no further occurrence of this bug in the last few weeks of CI runs, then this can probably be closed, but it'd be nice to know if it was ever tracked down to a specific root cause.

Revision history for this message
Chris Dent (cdent) wrote :

I think, unfortunately, that there was a complex of issues causing this particular problem, and it's not clear which fix or fixes addressed this one in particular. I do know that at least two problems were likely relevant:

* screen in the gate has been notoriously flaky about randomly not starting processes and not reporting why. Screen is no longer being used on those runs, so that's probably helping.

* in earlier devstacks the startup ordering of the compute agent and the collector was bad, leading to notifications sometimes being missed.

Both of these things have been changed.

Chris Dent (cdent)
Changed in ceilometer:
status: Triaged → Fix Released
Revision history for this message
Chi Lo (cl566n) wrote :

I still see this error in my Devstack, which is based on Kilo. How do I pick up this change?
