gate-tempest-devstack-vm-neutron-large-ops is failing

Bug #1250168 reported by Joe Gordon
This bug affects 2 people
Affects                     Status    Importance   Assigned to    Milestone
OpenStack Compute (nova)    Invalid   High         Unassigned
neutron                     Invalid   Undecided    Aaron Rosen

Bug Description

job 'gate-tempest-devstack-vm-neutron-large-ops' is failing on trunk. This job does the equivalent of 'nova boot --num-instances=150' using nova's fake virt driver, so the failure means either that not all the VMs are coming up or that neutron is taking too long.

Sample output http://logs.openstack.org/44/54044/7/check/gate-tempest-devstack-vm-neutron-large-ops/7fdf4a3/
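
For reference, the scenario boils down to something like the following python-novaclient sketch (illustrative only, not the tempest code; the endpoint, credentials, image and flavor are placeholders, and the client constructor signature varies between novaclient releases):

    # Rough sketch only: one boot request for 150 servers against the fake virt
    # driver, so the real work lands on nova's network allocation path and on
    # neutron rather than on a hypervisor.
    from novaclient import client as nova_client

    IMAGE_ID = "image-uuid-placeholder"
    FLAVOR_ID = "42"
    nova = nova_client.Client("2", "demo", "secret", "demo",
                              "http://127.0.0.1:5000/v2.0")
    nova.servers.create(name="large-ops", image=IMAGE_ID, flavor=FLAVOR_ID,
                        min_count=150, max_count=150)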

Revision history for this message
Joe Gordon (jogo) wrote :
Changed in neutron:
status: New → Confirmed
Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/55913

Changed in neutron:
assignee: nobody → Anita Kuno (anteaya)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/55917

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The root cause of the issue is neutron's inability to provide network allocations for 150 instances (the number large_ops spawns) within the timeout of 196 seconds.
NOTE: these operations do not include actual provisioning in the backend. The time appears to be spent while the concurrent requests issued by _allocate_for_network_async in nova compete for DB resources.

This is the symptom of a performance regression that needs to be addressed in neutron.
- Enabling multiple workers on the neutron API does not improve the situation (the requests are waiting for a contended resource to become available, not for a free worker).
- The same issue is encountered with both the OVS and the ML2 plugins, which suggests the problem lies either in the IPAM logic or in other shared code in db_base_plugin_v2 (it is actually marginally better with OVS due to fewer DB operations).
- Changing the build_interval parameter to reduce the frequency of calls to GET /v2/servers/<server_id> (which in turn calls neutron to get instance port info) has no impact on the overall time required by the test, which further supports the hypothesis of resource contention in IP allocation.
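
The contention pattern is easy to picture with a toy example: when every create_port has to take the same allocation-pool lock, adding API workers only adds more waiters. A minimal sketch (plain Python, not neutron code; the lock and the sleep stand in for the DB row lock and the IPAM queries):

    import threading
    import time

    pool_lock = threading.Lock()   # stands in for the lock on the allocation-pool rows

    def create_port(i):
        with pool_lock:            # every request queues up here, one at a time
            time.sleep(0.05)       # stands in for the IPAM select/update work

    start = time.time()
    threads = [threading.Thread(target=create_port, args=(i,)) for i in range(150)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Prints roughly 7.5s (150 * 0.05): fully serialized despite 150 concurrent callers.
    print("150 port allocations took %.1fs" % (time.time() - start))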

The error can easily be reproduced locally by running the tests inside a VM with 2GB of RAM.

More details will be provided when available.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I wonder if this is the culprit:

https://review.openstack.org/#/c/52639/

Digging....

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

It looks like increasing the timeout not only confirms Salvatore's assessment, but is also in line with the fact that the gate started breaking after this change merged (in fact we get a failure in this very change). The obvious short-term solution would be to revert change I7e7c767400ca448cb86d27b60a1229a2afa69726, or better yet, to set BUILD_TIMEOUT back to 400 (the old value) for the large-ops job until we can figure out a more permanent solution.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I have pushed a draft change to which you've been added as a reviewer. As you can see, I managed to get a job 'SUCCESS' for patchset #1, which is promising.

Revision history for this message
Anita Kuno (anteaya) wrote :

I don't see a url for the draft change.

Just as an FYI, drafts are pretty much not used in -infra, since they tend to create obstacles for commenting or for other operations people sometimes like to do with patches. The usual workflow in -infra is to submit the patch as normal and then mark it as a work in progress; that way all operations remain available on the patch for its entire life, and anyone who is interested (such as myself) can view it.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Armando, I found that patch too yesterday.
I forgot to mention that this was done on purpose for the large ops job in order to avoid regressions from nova-network.

There's not going to be a revert of that, and I don't think there will be a different timeout value for neutron either.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Summary of the issue:
- it exploded once the timeout was reduced to 196 seconds
- it is caused by a number of factors:
    - the IP allocation logic locks the allocation pool, thereby serializing create_port requests
    - the neutron API layer performs other DB operations which turn out to be time-consuming; in some cases almost a second elapses before the call is dispatched to the plugin, and the quota checks appear to be the most expensive part
    - allocate_for_instance in nova.network.neutronv2.api.py makes a number of calls into the neutron API, perhaps too many (a rough sketch of the call pattern is at the end of this comment). With neutron under load this method might take a long time to complete; this is also confirmed by the fact that in some cases the request to create a port reaches the neutron API as much as 40 seconds after the asynchronous network allocation function has been called
    - there are far too many round trips to keystone in the process (about 50% of neutron API calls)

Getting neutron large-ops passing is therefore an effort which is likely to involve both nova and neutron changes.
The last two bullets in particular are well-known areas which have already been earmarked for improvement; on the other hand, it is worth pointing out that this is probably not going to be low-hanging fruit.

As a further data point, no benefit has been observed from increasing the size of the connection pool in neutron.

------
As an example, running large_ops with just 50 VMs in my environment completes in an average time of about 80 seconds.
The bottleneck is allocate_for_instance in nova.network.neutronv2.api.py rather than any specific neutron call.
Considering the longest one, POST /ports.json, the worst-case time observed for completing the API request is 4.53 seconds (even though this is quite high, it alone is unlikely to account for the long time needed to spawn 50 instances concurrently).
On the other hand, the last POST /ports.json request is received 38.7 seconds after the first; with 150 VMs this figure is much worse.

There are several components to this bottleneck, and more profiling is needed to understand which one should be tackled first.
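
To make the nova/neutron chatter point above concrete, here is a rough sketch of the call pattern (illustrative only, not the actual allocate_for_instance code; the method names are python-neutronclient's, but the exact sequence is an assumption):

    # Per-instance allocation that issues several neutron REST calls, each of
    # which may also round-trip to keystone for token validation.
    def allocate_for_instance_sketch(neutron, instance):
        nets = neutron.list_networks(tenant_id=instance["tenant_id"])   # call 1
        neutron.list_ports(device_id=instance["uuid"])                  # call 2
        port = neutron.create_port({"port": {                           # call 3
            "network_id": nets["networks"][0]["id"],
            "device_id": instance["uuid"],
        }})
        neutron.show_port(port["port"]["id"])                           # call 4
        return port

    # 4+ calls per instance x 150 instances = 600+ API requests, roughly half
    # of which were observed to also hit keystone.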

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Anita: that patch was just intended as a test; I have no intention of carrying out any further work with it. By the way, I have added you as a reviewer, so you should be able to access this one:

https://review.openstack.org/#/c/55960/

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Salvatore: all I am saying is that if we want to get gate-tempest-devstack-vm-neutron-large-ops back on its feet, the timeout needs to be relaxed; otherwise we have to put up with failures until we get rid of all the bottlenecks in the system.

Reducing that build timeout unconditionally, without making sure that the system can tolerate it, seems a bit short-sighted to me, regardless of the regression it was trying to address.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Opened Nova bug for the excessive admin token calls to Neutron - https://bugs.launchpad.net/nova/+bug/1250580

Revision history for this message
Joe Gordon (jogo) wrote :

As Armando pointed out, it looks like https://review.openstack.org/#/c/52639/ made the job fail, although it isn't the root cause, which Salvatore has done a good job of digging into. Perhaps this should be broken up into a two-part issue:

1) Get the large-ops test passing at least as reliably as before (by increasing the timeout to a value that neutron-large-ops reliably meets), in order to prevent further regressions.
2) Improve nova/neutron performance so that neutron-large-ops passes reliably with the lower timeout defined in https://review.openstack.org/#/c/52639/.

Step 1 can be this bug; step 2 sounds a bit more involved, so maybe it should be a high-priority / critical bug for Icehouse.

Revision history for this message
Joe Gordon (jogo) wrote :

Make that, I think step 2 should be a blueprint: make-neutron-not-slow-as-molasses

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

For #1: I can pass the test with 150 VMs locally (a 3GB RAM VM with 2 vCPUs running on macOS) in 250 seconds.

For #2:
Dims has a patch which, while not a complete fix, improves nova/neutron communication by cutting down the keystone chatter.
We can do a blueprint and call it make-neutron-not-slow-as-molasses, but it will target nova :) Also, molasses doesn't move anyway... so perhaps you might want to use a snail!

Jokes apart, there are three aspects to this task:
1 - Reduction of chatter between nova and neutron (either a critical bug or a blueprint, but I'd prefer the former, as it conveys a sense of urgency and does not really add a new feature/capability or require extensive code changes).

This one is possibly the most important from the perspective of the large-ops (and possibly the parallel) tests.

2 - Moving network allocation from the compute node to the API node (there has already been a lot of discussion about this on the mailing list); this is a blueprint, which I think arosen already registered some time ago. It needs discussion and smoothing out of details, especially wrt baremetal, but it would be good to land it in Icehouse.
3 - Reducing contention in the neutron server, especially around IPAM. This is another delicate task, as it might involve major changes to neutron's IPAM logic.

Revision history for this message
Joe Gordon (jogo) wrote :

Salvatore, that sounds like a great plan to me. How about setting the timeout to 300 seconds, just to be on the safe side? That would still be faster than the original timeout of 400. As for #2, a bug sounds good. It sounds like you have identified some of the major issues and we have a path forward.

Revision history for this message
Robert Collins (lifeless) wrote :

@Salvatore the MAC address constraints thing is also used by PowerVM, which like Baremetal has limits on what MACs can be used by VMs.

Joe Gordon (jogo)
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Am I right in assuming that this:

https://review.openstack.org/#/c/56075/

addressed this bug? I can no longer see the job failing.

Revision history for this message
Joe Gordon (jogo) wrote :

Armando, according to https://jenkins02.openstack.org/job/gate-tempest-devstack-vm-neutron-large-ops/buildTimeTrend it looks like large-ops is still failing every so often, but it's unclear to me whether it's failing due to the same bugs the other tests are hitting or a different one.

In short, I think we should try to get this test a little more stable before gating on it.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

With all the stability work, this bug is fixed:
http://markmail.org/message/zu6hba5r6ry4yaom

Changed in neutron:
status: In Progress → Invalid
Changed in nova:
status: Confirmed → Fix Committed
Revision history for this message
Sean Dague (sdague) wrote :
Changed in neutron:
status: Invalid → Confirmed
importance: Undecided → Critical
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

[16:37] <dims> sdague, there were a few causes that caused gate-tempest-devstack-vm-neutron-large-ops to fail. a lot of them got fixed. we should be opening a new bug with specific fingerprint like the logstash query you posted.
[16:38] <dims> sdague, the 3 reviews in the patch series https://review.openstack.org/#/c/57711/ should clear up the Timed Out issue i believe

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Opened a new bug for just the Timed Out fingerprint
https://bugs.launchpad.net/neutron/+bug/1254890

Changed in neutron:
status: Confirmed → Invalid
Revision history for this message
Anita Kuno (anteaya) wrote :
Changed in nova:
milestone: none → icehouse-1
Revision history for this message
Joe Gordon (jogo) wrote :

This returned on December 3rd.

Changed in nova:
status: Fix Committed → Confirmed
Changed in neutron:
status: Invalid → Confirmed
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → icehouse-2
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :
Revision history for this message
Anita Kuno (anteaya) wrote :

Occurrences have slowed down since Dec. 5th, but there are still a few out there: about 7 hits between the 5th and the writing of this comment.

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → icehouse-2
Revision history for this message
Anita Kuno (anteaya) wrote :
Aaron Rosen (arosen)
Changed in neutron:
assignee: Anita Kuno (anteaya) → Aaron Rosen (arosen)
Revision history for this message
Joe Gordon (jogo) wrote :

So I think this bug may actually be fixed; there is only one case of this happening in the gate, http://logs.openstack.org/28/61028/3/gate/gate-tempest-dsvm-neutron-large-ops/531c143/ , and that one doesn't appear to be caused by a timeout. There is some bug in that run, but it's unclear what it is.

Revision history for this message
Mark McClain (markmcclain) wrote :

Showing 14 hits for the last 7 days. I think we should keep this open longer.

Revision history for this message
Brocade Jenkins (shivharis) wrote :

Salvatore,

re: http://logs.openstack.org/44/54044/7/check/gate-tempest-devstack-vm-neutron-large-ops/7fdf4a3/

I found something troubling in the logs; maybe you have seen it before:

100 vms are being spawned, starting at:
1st VM spawned at: 2013-12-12 15:51:36.293
...
100th VM spawned at: 2013-12-12 15:51:36.357

1st request of (Waiting for 1st VM Active) issued at: 2013-12-12 15:51:36.476
143rd request of (Waiting for 1st VM Active) issued at: 2013-12-12 15:51:43.683

It is supposed to wait a second between requests, but it seems to have made all 143 checks in 7 seconds and then given up.
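
For comparison, the expected behaviour is a loop along these lines (a hedged sketch, not tempest's actual waiter code): one GET per build_interval seconds until the server goes ACTIVE or build_timeout elapses, so 143 polls should have taken on the order of 143 seconds rather than 7:

    import time

    def wait_for_active(show_server, server_id, build_interval=1, build_timeout=196):
        # show_server is any callable returning the server dict for server_id.
        start = time.time()
        while time.time() - start < build_timeout:
            if show_server(server_id)["status"] == "ACTIVE":
                return True
            time.sleep(build_interval)   # 143 checks in 7 seconds means this pause is being skipped
        return False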

Revision history for this message
Russell Bryant (russellb) wrote :

Once Joe updated large-ops to spawn 50 instances instead of 100, I think this went away.

Changed in nova:
status: Confirmed → Invalid
milestone: icehouse-2 → none
Revision history for this message
Mark McClain (markmcclain) wrote :

I agree with Russell's assessment and will close this bug.

Changed in neutron:
importance: Critical → Undecided
milestone: icehouse-2 → none
status: Confirmed → Invalid