API locks up during serialization, Number of nodes to serialize impacts lockup time

Bug #1479090 reported by Leontii Istomin
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Critical
Assigned to: Ihor Kalnytskyi
Milestone: 7.0

Bug Description

We have found that if we deploy 50 nodes, the Fuel UI and Fuel API are unreachable for around 5 minutes after the deployment is initiated.

cluster configuration:
Baremetal, Ubuntu, IBP, HA, Neutron-vlan, Ceph-all, Nova-debug, Nova-quotas, Sahara, Murano, 7.0-98
Controllers: 3, Computes+Ceph: 47

api: '1.0'
astute_sha: 34e0493afa22999c4a07d3198ceb945116ab7932
auth_required: true
build_id: 2015-07-27_09-24-22
build_number: '98'
feature_groups:
- mirantis
fuel-agent_sha: 2a65f11c10b0aeb5184247635a19740fc3edde21
fuel-library_sha: 39c3162ee2e2ff6e3af82f703998f95ff4cc2b7a
fuel-ostf_sha: 94a483c8aba639be3b96616c1396ef290dcc00cd
fuelmain_sha: 921918a3bd3d278431f35ad917989e46b0c24100
nailgun_sha: d5c19f6afc66b5efe3c61ecb49025c1002ccbdc6
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 58c411d87a7eaf0fd6892eae2b5cb1eff4190c98
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-28_18-28-50.tar.xz

Tags: scale
Dina Belova (dbelova)
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
milestone: none → 7.0
description: updated
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Revision history for this message
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

The same situation is reproducible on a 200-node cluster; there it hangs for 20 minutes. It sounds like there is some linear dependency on the number of nodes.

Revision history for this message
Dina Belova (dbelova) wrote :

Just a comment: on 200 nodes this delay takes more than 25 minutes; we still don't know exactly how long, since the environment is still hanging.

Changed in fuel:
importance: High → Critical
Revision history for this message
Leontii Istomin (listomin) wrote :

At the moment the environment with 200 nodes has been hanging for more than 2 hours.

Revision history for this message
Dina Belova (dbelova) wrote :

The environment came back. The Fuel API was unreachable at least from July 29th 14:25:54 to July 29th 16:32:34.

Dina Belova (dbelova)
summary: - UI and API are unreachable during 5 minutes if deploy 50 nodes
+ UI and API are unreachable during 5 minutes (50 nodes deploy) and for
+ 2.5 hours (if 200 nodes are deployed)
Revision history for this message
Andrew Woodward (xarses) wrote : Re: UI and API are unreachable during 5 minutes (50 nodes deploy) and for 2.5 hours (if 200 nodes are deployed)

I was able to reproduce the non-linear scaling on the fake UI with 8 nodes.

test with 0 - 8 nodes http://paste.openstack.org/show/406322/

nodes |  time | derivative | time per node
    0 |   475 |          0 | 0
    1 |  1075 |        600 | 600
    2 |  1793 |        718 | 659
    3 |  2643 |        850 | 722.6666666667
    4 |  3788 |       1145 | 828.25
    5 |  4706 |        918 | 846.2
    6 |  6313 |       1607 | 973
    7 |  8431 |       2118 | 1136.5714285714
    8 | 10498 |       2067 | 1252.875

(times in msec)

There is a clear correlation between the number of nodes and how long the serializer takes to finish returning the data.
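
For what it's worth, the "derivative" and "time per node" columns are plain arithmetic on the raw times; a quick reconstruction of how they appear to be derived (my own arithmetic, not Andrew's test script):

    # Measured request times in msec; the index is the number of nodes
    # (0 nodes is the baseline).
    times = [475, 1075, 1793, 2643, 3788, 4706, 6313, 8431, 10498]

    for nodes, msec in enumerate(times):
        derivative = msec - times[nodes - 1] if nodes else 0
        per_node = (msec - times[0]) / float(nodes) if nodes else 0
        print(nodes, msec, derivative, per_node)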

Andrew Woodward (xarses)
summary: - UI and API are unreachable during 5 minutes (50 nodes deploy) and for
- 2.5 hours (if 200 nodes are deployed)
+ API locks up during serialization, Number of nodes to serialize impacts
+ lock time
summary: API locks up during serialization, Number of nodes to serialize impacts
- lock time
+ lockup time
Revision history for this message
Andrew Woodward (xarses) wrote :

Output from a Python profile of nailgun.api.v1.handlers.orchestrator.DefaultDeploymentInfo (pstats columns: ncalls, tottime, percall, cumtime, percall):

http://paste.openstack.org/show/406326/
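
For reference, a minimal sketch of how this kind of profile can be captured (the exact invocation isn't shown here, so the stand-in handler call below is an assumption on my part):

    import cProfile
    import pstats

    def call_handler():
        # Hypothetical stand-in: anything that exercises the
        # DefaultDeploymentInfo handler, e.g. a test-client GET
        # against its URL.
        pass

    profiler = cProfile.Profile()
    profiler.enable()
    call_handler()
    profiler.disable()

    # Produces the same kind of report as below: call counts and
    # cumulative time per function, sorted by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)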

6749077 function calls (6496262 primitive calls) in 36.288 seconds

1) There appear to be a lot of calls to get_node_networks:

      305 0.032 0.000 31.528 0.103 ./fuel-web/nailgun/nailgun/network/manager.py:637(get_node_networks)

2) This results in a lot of calls to _get_admin_node_network:
      305 0.028 0.000 25.372 0.083 ./fuel-web/nailgun/nailgun/network/manager.py:581(_get_admin_node_network)

3) Then we see that get_node_network_by_netname is called 8 fewer times (8 being the number of nodes, suspicious...):

      297 0.000 0.000 30.560 0.103 ./fuel-web/nailgun/nailgun/network/manager.py:601(get_node_network_by_netname)

We can see that https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/network/manager.py#L601-605 is actually inefficient: it relies on https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/network/manager.py#L638-657, which makes networks*3+1 DB calls (with the default of 4 networks that's 4*3+1 = 13) instead of the 3 needed for the data we want.

I think we can cut the DB trips to roughly a quarter if we change get_node_networks to call get_node_network_by_netname with an empty filter, moving the fetching and filtering logic into get_node_network_by_netname.
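
To illustrate the shape of that change, here is a simplified, self-contained sketch (names and data are illustrative only; the real Nailgun methods work on SQLAlchemy objects and issue real queries):

    def _query_networks(node_id):
        # Stand-in for the expensive part: in Nailgun this is the handful
        # of DB queries that build the admin network plus one entry per
        # network group assigned to the node.
        return [
            {"name": "fuelweb_admin", "ip": "10.20.0.3/24"},
            {"name": "management", "ip": "192.168.0.2/24"},
            {"name": "storage", "ip": "192.168.1.2/24"},
            {"name": "public", "ip": "172.16.0.2/24"},
        ]

    def get_node_network_by_netname(node_id, netname=None):
        # Fetching and filtering now live here; an empty filter returns
        # everything, so get_node_networks needs no query path of its own.
        networks = _query_networks(node_id)
        if netname is None:
            return networks
        return next((net for net in networks if net["name"] == netname), None)

    def get_node_networks(node_id):
        return get_node_network_by_netname(node_id)

Whether the filter is applied in memory or pushed down into the query, the point is that a by-name lookup no longer has to pay for the full networks*3+1 query path.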

Looking at what changed in the calling of get_node_network_by_netname, we can see that the calls come from the new neutron serializer:
https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/neutron_serializers.py#L823

I'm not 100% sure, but I think it comes down to the difference between the old static list and the new dynamic one:
https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/neutron_serializers.py#L846-847
vs
https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/neutron_serializers.py#L279-291

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Sylwester Brzeczkowski (sbrzeczkowski)
Revision history for this message
Dina Belova (dbelova) wrote :

Folks, let's propose the fix ASAP so we can try it the next time we redeploy the 200-node environment. Thanks in advance.

Changed in fuel:
assignee: Sylwester Brzeczkowski (sbrzeczkowski) → Igor Kalnitsky (ikalnitsky)
Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

I've found the reason why it's so slow:

        6 0.000 0.000 6.323 1.054 /home/ikalnitsky/devel/fuel-web/nailgun/nailgun/orchestrator/neutron_serializers.py:951(network_provider_node_attrs)
      216 0.007 0.000 6.222 0.029 /home/ikalnitsky/devel/fuel-web/nailgun/nailgun/network/manager.py:701(get_node_networks)
      210 0.001 0.000 6.027 0.029 /home/ikalnitsky/devel/fuel-web/nailgun/nailgun/network/manager.py:665(get_node_network_by_netname)

With 6 nodes the "get_node_networks" function is called 216 times; with 5 nodes it's 155, and with 4 it's 104. Obviously the call count grows faster than linearly with the number of nodes. I'm thinking about how to bring that number down.
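
As a quick sanity check of that progression (my own arithmetic on the reported counts, not taken from the profile itself):

    # get_node_networks call counts per node count; the value for 3 nodes
    # comes from the comparison table in the merged fix further down.
    calls = {3: 63, 4: 104, 5: 155, 6: 216}

    # The increments (41, 51, 61) grow by a constant 10, i.e. the count
    # grows quadratically; every value matches 5*n**2 + 6*n exactly.
    for n, count in calls.items():
        assert count == 5 * n ** 2 + 6 * n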

Meanwhile, we have a mule... I don't understand why our API is unresponsive. The logs you have provided don't contain much information; for instance, uwsgi.log and nginx_access.log are empty.

Changed in fuel:
assignee: Igor Kalnitsky (ikalnitsky) → Sylwester Brzeczkowski (sbrzeczkowski)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/207997

Changed in fuel:
assignee: Sylwester Brzeczkowski (sbrzeczkowski) → Igor Kalnitsky (ikalnitsky)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Sylwester Brzeczkowski (<email address hidden>) on branch: master
Review: https://review.openstack.org/207917
Reason: abandoned due to https://review.openstack.org/#/c/207997/3

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/207997
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=31ca9f3ffc0f4e3457595970d3e2f532526b704c
Submitter: Jenkins
Branch: master

commit 31ca9f3ffc0f4e3457595970d3e2f532526b704c
Author: Igor Kalnitsky <email address hidden>
Date: Fri Jul 31 18:10:49 2015 +0300

    Decrease amount of 'get_node_networks` calls

    We had a lot of 'get_node_networks' calls for the same node. This
    function call is too expensive and takes a lot of time. It cause some
    serious problems on scale, since its call number was increasing with
    a progression. Since now, it's predictable and takes only 3 * N times.

    Here's the quick comparison:

      NODES | WAS | NOW
      ----------------
        6 | 216 | 18 // was: 6.88s, now: 0.62s
        5 | 155 | 15
        4 | 104 | 12
        3 | 63 | 9

    Partial-Bug: #1479090

    Change-Id: Ia8ea44149fbb3c90aa015ad5ac0f03f4cd9d33ec
    Signed-off-by: Igor Kalnitsky <email address hidden>

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

There are no new logs, so there's no way to investigate the original reason why uWSGI's listen queue had no free slots. Since the bug was partially fixed by addressing the severe performance degradation, I'm moving it to Fix Committed.

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Leontii Istomin (listomin) wrote :

The issue hasn't been reproduced, at least with build 7.0-293.

Changed in fuel:
status: Fix Committed → Fix Released