API locks up during serialization, Number of nodes to serialize impacts lockup time

Bug #1479090 reported by Leontii Istomin
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Critical
Assigned to: Ihor Kalnytskyi
Milestone: 7.0

Bug Description

We have found that if we deploy 50 nodes, the Fuel UI and Fuel API are unreachable for around 5 minutes after the deployment is initiated.

cluster configuration:
Baremetal, Ubuntu, IBP, HA, Neutron-vlan, Ceph-all, Nova-debug, Nova-quotas, Sahara, Murano, 7.0-98
Controllers: 3, Computes+Ceph: 47

api: '1.0'
astute_sha: 34e0493afa22999c4a07d3198ceb945116ab7932
auth_required: true
build_id: 2015-07-27_09-24-22
build_number: '98'
feature_groups:
- mirantis
fuel-agent_sha: 2a65f11c10b0aeb5184247635a19740fc3edde21
fuel-library_sha: 39c3162ee2e2ff6e3af82f703998f95ff4cc2b7a
fuel-ostf_sha: 94a483c8aba639be3b96616c1396ef290dcc00cd
fuelmain_sha: 921918a3bd3d278431f35ad917989e46b0c24100
nailgun_sha: d5c19f6afc66b5efe3c61ecb49025c1002ccbdc6
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 58c411d87a7eaf0fd6892eae2b5cb1eff4190c98
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-28_18-28-50.tar.xz

Tags: scale
Dina Belova (dbelova)
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
milestone: none → 7.0
description: updated
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Revision history for this message
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

The same situation is reproducible on a 200-node cluster; there it hangs for 20 minutes. It sounds like there is some linear dependency on the number of nodes.

Revision history for this message
Dina Belova (dbelova) wrote :

Just a comment: on 200 nodes this delay takes more than 25 minutes; we still don't know exactly how long, since the environment is still hanging.

Changed in fuel:
importance: High → Critical
Revision history for this message
Leontii Istomin (listomin) wrote :

At the moment the environment with 200 nodes has been hanging for more than 2 hours.

Revision history for this message
Dina Belova (dbelova) wrote :

The environment came back. The Fuel API was unreachable at least from July 29th 14:25:54 to July 29th 16:32:34.

Dina Belova (dbelova)
summary: - UI and API are unreachable during 5 minutes if deploy 50 nodes
+ UI and API are unreachable during 5 minutes (50 nodes deploy) and for
+ 2.5 hours (if 200 nodes are deployed)
Revision history for this message
Andrew Woodward (xarses) wrote : Re: UI and API are unreachable during 5 minutes (50 nodes deploy) and for 2.5 hours (if 200 nodes are deployed)

I was able to reproduce the non-linear scaling on the fake UI with 8 nodes.

test with 0 - 8 nodes http://paste.openstack.org/show/406322/

nodes |  time | derivative | time per node
    0 |   475 |          0 | 0
    1 |  1075 |        600 | 600
    2 |  1793 |        718 | 659
    3 |  2643 |        850 | 722.6666666667
    4 |  3788 |       1145 | 828.25
    5 |  4706 |        918 | 846.2
    6 |  6313 |       1607 | 973
    7 |  8431 |       2118 | 1136.5714285714
    8 | 10498 |       2067 | 1252.875

(times in msec)

There is a clear correlation between the number of nodes and how long the serializer takes to finish returning the data.
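
For what it's worth, the "derivative" and "time per node" columns are plain arithmetic on the raw times; a quick reconstruction of how they appear to be derived (my own arithmetic, not Andrew's test script):

    # Measured request times in msec; the index is the number of nodes
    # (0 nodes is the baseline).
    times = [475, 1075, 1793, 2643, 3788, 4706, 6313, 8431, 10498]

    for nodes, msec in enumerate(times):
        derivative = msec - times[nodes - 1] if nodes else 0
        per_node = (msec - times[0]) / float(nodes) if nodes else 0
        print(nodes, msec, derivative, per_node)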

Andrew Woodward (xarses)
summary: - UI and API are unreachable during 5 minutes (50 nodes deploy) and for
- 2.5 hours (if 200 nodes are deployed)
+ API locks up during serialization, Number of nodes to serialize impacts
+ lock time
summary: API locks up during serialization, Number of nodes to serialize impacts
- lock time
+ lockup time
Revision history for this message
Andrew Woodward (xarses) wrote :

Output from a Python profile of nailgun.api.v1.handlers.orchestrator.DefaultDeploymentInfo (pstats columns: ncalls, tottime, percall, cumtime, percall):

http://paste.openstack.org/show/406326/
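
For reference, a minimal sketch of how this kind of profile can be captured (the exact invocation isn't shown here, so the stand-in handler call below is an assumption on my part):

    import cProfile
    import pstats

    def call_handler():
        # Hypothetical stand-in: anything that exercises the
        # DefaultDeploymentInfo handler, e.g. a test-client GET
        # against its URL.
        pass

    profiler = cProfile.Profile()
    profiler.enable()
    call_handler()
    profiler.disable()

    # Produces the same kind of report as below: call counts and
    # cumulative time per function, sorted by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)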

6749077 function calls (6496262 primitive calls) in 36.288 seconds

1) There appear to be a lot of calls to get_node_networks:

      305 0.032 0.000 31.528 0.103 ./fuel-web/nailgun/nailgun/network/manager.py:637(get_node_networks)

2) This results in a lot of calls to _get_admin_node_network:
      305 0.028 0.000 25.372 0.083 ./fuel-web/nailgun/nailgun/network/manager.py:581(_get_admin_node_network)

3) Then we see that get_node_network_by_netname is called 8 fewer times (8 being the number of nodes, suspicious...):

      297 0.000 0.000 30.560 0.103 ./fuel-web/nailgun/nailgun/network/manager.py:601(get_node_network_by_netname)

We can see that https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/network/manager.py#L601-605 is actually inefficient: it relies on https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/network/manager.py#L638-657, which makes networks*3+1 DB calls (with the default of 4 networks that's 4*3+1 = 13) instead of the 3 needed for the data we want.

I think we can cut the DB trips to roughly a quarter if we change get_node_networks to call get_node_network_by_netname with an empty filter, moving the fetching and filtering logic into get_node_network_by_netname.
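
To illustrate the shape of that change, here is a simplified, self-contained sketch (names and data are illustrative only; the real Nailgun methods work on SQLAlchemy objects and issue real queries):

    def _query_networks(node_id):
        # Stand-in for the expensive part: in Nailgun this is the handful
        # of DB queries that build the admin network plus one entry per
        # network group assigned to the node.
        return [
            {"name": "fuelweb_admin", "ip": "10.20.0.3/24"},
            {"name": "management", "ip": "192.168.0.2/24"},
            {"name": "storage", "ip": "192.168.1.2/24"},
            {"name": "public", "ip": "172.16.0.2/24"},
        ]

    def get_node_network_by_netname(node_id, netname=None):
        # Fetching and filtering now live here; an empty filter returns
        # everything, so get_node_networks needs no query path of its own.
        networks = _query_networks(node_id)
        if netname is None:
            return networks
        return next((net for net in networks if net["name"] == netname), None)

    def get_node_networks(node_id):
        return get_node_network_by_netname(node_id)

Whether the filter is applied in memory or pushed down into the query, the point is that a by-name lookup no longer has to pay for the full networks*3+1 query path.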

Looking at what changed in the calling of get_node_network_by_netname, we can see that the calls come from the new neutron serializer:
https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/neutron_serializers.py#L823

I'm not 100% sure, but I think it comes down to the difference between the old static list and the new dynamic one:
https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/neutron_serializers.py#L846-847
vs
https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/neutron_serializers.py#L279-291

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Sylwester Brzeczkowski (sbrzeczkowski)
Revision history for this message
Dina Belova (dbelova) wrote :

Folks, let's propose the fix ASAP so we can try it the next time we redeploy the 200-node environment. Thanks in advance.

Changed in fuel:
assignee: Sylwester Brzeczkowski (sbrzeczkowski) → Igor Kalnitsky (ikalnitsky)
Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

I've found the reason why it's so slow:

        6 0.000 0.000 6.323 1.054 /home/ikalnitsky/devel/fuel-web/nailgun/nailgun/orchestrator/neutron_serializers.py:951(network_provider_node_attrs)
      216 0.007 0.000 6.222 0.029 /home/ikalnitsky/devel/fuel-web/nailgun/nailgun/network/manager.py:701(get_node_networks)
      210 0.001 0.000 6.027 0.029 /home/ikalnitsky/devel/fuel-web/nailgun/nailgun/network/manager.py:665(get_node_network_by_netname)

With 6 nodes the "get_node_networks" function is called 216 times; with 5 nodes it's 155, and with 4 it's 104. Obviously the call count grows faster than linearly with the number of nodes. I'm thinking about how to bring that number down.
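
As a quick sanity check of that progression (my own arithmetic on the reported counts, not taken from the profile itself):

    # get_node_networks call counts per node count; the value for 3 nodes
    # comes from the comparison table in the merged fix further down.
    calls = {3: 63, 4: 104, 5: 155, 6: 216}

    # The increments (41, 51, 61) grow by a constant 10, i.e. the count
    # grows quadratically; every value matches 5*n**2 + 6*n exactly.
    for n, count in calls.items():
        assert count == 5 * n ** 2 + 6 * n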

Meanwhile, we have a mule... I don't understand why our API is unresponsive. The logs you have provided don't contain much information; for instance, uwsgi.log and nginx_access.log are empty.

Changed in fuel:
assignee: Igor Kalnitsky (ikalnitsky) → Sylwester Brzeczkowski (sbrzeczkowski)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/207997

Changed in fuel:
assignee: Sylwester Brzeczkowski (sbrzeczkowski) → Igor Kalnitsky (ikalnitsky)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Sylwester Brzeczkowski (<email address hidden>) on branch: master
Review: https://review.openstack.org/207917
Reason: abandoned due to https://review.openstack.org/#/c/207997/3

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/207997
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=31ca9f3ffc0f4e3457595970d3e2f532526b704c
Submitter: Jenkins
Branch: master

commit 31ca9f3ffc0f4e3457595970d3e2f532526b704c
Author: Igor Kalnitsky <email address hidden>
Date: Fri Jul 31 18:10:49 2015 +0300

    Decrease amount of 'get_node_networks` calls

    We had a lot of 'get_node_networks' calls for the same node. This
    function call is too expensive and takes a lot of time. It cause some
    serious problems on scale, since its call number was increasing with
    a progression. Since now, it's predictable and takes only 3 * N times.

    Here's the quick comparison:

      NODES | WAS | NOW
      ----------------
        6 | 216 | 18 // was: 6.88s, now: 0.62s
        5 | 155 | 15
        4 | 104 | 12
        3 | 63 | 9

    Partial-Bug: #1479090

    Change-Id: Ia8ea44149fbb3c90aa015ad5ac0f03f4cd9d33ec
    Signed-off-by: Igor Kalnitsky <email address hidden>

Revision history for this message
Ihor Kalnytskyi (ikalnytskyi) wrote :

There are no new logs, so there's no way to investigate the original reason why uWSGI's listen queue had no free slots. Since the bug was partially fixed by addressing the severe performance degradation, I'm moving it to Fix Committed.

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Leontii Istomin (listomin) wrote :

The issue hasn't been reproduced, at least with build 7.0-293.

Changed in fuel:
status: Fix Committed → Fix Released