nova-compute runaway memory allocation

Bug #938699 reported by Andrew Glen-Young
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Committed
High
Russell Bryant
nova (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

This morning I upgraded my Openstack hosts to the latest packaged version of the nova packages. After the upgrade some of the compute node machines were rather unresponsive.

After investigation it became apparent that nova-compute was nearly exhausting the node's memory. I captured as much data as I dared before having to terminate the process.

The symptoms seem to have gone away after restarting nova-compute again. This occured on three of five compute nodes, but I was unable to reproduce (yet) after restarting nova-compute.

$ ps axfuwww | grep compute
nova 17606 0.0 0.0 48040 2220 ? Ss 13:19 0:00 su -s /bin/sh -c exec nova-compute --flagfile=/etc/nova/nova.conf --flagfile=/etc/nova/nova-compute.conf nova
nova 17607 4.3 48.6 14240940 12000856 ? Dl 13:19 0:13 \_ /usr/bin/python /usr/bin/nova-compute --flagfile=/etc/nova/nova.conf --flagfile=/etc/nova/nova-compute.conf

$ sudo strace -p 17607
Process 17607 attached - interrupt to quit
brk(0x22db06000) = 0x22db06000
brk(0x22db27000) = 0x22db27000
brk(0x22db48000) = 0x22db48000
[...]
brk(0x22e7c9000) = 0x22e7c9000
brk(0x22e7ea000) = 0x22e7ea000
brk(0x22e80b000) = 0x22e80b000
^CProcess 17607 detached

/var/log/nova/nova-compute.log:

[...]
2012-02-22 13:21:12,543 DEBUG nova.compute.manager [-] image_id=e34cdc7d-bfce-4064-a80f-3916cfd65ea8, image_size_bytes=1476395008, allowed_size_bytes=10737418240 from (pid=17607) _check_image_size /usr/lib/python2.7/dist-packages/nova/compute/manager.py:488
2012-02-22 13:21:12,544 AUDIT nova.compute.manager [req-ba92fb06-884f-48f1-89eb-e03b2bb68df3 evand evand_project] instance 761fdfeb-40bb-497e-9fc8-687f4291644b: starting...
2012-02-22 13:21:12,628 DEBUG nova.rpc.common [-] Making asynchronous call on network ... from (pid=17607) multicall /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:318
2012-02-22 13:21:12,629 DEBUG nova.rpc.common [-] MSG_ID is 8f75c505d29a4128a04244f6549a9e75 from (pid=17607) multicall /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:321
2012-02-22 13:21:12,629 DEBUG nova.rpc.common [-] Pool creating new connection from (pid=17607) create /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:57
2012-02-22 13:21:12,633 DEBUG amqplib [-] Start from server, version: 8.0, properties: {u'information': u'Licensed under the MPL. See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2011 VMware, Inc.', u'capabilities': {}, u'platform': u'Erlang/OTP', u'version': u'2.6.1'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US'] from (pid=17607) _start /usr/lib/python2.7/dist-packages/amqplib/client_0_8/connection.py:661
2012-02-22 13:21:12,634 DEBUG amqplib [-] Closed channel #1 from (pid=17607) _do_close /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:90
2012-02-22 13:21:12,635 DEBUG amqplib [-] using channel_id: 1 from (pid=17607) __init__ /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:70
2012-02-22 13:21:12,636 DEBUG amqplib [-] Channel open from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:484
2012-02-22 13:21:12,637 DEBUG nova.compute.manager [-] deallocating network for instance: 94fbd530-96d3-4d74-a202-1091820edcc4 from (pid=17607) _deallocate_network /usr/lib/python2.7/dist-packages/nova/compute/manager.py:579
2012-02-22 13:21:12,638 DEBUG nova.rpc.common [-] Making asynchronous cast on network... from (pid=17607) cast /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:343
2012-02-22 13:21:12,640 DEBUG amqplib [-] Open OK! known_hosts [] from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/connection.py:507
2012-02-22 13:21:12,640 DEBUG amqplib [-] using channel_id: 1 from (pid=17607) __init__ /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:70
2012-02-22 13:21:12,641 DEBUG amqplib [-] Channel open from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:484
2012-02-22 13:21:12,642 INFO nova.rpc.common [-] Connected to AMQP server on 10.55.58.1:5672
2012-02-22 13:21:12,643 DEBUG amqplib [-] Closed channel #1 from (pid=17607) _do_close /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:90
2012-02-22 13:21:12,644 DEBUG amqplib [-] using channel_id: 1 from (pid=17607) __init__ /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:70
2012-02-22 13:21:12,645 DEBUG amqplib [-] Channel open from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:484
2012-02-22 13:21:12,646 DEBUG nova.compute.manager [-] Checking state of 94fbd530-96d3-4d74-a202-1091820edcc4 from (pid=17607) _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:251
2012-02-22 13:21:14,302 INFO nova.virt.libvirt.firewall [-] [instance: 94fbd530-96d3-4d74-a202-1091820edcc4] Attempted to unfilter instance which is not filtered
2012-02-22 13:21:14,302 INFO nova.virt.libvirt.connection [-] [instance: 94fbd530-96d3-4d74-a202-1091820edcc4] Deleting instance files /srv/nova/instances/instance-0000152f
2012-02-22 13:21:15,283 INFO nova.virt.libvirt.connection [-] [instance: 94fbd530-96d3-4d74-a202-1091820edcc4] Instance destroyed successfully.
2012-02-22 13:21:15,286 DEBUG amqplib [-] Closed channel #1 from (pid=17607) _do_close /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:90
2012-02-22 13:21:15,287 DEBUG amqplib [-] using channel_id: 1 from (pid=17607) __init__ /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:70
2012-02-22 13:21:15,288 DEBUG amqplib [-] Channel open from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:484
2012-02-22 13:21:15,288 DEBUG nova.compute.manager [-] deallocating network for instance: 761fdfeb-40bb-497e-9fc8-687f4291644b from (pid=17607) _deallocate_network /usr/lib/python2.7/dist-packages/nova/compute/manager.py:579
2012-02-22 13:21:15,289 DEBUG nova.rpc.common [-] Making asynchronous cast on network... from (pid=17607) cast /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:343
2012-02-22 13:21:15,290 DEBUG amqplib [-] Closed channel #1 from (pid=17607) _do_close /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:90
2012-02-22 13:21:15,291 DEBUG amqplib [-] using channel_id: 1 from (pid=17607) __init__ /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:70
2012-02-22 13:21:15,292 DEBUG amqplib [-] Channel open from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:484
2012-02-22 13:21:15,292 DEBUG nova.compute.manager [-] Checking state of 761fdfeb-40bb-497e-9fc8-687f4291644b from (pid=17607) _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:251
2012-02-22 13:21:16,987 INFO nova.virt.libvirt.firewall [-] [instance: 761fdfeb-40bb-497e-9fc8-687f4291644b] Attempted to unfilter instance which is not filtered
2012-02-22 13:21:16,987 INFO nova.virt.libvirt.connection [-] [instance: 761fdfeb-40bb-497e-9fc8-687f4291644b] Deleting instance files /srv/nova/instances/instance-0000154c
2012-02-22 13:21:17,918 INFO nova.virt.libvirt.connection [-] [instance: 761fdfeb-40bb-497e-9fc8-687f4291644b] Instance destroyed successfully.
2012-02-22 13:21:17,918 DEBUG amqplib [-] Closed channel #1 from (pid=17607) _do_close /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:90
2012-02-22 13:21:17,919 DEBUG amqplib [-] using channel_id: 1 from (pid=17607) __init__ /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:70
2012-02-22 13:21:17,920 DEBUG amqplib [-] Channel open from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:484
2012-02-22 13:21:17,921 DEBUG nova.compute.manager [-] deallocating network for instance: 94fbd530-96d3-4d74-a202-1091820edcc4 from (pid=17607) _deallocate_network /usr/lib/python2.7/dist-packages/nova/compute/manager.py:579
2012-02-22 13:21:17,922 DEBUG nova.rpc.common [-] Making asynchronous cast on network... from (pid=17607) cast /usr/lib/python2.7/dist-packages/nova/rpc/amqp.py:343
2012-02-22 13:21:17,924 DEBUG amqplib [-] Closed channel #1 from (pid=17607) _do_close /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:90
2012-02-22 13:21:17,925 DEBUG amqplib [-] using channel_id: 1 from (pid=17607) __init__ /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:70
2012-02-22 13:21:17,926 DEBUG amqplib [-] Channel open from (pid=17607) _open_ok /usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py:484
2012-02-22 13:21:17,926 DEBUG nova.compute.manager [-] Checking state of 94fbd530-96d3-4d74-a202-1091820edcc4 from (pid=17607) _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:251
2012-02-22 13:21:19,680 INFO nova.virt.libvirt.firewall [-] [instance: 94fbd530-96d3-4d74-a202-1091820edcc4] Attempted to unfilter instance which is not filtered
2012-02-22 13:21:19,681 INFO nova.virt.libvirt.connection [-] [instance: 94fbd530-96d3-4d74-a202-1091820edcc4] Deleting instance files /srv/nova/instances/instance-0000152f

^^^ The log was not being updated with additional information while nova-compute was running away

$ dpkg-query --show nova-compute
nova-compute 2012.1~e4~20120217.12709-0ubuntu1

$ cat /etc/lsb-release·
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu precise (development branch)"

Please let me know if I can help any further?

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Marking as confirmed as I've run into this last week when deploying to the CI lab. I've only hit it once and was unable to reproduce during subsequent deployments. nova-compute was exhausting all memory and eventually being killed off by OOM killer.

Revision history for this message
Chuck Short (zulcss) wrote :

I think I traced this down to the usb tablet option in nova/virt/libvirt/connection.py

Revision history for this message
Russell Bryant (russellb) wrote :

Chuck: nice find. Turning that option off fixes this for me.

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Russell Bryant (russellb) wrote :

Actually, I take that back. I think I got lucky ... it didn't fix the problem for me. I may not be having the same problem, though.

Revision history for this message
Russell Bryant (russellb) wrote :

This may be fixed by the patch that went in for bug 934575. Can you give that a try?

Revision history for this message
James Troup (elmo) wrote :

We're running with this patch now. Very early results seem positive, but we'll follow up later when it's been running long enough to be sure.

Revision history for this message
James Troup (elmo) wrote : Re: [Bug 938699] Re: nova-compute runaway memory allocation

James Troup <email address hidden> writes:

> We're running with this patch now. Very early results seem positive,
> but we'll follow up later when it's been running long enough to be sure.

| 12018 nova 20 0 1667m 96m 2920 S 2 0.4 3:17.56 nova-compute
Same as it was over 9 hours ago and holding stable on all compute nodes.
Patch is a winner.

--
James

Changed in nova:
assignee: nobody → Russell Bryant (russellb)
status: Confirmed → Fix Committed
milestone: none → essex-4
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.