maximum 61 "running" instances, others shutting down

Bug #462140 reported by paul guermonprez
This bug affects 2 people

Affects              Status        Importance  Assigned to        Milestone
Eucalyptus           Invalid       Undecided   chris grzegorczyk
eucalyptus (Ubuntu)  Fix Released  Medium      Unassigned
Lucid                Won't Fix     Medium      Unassigned

Bug Description

hello

I have a cluster of 6 machines plus a master, 384 cores total,
reported correctly in the AVAILABILITYZONE listing:
AVAILABILITYZONE |- vm types free / max cpu ram disk
AVAILABILITYZONE |- m1.small 0090 / 0384 1 128 2

But when I launch VMs, only 61 end up staying in the running state; the others just shut down (as seen by describe-instances).
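
For reference, the number of running instances can be counted with a one-liner like the one used later in this thread (a sketch; the state column in the describe-instances output is literally "running"):

euca-describe-instances | grep -c running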

The config is pretty standard, except for the number of VMs per subnet, changed to a higher value.
I am launching the instances with --addressing private to avoid networking limitations.

Attached are the logs and config files.

regards

Tags: eucalyptus uec
Revision history for this message
paul guermonprez (paul-guermonprez-intel) wrote :
Revision history for this message
Neil Soman (neilsoman) wrote :

This issue seems similar to one we ran into on the upstream side, which revnos 931 and 933 fixed.

I can attempt to reproduce it from source against the latest upstream.

Revision history for this message
Neil Soman (neilsoman) wrote :

Paul,

Are these overprovisioned nodes, or are those 384 real cores?

It looks like the front end is timing out while processing requests because the nodes may not be responding fast enough.

What about disk? Can you walk us through your hardware config?

thanks
neil

Revision history for this message
paul guermonprez (paul-guermonprez-intel) wrote :

Neil, we are talking about 384 cores as seen by summing what each node OS reports (64 per node).
No special trick to overprovision the nodes.

In the new Intel "Nehalem"-class machines, there is a technology called "hyperthreading" (remember the Pentium 4? same thing).
There are 192 physical cores, but they are seen as 384 cores by the OS.

That's the way it is supposed to be deployed in production, VT included.
But I can always try to disable the feature in the BIOS if needed.

thanks

Revision history for this message
Neil Soman (neilsoman) wrote :

Nope, hyperthreading should be fine.

I strongly suspect that addresses per subnet in eucalyptus.conf is either set to 64 (that would give you 61 instances max and is consistent with your behavior), or that you increased the value but didn't restart the CC (it needs a "cleanrestart").
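
For reference, a minimal sketch of the relevant setting and a clean CC restart (the config path is assumed for the Ubuntu packages; the restart command is the karmic-proposed syntax quoted later in this thread):

# /etc/eucalyptus/eucalyptus.conf on the cluster controller
# usable instances per security group is roughly this value minus 3 reserved addresses
VNET_ADDRSPERNET="64"

# apply the change with a clean restart so the CC re-reads its network state
sudo restart eucalyptus CLEAN=1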

Can you post your eucalyptus.conf?

thanks
neil

Revision history for this message
paul guermonprez (paul-guermonprez-intel) wrote :

hello

The conf is in the attached tar file, but I changed the value from 32 (the default) to 512 at the beginning of the install (I ran into this VNET_ADDRSPERNET problem yesterday); 64 was never entered.

I've just rebooted the entire cluster and run more tests. I can have more than 61 instances running, but some are shutting down.
I am launching with "--addressing private" to avoid public IP limitations (just in case).

So 61 does not seem to be a hard threshold, and the problem does not seem to be related to VNET_ADDRSPERNET.

Will run more tests tomorrow morning CET.

thanks paul

Revision history for this message
paul guermonprez (paul-guermonprez-intel) wrote :

New tests (v27.1): rebooted the cluster, created security groups, set
VNET_ADDRSPERNET="512"
VNET_PUBLICIPS="192.168.3.1-192.168.5.254"

Launching VMs:

export EMI=emi-4158125D
euca-run-instances $EMI -n 50 -k mykey -t c1.medium --addressing private --group group2
euca-run-instances $EMI -n 50 -k mykey -t c1.medium --addressing private --group group3
euca-run-instances $EMI -n 50 -k mykey -t c1.medium --addressing private --group group4
euca-run-instances $EMI -n 50 -k mykey -t c1.medium --addressing private --group group5
euca-run-instances $EMI -n 50 -k mykey -t c1.medium --addressing private --group group6
euca-run-instances $EMI -n 50 -k mykey -t c1.medium --addressing private --group group7
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group8

The VMs are launching fine, but only 32 end up running; the others are terminated.
When I had the "ADDRSPERNET" problem, the VMs were not able to launch at all.

For the same cluster setup, yesterday it was 64 VMs, now it's 32 ...

Result: some entire groups are being terminated ??? (groups 2, 5, 6, 7 ...)
euca-describe-instances
RESERVATION r-4792084D admin group8
INSTANCE i-206F04DF emi-4158125D 172.19.28.6 172.19.28.6 running mykey 4 c1.medium 2009-10-28T11:24:28.107Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-2C2305CC emi-4158125D 172.19.28.5 172.19.28.5 running mykey 3 c1.medium 2009-10-28T11:24:28.107Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-36F706F4 emi-4158125D 172.19.28.9 172.19.28.9 running mykey 7 c1.medium 2009-10-28T11:24:28.107Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-392C06E4 emi-4158125D 172.19.28.10 172.19.28.10 running mykey 8 c1.medium 2009-10-28T11:24:28.107Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-39C70871 emi-4158125D 172.19.28.8 172.19.28.8 running mykey 6 c1.medium 2009-10-28T11:24:28.107Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-3C330773 emi-4158125D 172.19.28.17 172.19.28.17 running mykey 15 c1.medium 2009-10-28T11:24:28.108Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-408106F7 emi-4158125D 172.19.28.21 172.19.28.21 running mykey 19 c1.medium 2009-10-28T11:24:28.108Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-4206079D emi-4158125D 172.19.28.12 172.19.28.12 running mykey 10 c1.medium 2009-10-28T11:24:28.107Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-44DF08E1 emi-4158125D 172.19.28.18 172.19.28.18 running mykey 16 c1.medium 2009-10-28T11:24:28.108Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-450D08F8 emi-4158125D 172.19.28.16 172.19.28.16 running mykey 14 c1.medium 2009-10-28T11:24:28.108Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-455E0828 emi-4158125D 172.19.28.13 172.19.28.13 running mykey 11 c1.medium 2009-10-28T11:24:28.107Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-47360859 emi-4158125D 172.19.28.19 172.19.28.19 running mykey 17 c1.medium 2009-10-28T11:24:28.108Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-50240846 emi-4158125D 172.19.28.15 172.19.28.15 running mykey 13 c1.medium 2009-10-28T11:24:28.108Z intelone eki-65F8175B eri-48E316E3
INSTANCE i-51910966 emi-4158125D 172.19.28.7 172.19.28.7 running mykey 5 c1.m...


Revision history for this message
Thierry Carrez (ttx) wrote :

Thanks a lot for this valuable testing on your impressive configuration.

I see the following in cc.log:

running cmd '///usr/lib/eucalyptus/euca_rootwrap ip addr add 172.19.16.1/23 broadcast 172.19.17.255 dev eth0:priv label eth0:priv'
could not bring up new device eth0 with ip 172.19.16.1
failed to add gateway IP to device eth0

I suspect VNET_ADDRSPERNET=512 is having a side-effect here.
Could you try VNET_ADDRSPERNET=100 and see if you can reliably start up ~98 instances?
Or try VNET_ADDRSPERNET=32 and start 100 instances in 4 separate groups of 25?
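
A sketch of that second test, reusing the command pattern from Paul's earlier runs (the EMI and key name are the ones already used in this thread; the group names are illustrative):

# in eucalyptus.conf: VNET_ADDRSPERNET="32", then clean-restart the CC
export EMI=emi-4158125D
for i in 1 2 3 4; do
    euca-add-group -d "Test group $i" testgroup$i
    euca-run-instances $EMI -n 25 -k mykey -t c1.medium --addressing private --group testgroup$i
done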

Thanks in advance!

Changed in eucalyptus (Ubuntu):
importance: Undecided → High
status: New → Incomplete
Revision history for this message
paul guermonprez (paul-guermonprez-intel) wrote :
Revision history for this message
paul guermonprez (paul-guermonprez-intel) wrote :

VNET_ADDRSPERNET="32"
see logs_3.tar.bz2 ...

Result: euca-describe-instances | grep running | wc -l = 75

export EMI=emi-4158125D
euca-add-group -d "Group 10" group10
euca-add-group -d "Group 11" group11
euca-add-group -d "Group 12" group12
euca-add-group -d "Group 13" group13
euca-add-group -d "Group 14" group14
euca-add-group -d "Group 15" group15
euca-add-group -d "Group 16" group16
euca-add-group -d "Group 17" group17
euca-add-group -d "Group 18" group18
euca-add-group -d "Group 19" group19
euca-add-group -d "Group 20" group20
euca-add-group -d "Group 21" group21
euca-add-group -d "Group 22" group22
euca-add-group -d "Group 23" group23
euca-add-group -d "Group 24" group24
euca-add-group -d "Group 25" group25
euca-add-group -d "Group 26" group26
euca-add-group -d "Group 27" group27
euca-add-group -d "Group 28" group28
euca-add-group -d "Group 29" group29
euca-add-group -d "Group 30" group30
euca-add-group -d "Group 31" group31
euca-add-group -d "Group 32" group32
euca-add-group -d "Group 33" group33
euca-add-group -d "Group 34" group34
euca-add-group -d "Group 35" group35
euca-add-group -d "Group 36" group36
euca-add-group -d "Group 37" group37
euca-add-group -d "Group 38" group38
euca-add-group -d "Group 39" group39
euca-add-group -d "Group 40" group40
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group10
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group11
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group12
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group13
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group14
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group15
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group16
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group17
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group18
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group19
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group20
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group21
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group22
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group23
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group24
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group25
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group26
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group27
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group28
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group29
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addressing private --group group30
euca-run-instances $EMI -n 20 -k mykey -t c1.medium --addr...


Revision history for this message
Thierry Carrez (ttx) wrote :

We need some upstream advice on this...

Note that VNET_PUBLICIPS="192.168.3.1-192.168.5.254" is, I think, incorrect. VNET_PUBLICIPS only accepts ranges within a single class C. You should write instead:
VNET_PUBLICIPS="192.168.3.1-192.168.3.255 192.168.4.0-192.168.4.255 192.168.5.0-192.168.5.254"
Not sure it has an effect here, though, since you are now testing with --addressing private.

Changed in eucalyptus (Ubuntu):
status: Incomplete → Confirmed
tags: added: eucalyptus
Nick Barcet (nijaba)
tags: added: uec
Revision history for this message
Thierry Carrez (ttx) wrote :

Paul, upstream says they haven't hit such a limit in their testing; they think it must come from some configuration issue or state carried over from test to test.

Dan: could you please confirm that you don't run into any limit running the same test that Paul ran in comment 10, namely:
VNET_ADDRSPERNET="32"
Create 30 security groups
For each group, run: euca-run-instances -n 20 --addressing private --group X
Paul can only get 75 machines running, with the others being terminated *by security group*.
Can you also confirm that the VNET_PUBLICIPS value shouldn't factor into that test, since we are using private addressing?

Paul: could you confirm that you start the test from a pristine state? That includes rebooting (or restarting eucalyptus) if you are using the current karmic packages, or running "sudo restart eucalyptus CLEAN=1" if you are running the current karmic-proposed packages.

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Marking incomplete, as we have outstanding requests for information from both Paul and Dan.

I have loaded up my (much smaller) Lucid cluster here with 56 VMs without a problem.

Revision history for this message
Thierry Carrez (ttx) wrote :

Marking incomplete/medium per the last comment.

Changed in eucalyptus (Ubuntu):
importance: High → Medium
status: Confirmed → Incomplete
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Targeting A3, as we should test this setup in our scalability testing next week at the Distro Sprint.

Changed in eucalyptus (Ubuntu):
milestone: none → lucid-alpha-3
Thierry Carrez (ttx)
Changed in eucalyptus (Ubuntu Lucid):
milestone: lucid-alpha-3 → none
Revision history for this message
Thierry Carrez (ttx) wrote :

Unnominating for Lucid, as this can't be reproduced so far...

Changed in eucalyptus (Ubuntu Lucid):
status: Incomplete → Won't Fix
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

We ran 96 instances in Lucid a few minutes ago. I really believe this is fix-released in Lucid. Please re-open with detailed information if anyone reproduces in Lucid.

Please make sure that VNET_ADDRSPERNET is higher than the number of instances you want to run, plus 3. So if you want to run 61 instances, make sure that VNET_ADDRSPERNET is at least 64.
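
To make the arithmetic concrete (a sketch; the three reserved addresses come from the rule in this comment, and the values are illustrative):

usable instances per security group = VNET_ADDRSPERNET - 3
VNET_ADDRSPERNET="64"   ->  up to 61 instances
VNET_ADDRSPERNET="128"  ->  up to 125 instances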

Changed in eucalyptus (Ubuntu):
status: Incomplete → Fix Released
Changed in eucalyptus:
status: New → Invalid
assignee: nobody → chris grzegorczyk (chris-grze)