Wait for network before downloading ssh credentials or user-data

Bug #308530 reported by Eric Hammond
This bug affects 5 people
Affects              Status        Importance  Assigned to  Milestone
Ubuntu on EC2        Invalid       Medium      Unassigned
  Hardy              Invalid       Undecided   Unassigned
  Intrepid           Invalid       Undecided   Unassigned
  Jaunty             Invalid       Undecided   Unassigned
VMBuilder            Invalid       Undecided   Unassigned
ec2-init (Ubuntu)    Fix Released  Undecided   Unassigned
  Hardy              Fix Released  Critical    Scott Moser
  Intrepid           Won't Fix     Medium      Unassigned

Bug Description

When an instance boots quickly on Amazon EC2 the init code can occasionally try to use the network before the network is working.

This may not happen on most boots, but it is definitely worth adding in some code to check for the network before attempting to do critical tasks like:
- download ssh public key
- download user-data

Here is a sample of the script I use to download ssh credentials on the existing Ubuntu EC2 images:

  http://ec2ubuntu.googlecode.com/svn/trunk/etc/init.d/ec2-get-credentials

Note the code like the following which waits until the network is working before it continues:

perl -MIO::Socket::INET -e 'until(new IO::Socket::INET("169.254.169.254:80")){sleep 1}'
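
For readers who prefer Python, the same wait loop could be sketched roughly as follows. This is only an illustration of the technique, not code from the ec2-get-credentials script; the function name `wait_for_tcp` and its `interval`, `timeout`, `connect` and `sleep` parameters are hypothetical, and the optional timeout is an addition (the Perl one-liner retries forever).

```python
import socket
import time


def wait_for_tcp(host, port, interval=1.0, timeout=None,
                 connect=socket.create_connection, sleep=time.sleep):
    """Block until a TCP connection to (host, port) succeeds.

    Returns True once a connection is made, or False if `timeout`
    seconds pass first. With timeout=None it retries forever, like
    the Perl one-liner. `connect` and `sleep` are injectable only so
    the loop can be exercised without a real network.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # 2-second cap per attempt so one hung connect cannot stall the loop.
            conn = connect((host, port), 2.0)
            conn.close()
            return True
        except OSError:
            if deadline is not None and time.monotonic() >= deadline:
                return False
            sleep(interval)
```

An init script would call `wait_for_tcp("169.254.169.254", 80)` before trying to download the ssh public key or user-data.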

Until this check was added, people experienced random failures where an AMI would boot ok, but they were unable to ssh in because the ssh credentials download had failed.

Many non-Ubuntu AMIs do not experience this problem because they boot so slowly :)

Eric Hammond (esh)
Changed in vmbuilder:
status: New → Invalid
Eric Hammond (esh) wrote :

After a fair amount of testing, I have not experienced this problem on the current official Ubuntu beta AMI for Intrepid. This might be because Intrepid starts more slowly than the Hardy AMI I build (2min vs 40sec). It's possible that if an official Hardy AMI were released it would experience this problem. It's also possible that as more people use the Ubuntu kernel these AMIs will boot faster, so I still recommend tossing in the fix.

Rick Clark (dendrobates)
Changed in ubuntu-on-ec2:
importance: Undecided → Medium
milestone: none → beta3
status: New → In Progress
Eric Hammond (esh) wrote :

I have experienced this again with the latest official Ubuntu Hardy image (ami-5d59be34) and we now have reports from at least one other user who is experiencing problems that could be caused by this timing issue:

  http://groups.google.com/group/ec2ubuntu/browse_thread/thread/996d2911c2926ccf

This will become even more of an issue as the startup processes are moved earlier in the boot process (e.g., for #370628).

I have attached my console output. In particular, note:

   * Setting EC2 defaults [fail]
   * Fetching EC2 login credentials [fail]

JCornett (john-cornett) wrote :

Just wanted to add my name to the list of those affected by this bug. I get the instance up and running but cannot log in with the key I used. The console log via elasticfox shows a network failure retrieving the needed information when it happens.

Anecdotally, I'm seeing about 60-70% failures during slow parts of the day, and far fewer (though still some) during typically heavier usage times (afternoon/evening), presumably because launches take longer then.

I am concerned about using this image for production right now because I may not be able to get a new instance up quickly enough for my clients due to the failure to attain credentials. I would prefer to use the Hardy version as it is a LTS server.

By the way... outstanding work on the great Ubuntu EC2 offerings! I would complain, but fast launch times are a great problem to have as long as we can actually use the instance. :) Keep up the great work and thanks for everything you all do for us.

Eric Hammond (esh) wrote :

Soren, you attached a branch ".../jaunty.waitfornetwork" with a patch which looks like it will solve the problem. However, I'd like to make sure that this fix is also applied to the next Hardy and Intrepid images and not just Jaunty. The variety of branches being used to build these images seems like it is ripe for error, but perhaps somebody more organized than I is keeping track of everything :-/

Chuck Short (zulcss) wrote :

Eric,

It will be.

chuck

Christopher Armstrong (radix) wrote :

I think I may be running into this problem with Landscape. I'm not sure exactly how you guys are going to solve this problem, but if it requires changing each init script which needs the network to do something special, then landscape-client's init script will also need to be updated (since it checks for EC2 user data as well). If so, please ping me and I'll be able to make the changes upstream on Landscape.

Christopher Armstrong (radix) wrote :

Okay, yeah, I just looked at Soren's branch. If that's the way this problem is going to be fixed, then something is also going to need to change in landscape-client, since landscape-client has a lower priority than ec2-init. Not only that, landscape-client (25) is even lower than networking (40), which is another bug in itself. I've filed bug #383336 in landscape-client.

Abernix (abernix) wrote :

I believe I have also experienced this issue. The 169.254.169.254 address could not be reached when the /etc/init.d/ec2-init script ran, which left me unable to ssh into the server with my private key. The instance had to be terminated and relaunched. It worked fine the second time.

Abernix (abernix) wrote :

As an addition to my previous note, this is happening 4 out of 5 times for me in the new availability zone (us-east-1d), presumably because instances start up so quickly while the load on the new zone is relatively low.

Eric Hammond (esh)
description: updated
Chuck Short (zulcss) wrote :

This is fixed for the next version.

Changed in ubuntu-on-ec2:
status: In Progress → Fix Committed
Scott Moser (smoser) wrote :

I've verified this is fixed in http://ec2-images.ubuntu.com/karmic/20090811.2/ubuntu-ec2-karmic-amd64.img.gz (ec2-init 0.4.99-0ubuntu2).

/etc/init.d/ec2-init invokes 'ec2-wait-for-meta-data-service' as the first thing in its 'start'.

That program (/usr/bin/ec2-wait-for-meta-data-service) waits for the service, so the script will not move on until the meta-data service is up and running.

I tested that ec2-wait-for-meta-data-service blocks on my local system, using a test environment set up from https://wiki.ubuntu.com/UEC/Images/Testing. I brought down the httpd daemon on the host system and ran ec2-wait-for-meta-data-service from inside the guest. It blocked until I started the httpd daemon on the host system again.
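
As a rough illustration of the "wait first, then fetch" behaviour described above, a retrying fetch could look like the sketch below. It is not the actual ec2-wait-for-meta-data-service implementation, which is not shown in this report; the function name `fetch_when_ready`, the `METADATA_BASE` constant, the retry count and the meta-data paths used in the usage note are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request

# Assumed base URL; only the 169.254.169.254 address comes from this report.
METADATA_BASE = "http://169.254.169.254/latest"


def fetch_when_ready(path, retries=30, interval=1.0,
                     urlopen=urllib.request.urlopen, sleep=time.sleep):
    """Fetch METADATA_BASE + path, retrying while the service is unreachable.

    Returns the response body as bytes, or None if the service answers
    with HTTP 404 (the service is up but the item does not exist).
    Raises RuntimeError after `retries` failed attempts so the caller
    can report a clear failure instead of silently skipping the step.
    """
    url = METADATA_BASE + path
    last_error = None
    for _ in range(retries):
        try:
            with urlopen(url, timeout=5) as response:
                return response.read()
        except urllib.error.HTTPError as exc:
            if exc.code == 404:
                return None
            last_error = exc
        except OSError as exc:
            # Covers URLError, refused connections and unreachable networks.
            last_error = exc
        sleep(interval)
    raise RuntimeError(f"{url} not reachable after {retries} attempts: {last_error}")
```

Under these assumptions, `fetch_when_ready("/meta-data/public-keys/0/openssh-key")` would return the ssh public key once the service responds, and `fetch_when_ready("/user-data")` would return None for an instance launched without user-data.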

Please re-open if you believe otherwise.

Changed in ubuntu-on-ec2:
status: Fix Committed → Fix Released
Scott Moser (smoser)
Changed in ec2-init (Ubuntu):
status: New → Confirmed
Scott Moser (smoser)
Changed in ec2-init (Ubuntu):
status: Confirmed → Fix Released
Anoop P B (anoop-pb) wrote :

The bug continues to exist in the official Hardy AMIs:
https://help.ubuntu.com/community/EC2StartersGuide#Getting the images

Will it be fixed for Hardy (LTS)?

It will also be nice to have a shiny new 8.04.3 image (current one is based off of 8.04.2 and pulls a lot of updates)

Scott Moser (smoser) wrote : Re: [Bug 308530] Re: Wait for network before downloading ssh credentials or user-data

> Will it be fixed for Hardy (LTS)?

We're going to re-spin the hardy image, and will include this fix.

> It will also be nice to have a shiny new 8.04.3 image (current one is
> based off of 8.04.2 and pulls a lot of updates)
Agreed.

Scott Moser (smoser)
tags: added: ec2-images uec-images
Scott Moser (smoser)
Changed in ubuntu-on-ec2:
status: Fix Released → Invalid
Scott Moser (smoser)
Changed in ec2-init (Ubuntu Hardy):
importance: Undecided → Critical
Changed in ec2-init (Ubuntu Intrepid):
importance: Undecided → Medium
Scott Moser (smoser) wrote :

I'm looking at the ubuntu-on-ec2 ppa at https://launchpad.net/~ubuntu-on-ec2/+archive/ppa . Both https://launchpad.net/%7Eubuntu-on-ec2/+archive/ppa/+sourcepub/667702/+listing-archive-extra (hardy) and https://launchpad.net/%7Eubuntu-on-ec2/+archive/ppa/+sourcepub/667837/+listing-archive-extra (intrepid) have a checkServer() call and sleeps.

Since we do not have ec2-init in hardy, I'm marking this fix-committed there. I believe if we re-spin an image we should have the fix.

Changed in ec2-init (Ubuntu Hardy):
status: New → Fix Committed
Anoop P B (anoop-pb) wrote :

Scott, thanks for that.

Will the new images be spun out soon?
Since the current images are nearly unusable, can't wait to get Ubuntu on our EC2 boxes :)

Also, do EC2 images follow a similar schedule to that of the regular Ubuntu ISO releases?

pwolanin (pwolanin) wrote :

We are also eagerly/urgently anticipating this fix for 8.04. Is there some other place to track it or any expected release date?

ChipAnderson (chipa) wrote :

We hit this same problem trying to use the "Official" Hardy Server AMI images that are listed in the Amazon ECS Management Console. The only AMI we can connect to is the i386 version ("ami-5d59be34") on the m1.small type of server. Using the AMD version of the AMI ("ami-2959be40") or using the i386 version on the larger server types results in not being able to connect to the instance using SSH.

Was a new "Official" Hardy AMI image "spun out" from this fix? I can't find any "official" AMIs on the Amazon site that were created after this bug was supposedly fixed.

Hardy is important to us because of its LTS support.

Scott Moser (smoser) wrote :

This is definitely not fixed in the hardy ppa. I've verified it's busted.

Changed in ec2-init (Ubuntu Hardy):
status: Fix Committed → In Progress
assignee: nobody → Scott Moser (smoser)
Scott Moser (smoser) wrote :

Marking this fix-committed. I've backported ec2-init 0.4.999 from karmic's version to hardy (0.4.999-0ubuntu5~hardy1) and that is available in the ubuntu-on-ec2 ppa (https://launchpad.net/~ubuntu-on-ec2/+archive/ppa?field.series_filter=hardy).

Images are available on ec2 with the necessary changes (build 20091105 or later).

Changed in ec2-init (Ubuntu Hardy):
status: In Progress → Fix Committed
Scott Moser (smoser) wrote :

Also, I wanted to mention here that I had an email conversation with Martin from rightscale. The backport of ec2-init removes the rightscale-init script that the old package included. He said this was fine with them, and that they can make use of the user-data script functionality as they're doing in karmic.

Anoop P B (anoop-pb) wrote :

The Amazon EC2 site still shows me only the old images:
ami-5d59be34 - canonical hardy 32bit
ami-2959be40 - canonical hardy 64bit

Where are the new builds?

Scott Moser (smoser) wrote :

eu-west-1 ami-cf1932bb ubuntu-images-eu/ubuntu-hardy-8.04-amd64-server-20091130.manifest.xml
eu-west-1 ami-c31932b7 ubuntu-images-eu/ubuntu-hardy-8.04-i386-server-20091130.manifest.xml
us-east-1 ami-4428ca2d ubuntu-images-us/ubuntu-hardy-8.04-amd64-server-20091130.manifest.xml
us-east-1 ami-7e28ca17 ubuntu-images-us/ubuntu-hardy-8.04-i386-server-20091130.manifest.xml

Changed in ec2-init (Ubuntu Hardy):
status: Fix Committed → Fix Released
Scott Moser (smoser)
Changed in ec2-init (Ubuntu Intrepid):
status: New → Won't Fix