Comment 5 for bug 1850642

Revision history for this message
Ryan Harper (raharper) wrote : Re: Cloud init unable to find the metadata service but can CURL it

The logs do contain that bug, but I'm not sure that's the failure here.

Looking at the cloud-init logs, we can see the Ephemeral DHCP start and obtain an lease, but the end point does not respond:

2019-10-30 13:28:41,516 - dhcp.py[DEBUG]: Received dhcp lease on eth0 for 136.156.90.74/255.255.254.0
2019-10-30 13:28:41,516 - __init__.py[DEBUG]: Attempting setup of ephemeral network on eth0 with 136.156.90.74/23 brd 136.156.91.255
2019-10-30 13:28:41,516 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '136.156.90.74/23', 'broadcast', '136.156.91.255', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,519 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,522 - util.py[DEBUG]: Running command ['ip', 'route', 'show', '0.0.0.0/0'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,525 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '136.156.91.254', 'dev', 'eth0', 'src', '136.156.90.74'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,527 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', 'default', 'via', '136.156.91.254', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:29:21,573 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.044 seconds
2019-10-30 13:29:21,574 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/openstack' with {'url': 'http://169.254.169.254/openstack', 'headers': {'User-Agent': 'Cloud-Init/18.5'}, 'allow_redirects': True, 'method': 'GET', 'timeout': 10.0} configuration
2019-10-30 13:29:31,608 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [10/-1s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /openstack (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdc5e5dc210>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))]
2019-10-30 13:29:31,608 - DataSourceOpenStack.py[DEBUG]: Giving up on OpenStack md from ['http://169.254.169.254/openstack'] after 10 seconds
2019-10-30 13:29:31,609 - util.py[DEBUG]: Crawl of metadata service took 50.079 seconds

Cloud-init did not find the OpenStack datasource at local time:

2019-10-30 13:29:31,689 - main.py[DEBUG]: No local datasource found

So we write out a fallback network config (dhcp on eth0).

2019-10-30 13:29:31,722 - stages.py[INFO]: Applying network configuration from fallback bringup=False: {'version': 1, 'config': [{'subnets': [{'type': 'dhcp'}], 'type': 'physical', 'name': 'eth0', 'mac_address': 'fa:16:3e:f4:b5:1d'}]}
2019-10-30 13:29:31,723 - __init__.py[DEBUG]: Selected renderer 'sysconfig' from priority list: None
2019-10-30 13:29:31,725 - util.py[DEBUG]: Writing to /etc/sysconfig/network-scripts/ifcfg-eth0 - wb: [644] 159 bytes
2019-10-30 13:29:31,726 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-eth0 (recursive=False)
2019-10-30 13:29:31,726 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-eth0 (recursive=False)

And then exit cloud-init-local.service:

2019-10-30 13:29:31,730 - util.py[DEBUG]: cloud-init mode 'init' took 52.389 seconds (52.39)

Now, the OS networking layer comes up:

Oct 30 13:29:31.623535 localhost.localdomain cloud-init[732]: 2019-10-30 13:29:31,619 - util.py[WARNING]: No active metadata service found
Oct 30 13:29:31.751295 localhost.localdomain systemd[1]: Started Initial cloud-init job (pre-networking).
Oct 30 13:29:31.754472 localhost.localdomain systemd[1]: Reached target Network (Pre).
Oct 30 13:29:31.756812 localhost.localdomain systemd[1]: Starting LSB: Bring up/down networking...
Oct 30 13:29:31.961050 localhost.localdomain network[866]: Bringing up loopback interface: [ OK ]
Oct 30 13:29:31.990878 localhost.localdomain network[866]: Bringing up interface eth0:
Oct 30 13:29:32.034411 localhost.localdomain dhclient[995]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8 (xid=0xeceb89e)
Oct 30 13:29:32.055157 localhost.localdomain dhclient[995]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0xeceb89e)
Oct 30 13:29:32.055188 localhost.localdomain dhclient[995]: DHCPOFFER from 136.156.90.11
Oct 30 13:29:32.055796 localhost.localdomain dhclient[995]: DHCPACK from 136.156.90.11 (xid=0xeceb89e)
Oct 30 13:29:34.132189 localhost.localdomain NET[1054]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Oct 30 13:29:34.166284 host-136-156-90-74 dhclient[995]: bound to 136.156.90.74 -- renewal in 34571 seconds.
Oct 30 13:29:34.167466 host-136-156-90-74 network[866]: Determining IP information for eth0... done.
Oct 30 13:29:34.224637 host-136-156-90-74 network[866]: [ OK ]
Oct 30 13:29:34.245397 host-136-156-90-74 systemd[1]: Started LSB: Bring up/down networking.
Oct 30 13:29:34.249783 host-136-156-90-74 systemd[1]: Reached target Network.

And cloud-init.service starts up, and this time it is able to connect to the IMDS:

2019-10-30 13:29:34,638 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceOpenStack.DataSourceOpenStack'>
2019-10-30 13:29:34,638 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: New instance first boot
2019-10-30 13:29:34,638 - util.py[DEBUG]: Running command ['systemd-detect-virt', '--quiet', '--container'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:29:34,641 - util.py[DEBUG]: Running command ['running-in-container'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:29:34,644 - util.py[DEBUG]: Running command ['lxc-is-container'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:29:34,648 - util.py[DEBUG]: Reading from /proc/1/environ (quiet=False)
2019-10-30 13:29:34,648 - util.py[DEBUG]: Read 0 bytes from /proc/1/environ
2019-10-30 13:29:34,648 - util.py[DEBUG]: Reading from /proc/self/status (quiet=False)
2019-10-30 13:29:34,648 - util.py[DEBUG]: Read 1206 bytes from /proc/self/status
2019-10-30 13:29:34,648 - util.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_name
2019-10-30 13:29:34,648 - util.py[DEBUG]: Reading from /sys/class/dmi/id/product_name (quiet=False)
2019-10-30 13:29:34,648 - util.py[DEBUG]: Read 18 bytes from /sys/class/dmi/id/product_name
2019-10-30 13:29:34,648 - util.py[DEBUG]: dmi data /sys/class/dmi/id/product_name returned OpenStack Compute
2019-10-30 13:29:34,654 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 0.006 seconds
2019-10-30 13:29:34,654 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/openstack' with {'url': 'http://169.254.169.254/openstack', 'headers': {'User-Agent': 'Cloud-Init/18.5'}, 'allow_redirects': True, 'method': 'GET', 'timeout': 10.0} configuration
2019-10-30 13:29:36,359 - url_helper.py[DEBUG]: Read from http://169.254.169.254/openstack (200, 94b) after 1 attempts
...
2019-10-30 13:29:37,717 - util.py[DEBUG]: Crawl of openstack metadata service took 1.358 seconds
2019-10-30 13:29:37,718 - __init__.py[WARNING]: Error persisting instance-data.json: 'utf8' codec can't decode byte 0xd5 in position 1: invalid continuation byte
2019-10-30 13:29:37,718 - handlers.py[DEBUG]: finish: init-network/search-OpenStack: SUCCESS: found network data from DataSourceOpenStack
2019-10-30 13:29:37,718 - stages.py[INFO]: Loaded datasource DataSourceOpenStack - DataSourceOpenStack [net,ver=2]

So we know we're OpenStack, and as Scott pointed out, it suffers from the py27 json bug.

For this bug, the warning is not a False positive, rather, there seems to be an issue with connecting to the metadata service early in boot. I suspect we'd need some debugging enabled to see why we can't connect to the endpoint after a successful dhclient.

One interesting bit would be to compare the leases file from cloud-init Ephemeral DHCP, with the system dhcp. Another thought is if the image has any custom dhclient hooks that setup additional route information needed to hit the metadata service.

We've seen that bug before,

https://bugs.launchpad.net/cloud-init/+bug/1821102

Which is not in 18.5, so possibly a dupe of that issue.