Machines fail to deploy because cloud-init needs to accept both netplan spellings for grat arp

Bug #1827238 reported by Jason Hobbs
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Undecided
Andres Rodriguez
cloud-init
Fix Released
Medium
Ryan Harper
curtin
Invalid
Undecided
Unassigned

Bug Description

Many nodes failed to boot after installation.

Here is one example, beartic.

beartic.
finishes install: 2019-05-01T10:36:17+00:00 beartic systemd[1]: Stopping Unattended Upgrades Shutdown...

dhcp's after reboot:
10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:10+00:00 leafeon dhcpd[21312]: DHCPDISCOVER from 14:02:ec:41:c7:dc via broam
10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:10+00:00 leafeon dhcpd[21312]: DHCPOFFER on 10.244.41.7 to 14:02:ec:41:c7:dc via broam
10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:13+00:00 leafeon dhcpd[21312]: DHCPREQUEST for 10.244.41.7 (10.244.40.30) from 14:02:ec:41:c7:dc via broam
10.244.40.32/var/log/maas/rsyslog/leafeon/2019-05-01/messages:2019-05-01T10:38:13+00:00 leafeon dhcpd[21312]: DHCPACK on 10.244.41.7 to 14:02:ec:41:c7:dc via broam

grub and grub.cfg:
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7
10.244.40.30/var/log/maas/rackd.log:2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7

but we never got any rsyslog message or api calls after that.

Related branches

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :
Download full text (4.2 KiB)

So this is what I see on the logs:

1. On rackd.log on .32, I see the machine PXE boot to start the deployment process:

2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by 10.244.41.7
2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by 10.244.41.7
2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7

2. On rackd.log on .30, I see it pxe boot post-deployment (and its told to localboot):

2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 10.244.41.7
2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info] grubx64.efi requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/command.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/fs.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg requested by 10.244.41.7
2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info] /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7

3. I see that curtin has run the deployment process and hasn't reported any errors - log: https://pastebin.ubuntu.com/p/zMgTttxdSj/ | curtin config: https://pastebin.ubuntu.com/p/Y2ZMX6Rstd/

So, from all the information above, I don't think we have enough information to know what the issue is.

A. The machine was never instructed to localboot.
B. The machine was instructed to localboot, but grub failed.
C. The machine booted onto the disk, but either didn't get network or failed to contact metadata.
D. There is a firmw...

Read more...

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1827238] Re: 2.6beta2: many nodes failed deployment with time out
Download full text (4.4 KiB)

On Wed, May 1, 2019 at 11:40 AM Andres Rodriguez <email address hidden>
wrote:

> So this is what I see on the logs:
>
> 1. On rackd.log on .32, I see the machine PXE boot to start the
> deployment process:
>
> 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info]
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info]
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:32:33 provisioningserver.rackdservices.tftp: [info]
> grubx64.efi requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/command.lst requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/fs.lst requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info]
> /grub/grub.cfg requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningserver.rackdservices.tftp: [info]
> /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
> 2019-05-01 10:32:34 provisioningserver.rackdservices.http: [info]
> /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-kernel requested by
> 10.244.41.7
> 2019-05-01 10:32:36 provisioningserver.rackdservices.http: [info]
> /images/ubuntu/amd64/ga-18.04/bionic/daily/boot-initrd requested by
> 10.244.41.7
> 2019-05-01 10:32:58 provisioningserver.rackdservices.http: [info]
> /images/ubuntu/amd64/ga-18.04/bionic/daily/squashfs requested by 10.244.41.7
>
> 2. On rackd.log on .30, I see it pxe boot post-deployment (and its told
> to localboot):
>
> 2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info]
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:38:13 provisioningserver.rackdservices.tftp: [info]
> bootx64.efi requested by 10.244.41.7
> 2019-05-01 10:38:14 provisioningserver.rackdservices.tftp: [info]
> grubx64.efi requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/command.lst requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/fs.lst requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/crypto.lst requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info]
> /grub/x86_64-efi/terminal.lst requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info]
> /grub/grub.cfg requested by 10.244.41.7
> 2019-05-01 10:38:15 provisioningserver.rackdservices.tftp: [info]
> /grub/grub.cfg-14:02:ec:41:c7:dc requested by 10.244.41.7
>
>
> 3. I see that curtin has run the deployment process and hasn't reported
> any errors - log: https://pastebin.ubuntu.com/p/zMgTttxdSj/ | curtin
> config: https://pastebin.ubuntu.com/p/Y2ZMX6Rstd/
>
> So, from all the information above, I don't think we have enough
> information to know what the issue is.
>
> A. The machine was never in...

Read more...

Changed in maas:
status: New → Incomplete
Revision history for this message
Ryan Harper (raharper) wrote : Re: 2.6beta2: many nodes failed deployment with time out

I don't see any errors during this install for curtin. It installs grub2 uefi, adds the entry (0017), reorders back to boot from network (0016).

Please reopen the curtin task if you believe there's a curtin error during the installation.

Changed in curtin:
status: New → Invalid
tags: added: cdo-release-blocker
Revision history for this message
Andres Rodriguez (andreserl) wrote :

So there are no apparent issues from the MAAS perspective, we really need to see console logs to be able to determine what's the issue. Keeping this as incomplete for MAAS until console logs can be provided.

PS. You can add kernel params to the machine to log to console, and setup conserver-server to remotely gather the information automatically without manual intervention (once conserver-server has been setup of course).

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Do you have a tcpdump of the network traffic? That is needed to be able to determine what the issue is. We need to inspect the HTTP headers that is being sent from nginx to the client. Specifically the "Content-Length" header to see if there is a mismatch there.

It is possible that the correct Content-Length is being sent to the client, but the client is either closing the HTTP connection to soon or a TCP reset is occurring. That will still report as a 200 response on the server side as the response was 200. The difference will be if the actual amount of data sent to the client matches the Content-Length of the response HTTP header.

Knowing if those mismatch will be the first step, because there is a few outcomes:

1. The server is reading the file wrong and setting the Content-Length not to the actual file length.
2. The server is sending the file and something occurs for the file handler to be closed so the streaming action is stopped (I would expect a log in nginx errors if this were to occur, but maybe not)
3. The client is closing the connection before reading all the data being streamed from the server.
4. A TCP reset is occurring break the TCP connection, giving the same result of #3 but caused by TCP reset instead of client disconnect.

Also the output of the whole tree structure of /var/lib/maas/boot-resources and byte sizes (in bytes) so the Content-Length can be compared to the actual file.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I'm struggling to find the right parameters to get serial console. I did notice we get this error when booting after the install

Booting local disk...
Failed to open \efi\boot\grubx64.efi - Not Found
Failed to load image \efi\boot\grubx64.efi: Not Found
start_image() returned Not Found

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

console output: http://paste.ubuntu.com/p/HyCmWrHYqw/

[ 55.404228] cloud-init[1039]: self._handle_bond_bridge(command, cmd_type='bond')
[ 55.416254] cloud-init[1039]: File "/usr/lib/python3/dist-packages/cloudinit/net/network_state.py", line 690, in _handle_bond_bridge
[ 55.428150] cloud-init[1039]: item_params.get('parameters', {}).items())
[ 55.440297] cloud-init[1039]: File "/usr/lib/python3/dist-packages/cloudinit/net/network_state.py", line 689, in <genexpr>
[ 55.452161] cloud-init[1039]: 'params': dict((v2key_to_v1[k], v) for k, v in
[ 55.464226] cloud-init[1039]: KeyError: 'gratuitous-arp'

Revision history for this message
Andres Rodriguez (andreserl) wrote :

MAAS sends this snippet:

network:
  bonds:
    bond0:
      interfaces:
      - eth6
      - eth7
      macaddress: 00:11:0a:66:2e:24
      mtu: 9000
      parameters:
        down-delay: 0
        gratuitous-arp: 1
        mii-monitor-interval: 100
        mode: active-backup
        transmit-hash-policy: layer2
        up-delay: 0

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: Machines fail to deploy because cloud-init misspells gratuitous-arp as gratuitious-arp

To document what the issue is, MAAS sends the key 'gratuitous-arp', following netplan, however cloud-init misspells this key as 'gratuitious-arp'. MAAS will *workaround* this issue until cloud-init fixes it.

summary: - 2.6beta2: many nodes failed deployment with time out
+ Machines fail to deploy because cloud-init misspells gratuitous-arp as
+ gratuitious-arp
Changed in maas:
milestone: none → 2.6.0rc1
assignee: nobody → Andres Rodriguez (andreserl)
status: New → In Progress
Revision history for this message
Ryan Harper (raharper) wrote :

cloud-init included the only (mis)spelling of gratuitous-arp, per

https://bugs.launchpad.net/netplan/+bug/1756701

This is now fixed in netplan and released. Cloud-init needs to accept both values now.

summary: - Machines fail to deploy because cloud-init misspells gratuitous-arp as
- gratuitious-arp
+ Machines fail to deploy because cloud-init needs to accept both netplan
+ spellings for grat arp
Changed in cloud-init:
importance: Undecided → Medium
status: New → In Progress
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Ryan Harper (raharper) wrote :

The misspelling is part of the original netplan; it has been fixed but only in bionic and newer. For max compatibility it's best to always render/supply the misspelling. Cloud-init will accept either key, but for now will only render the misspelling as all netplan releases support this key.

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 1827238] Re: Machines fail to deploy because cloud-init needs to accept both netplan spellings for grat arp

On Sat, May 04, 2019 at 12:34:00AM -0000, Ryan Harper wrote:
> The misspelling is part of the original netplan; it has been fixed but
> only in bionic and newer. For max compatibility it's best to always
> render/supply the misspelling. Cloud-init will accept either key, but
> for now will only render the misspelling as all netplan releases support
> this key.

In the original netplan, it's also misspelled in the output to networkd, so
probably never worked? From xenial, and bionic release pocket:

./src/networkd.c: g_string_append_printf(params, "\nGratuitiousARP=%d", def->bond_params.gratuitious_arp);

In the NM backend that's spelled 'num_grat_arp' in the output so should
work, but that's also largely irrelevant for cloud-init I would think.

Changed in maas:
milestone: 2.6.0rc1 → 2.6.0beta3
Dan Watkins (oddbloke)
Changed in cloud-init:
assignee: nobody → Ryan Harper (raharper)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit ded1ec81 to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=ded1ec81

Changed in cloud-init:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Ryan Harper (raharper) wrote : Fixed in cloud-init version 19.2.

This bug is believed to be fixed in cloud-init in version 19.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.