failed to generate config when interface was renamed

Bug #1983516 reported by Chris Patterson
This bug affects 2 people
Affects     Status        Importance  Assigned to  Milestone
cloud-init  Fix Released  High        Unassigned

Bug Description

2022-08-03 18:42:31,598 - util.py[DEBUG]: Writing to /etc/netplan/50-cloud-init.yaml - wb: [644] 1359 bytes
2022-08-03 18:42:31,598 - subp.py[DEBUG]: Running command ['netplan', 'generate'] with allowed return codes [0] (shell=False, capture=True)
2022-08-03 18:42:31,875 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/eth2'] with allowed return codes [0] (shell=False, capture=True)
2022-08-03 18:42:31,880 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/eth0'] with allowed return codes [0] (shell=False, capture=True)
2022-08-03 18:42:31,956 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/eth7'] with allowed return codes [0] (shell=False, capture=True)
2022-08-03 18:42:31,959 - util.py[WARNING]: failed stage init-local
2022-08-03 18:42:31,959 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 740, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 410, in main_init
    init.apply_network_config(bring_up=bring_up_interfaces)
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 937, in apply_network_config
    return self.distro.apply_network_config(
  File "/usr/lib/python3/dist-packages/cloudinit/distros/__init__.py", line 233, in apply_network_config
    self._write_network_state(network_state)
  File "/usr/lib/python3/dist-packages/cloudinit/distros/debian.py", line 142, in _write_network_state
    return super()._write_network_state(network_state)
  File "/usr/lib/python3/dist-packages/cloudinit/distros/__init__.py", line 129, in _write_network_state
    renderer.render_network_state(network_state)
  File "/usr/lib/python3/dist-packages/cloudinit/net/netplan.py", line 260, in render_network_state
    self._net_setup_link(run=self._postcmds)
  File "/usr/lib/python3/dist-packages/cloudinit/net/netplan.py", line 282, in _net_setup_link
    subp.subp(cmd, capture=True)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 335, in subp
    raise ProcessExecutionError(
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/eth7']
Exit code: 1
Reason: -
Stdout:
Stderr: Load module index
        Parsed configuration file /usr/lib/systemd/network/99-default.link
        Parsed configuration file /usr/lib/systemd/network/73-usb-net-by-mac.link
        Parsed configuration file /run/systemd/network/10-netplan-eth3.link
        Parsed configuration file /run/systemd/network/10-netplan-eth2.link
        Parsed configuration file /run/systemd/network/10-netplan-eth1.link
        Parsed configuration file /run/systemd/network/10-netplan-eth0.link
        Created link configuration context.
        Failed to open device '/sys/class/net/eth7': No such device
        Unload module index
        Unloaded link configuration context.

Chad Smith (chad.smith) wrote :

Thanks @ChrisPatterson for continuing to help us out here on big systems.

Looks like a case where the network rename by the kernel is colliding with cloud-init.

I'm thinking the failure symptom is the following:
  - cloud-init calls get_devicelist and starts looping through the devices found [1]
  - kernel renames some nic and sysfs gets updated
  - cloud-init is unable to finish the loop of calls to 'udevadm', 'test-builtin', 'net_setup_link', <PREVIOUS/STALE_DEVICE_NAME>

We need to better handle this potential race condition in cloud-init and vet whether a rename happened out from under us, or block the renames in the kernel temporarily if we can.
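A minimal sketch of the "vet whether a rename happened out from under us" idea, not the change that eventually landed in cloud-init. The helper names (`setup_links_skipping_stale`, `_udevadm_setup_link`) and the injectable `run`/`exists` parameters are hypothetical and exist only so the logic can be exercised without real hardware. The assumption is that a device whose sysfs path has disappeared by the time `udevadm` fails was renamed or removed mid-loop and can be skipped, while a failure on a path that still exists is a real error.

```python
import os
import subprocess
from typing import Callable, Iterable, List

SYS_CLASS_NET = "/sys/class/net/"


def _udevadm_setup_link(path: str) -> None:
    # Same command that appears in the traceback above.
    subprocess.run(
        ["udevadm", "test-builtin", "net_setup_link", path],
        check=True,
        capture_output=True,
    )


def setup_links_skipping_stale(
    names: Iterable[str],
    run: Callable[[str], None] = _udevadm_setup_link,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Run `run` for each device name; return the names that vanished mid-loop."""
    stale: List[str] = []
    for name in names:
        path = SYS_CLASS_NET + name
        try:
            run(path)
        except subprocess.CalledProcessError:
            # Re-check sysfs: if the path is still there, the failure is
            # not explained by a rename, so surface it to the caller.
            if exists(path):
                raise
            stale.append(name)
    return stale
```

With the device list from the original report, a caller would expect `eth7` to come back in the returned list instead of aborting the init-local stage. Note that the re-check is itself racy; it only narrows the window.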

References:

[1] https://github.com/canonical/cloud-init/blob/main/cloudinit/net/netplan.py#L279-L284

Changed in cloud-init:
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
Chad Smith (chad.smith) wrote :

I think I'll mark this High, and we can discuss mitigation steps here tomorrow.

Frode Nordahl (fnordahl) wrote :

FWIW, this issue is affecting me as well. I only see it on real hardware, and adding a lot of bridge interfaces, particularly OVS bridges, appears to make it easier to trigger.

The traceback I see refers to a real interface name, so I think this may occur in circumstances other than an interface rename:

2022-08-16 10:23:30,009 - __init__.py[DEBUG]: Selected renderer 'netplan' from priority list: ['netplan', 'eni', 'sysconfig']
2022-08-16 10:23:30,009 - netplan.py[DEBUG]: V2 to V2 passthrough
2022-08-16 10:23:30,014 - util.py[DEBUG]: Writing to /etc/netplan/50-cloud-init.yaml - wb: [644] 4180 bytes
2022-08-16 10:23:30,014 - subp.py[DEBUG]: Running command ['netplan', 'generate'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,188 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/ovs-system'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,191 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/lo'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,195 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/bondM'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,200 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/enp129s0f0'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,204 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/eno1'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,207 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/br-bond0.2808'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,212 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/br-bond0.2806'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,215 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/br-bond0'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,220 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/br-bond0.2804'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,225 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/bond0'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,229 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/enp129s0f1'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,234 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/eno2'] with allowed return codes [0] (shell=False, capture=True)
2022-08-16 10:23:30,239 - subp.py[DEBUG]: Running command ['udevadm', 'test-builtin', 'net_setup_link', '/sys/class/net/br-bond0.2807'] with allowed return codes [0] (shell=False, c...

Frode Nordahl (fnordahl) wrote :

This rudimentary patch [0] works around the issue for me. For anyone stuck on this issue I put it in this PPA [1], which can be used by the MAAS Package repos feature to slip it into a deployment.

It does not help in the situation where `udevadm test-builtin net_setup_link` is called on an interface that genuinely does not exist, which is what the OP reported, but the two variants of the issue appear closely connected to me.

Should we expand the bug to cover both cases, or do you want a separate bug for attempting to call `udevadm test-builtin net_setup_link` on an interface that apparently is not completely initialized yet?
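For the second variant (the interface exists but is apparently not fully initialized yet), one possible mitigation is a bounded retry around the failing call. The sketch below is an illustration under that assumption only; it is not the contents of the patch linked as [0], and the name `run_with_retries`, the attempt count, and the delay are all hypothetical values.

```python
import subprocess
import time
from typing import Callable


def run_with_retries(
    cmd_fn: Callable[[], None],
    attempts: int = 5,       # assumed value, tune for the platform
    delay: float = 1.0,      # assumed pause between attempts, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call cmd_fn until it succeeds; return how many attempts were used."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            cmd_fn()
            return attempt
        except subprocess.CalledProcessError:
            # Give up only after the final attempt, re-raising the last error.
            if attempt == attempts:
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

A caller could wrap the per-interface command, for example `run_with_retries(lambda: subprocess.run(["udevadm", "test-builtin", "net_setup_link", "/sys/class/net/bond0"], check=True, capture_output=True))`. A retry only helps when the device eventually becomes ready; it does nothing for a name that has gone away for good.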

0: https://pastebin.ubuntu.com/p/pHqbwJwVPh/
1: https://launchpad.net/~fnordahl/+archive/ubuntu/lp1983516

James Falcon (falcojr) wrote :

Hey Frode, thanks for the patch, but we recently committed a (similar) fix: https://github.com/canonical/cloud-init/pull/1655

Changed in cloud-init:
status: Triaged → Fix Committed
Brett Holman (holmanb) wrote : Fixed in cloud-init version 22.3.

This bug is believed to be fixed in cloud-init version 22.3. If this is still a problem for you, please add a comment and set the status back to New.

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Frode Nordahl (fnordahl) wrote :

The 22.3 package does indeed appear to fix the issue, thank you for the quick turnaround!
