Auto Package Testing

cloud-worker-maintenance can hang

Bug #1988080 reported by Brian Murray on 2022-08-29

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Auto Package Testing	New	High	Unassigned

Bug Description

The cloud-worker-maintenance job appeared to be stuck with the following in journalctl:

Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3162016]: Error: Stopping the instance failed: websocket: close 1006 (abnormal closure): unexpected EOF
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: lxd-armhf-10.44.124.124:autopkgtest-lxd-cyynbq is old - deleting
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: Traceback (most recent call last):
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 59, in <module>
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: main()
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 55, in main
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: check_remote(remote)
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 40, in check_remote
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: subprocess.check_call(
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: raise CalledProcessError(retcode, cmd)
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: subprocess.CalledProcessError: Command '['lxc', 'delete', '--force', 'lxd-armhf-10.44.124.124:autopkgtest-lxd-cyynbq']' ret

To work around the failure, restart the service and check whether it works again. If that does not help, delete the broken container and reboot the host.

To stop it from happening again, Julian suggested adding "TimeoutSec=1h" to cloud-worker-maintenance as a minimum. Ideally, the delete call would have a 10-minute timeout, with a wrapper for subprocess that handles the timeout.
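A minimal sketch of what such a wrapper might look like, assuming the cleanup script keeps using subprocess.check_call as shown in the traceback. The names run_with_timeout and DELETE_TIMEOUT are hypothetical and do not come from the autopkgtest-cloud code. Whether to skip the container and continue, or abort the run, is a design choice left open here.

```python
import subprocess
import sys

# Hypothetical constant: the 10-minute limit suggested in this report.
DELETE_TIMEOUT = 600  # seconds


def run_with_timeout(cmd, timeout=DELETE_TIMEOUT):
    """Run cmd and return True on success.

    Returns False if the command exits non-zero or exceeds the timeout.
    On timeout, subprocess kills the child before raising TimeoutExpired,
    so a hung command cannot block the caller indefinitely.
    """
    try:
        subprocess.check_call(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        sys.stderr.write(
            "%s timed out after %s seconds\n" % (" ".join(cmd), timeout)
        )
        return False
    except subprocess.CalledProcessError as e:
        sys.stderr.write(
            "%s failed with exit code %s\n" % (" ".join(cmd), e.returncode)
        )
        return False
    return True
```

With something like this in place, the `lxc delete --force` call seen in the traceback could go through run_with_timeout(["lxc", "delete", "--force", name]) and log a failure instead of raising or hanging the whole maintenance job.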

Brian Murray (brian-murray) on 2022-08-29

Changed in auto-package-testing:
importance:	Undecided → High
