Ubuntu CI has been flaky for a week

Bug #1829829 reported by Evgeny Vereshchagin
Affects: systemd (Ubuntu)
Status: Won't Fix
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

It was originally reported in https://github.com/systemd/systemd/pull/12583#issuecomment-492949206 5 days ago. Judging from the logs, VMs can't be rebooted there:
```
Ubuntu 18.04.2 LTS autopkgtest ttyS0

autopkgtest login:
---------------------------------------------------
------- nova show 91e76a78-d05c-412a-b383-55a26010ae69 (adt-bionic-amd64-systemd-upstream-20190516-051604) ------
+--------------------------------------+------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | euler |
| OS-EXT-SRV-ATTR:hypervisor_hostname | euler.lcy01.scalingstack |
| OS-EXT-SRV-ATTR:instance_name | instance-003d216a |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-05-16T07:00:42.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2019-05-16T07:00:33Z |
| flavor | autopkgtest (f878e70e-9991-46e0-ba02-8ea159a71656) |
| hostId | 1722c5f2face86c3fc9f338ae96835924721512372342f664e6941bd |
| id | 91e76a78-d05c-412a-b383-55a26010ae69 |
| image | adt/ubuntu-bionic-amd64-server-20190516.img (d00bf12c-467e-433f-a4f5-15720f13bff1) |
| key_name | testbed-juju-prod-ues-proposed-migration-machine-11 |
| metadata | {} |
| name | adt-bionic-amd64-systemd-upstream-20190516-051604 |
| net_ues_proposed_migration network | 10.42.40.13 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | <email address hidden> |
| status | ACTIVE |
| tenant_id | afaef86b96dd4828a1ed5ee395ea1421 |
| updated | 2019-05-16T07:00:42Z |
| user_id | 8524250971084851b3792a68fbc398dd |
+--------------------------------------+------------------------------------------------------------------------------------+
---------------------------------------------------
<VirtSubproc>: failure: Timed out on waiting for ssh connection
autopkgtest [07:07:45]: ERROR: testbed failure: cannot send to testbed: [Errno 32] Broken pipe
```

Though judging by https://github.com/systemd/systemd/pull/12626, it appears that sometimes the tests pass.

Revision history for this message
Martin Pitt (pitti) wrote :

Indeed the downstream tests fail like this as well: http://autopkgtest.ubuntu.com/packages/systemd/eoan/amd64

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

@pitti, would it be possible to temporarily skip the tests where VMs are rebooted, to reduce the blast radius, so to speak? I really don't want to turn Ubuntu CI off completely.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

Judging from https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-bionic-upstream-systemd-ci-systemd-ci/bionic/i386/s/systemd-upstream/20190521_185445_b37c3@/log.gz, the VMs also seem to be throwing kernel panics:
```
[ 0.894903] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 0.895921] EDD information not available.
[ 0.985183] isapnp: No Plug & Play device found
[ 0.986013] md: Waiting for all devices to be available before autodetect
[ 0.987166] md: If you don't use raid, use raid=noautodetect
[ 0.988330] md: Autodetecting RAID arrays.
[ 0.989038] md: autorun ...
[ 0.989525] md: ... autorun DONE.
[ 0.990139] VFS: Cannot open root device "UUID=961ce78b-c06e-4632-9aa0-56a8356b081b" or unknown-block(0,0): error -6
[ 0.991923] Please append a correct "root=" boot option; here are the available partitions:
[ 0.993406] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 0.994818] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.15.0-50-generic #54-Ubuntu
[ 0.996121] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[ 0.997704] Call Trace:
[ 0.998138] dump_stack+0x58/0x81
[ 0.998713] panic+0x94/0x1e6
[ 0.999233] mount_block_root+0x199/0x234
[ 1.000111] ? calibrate_delay+0x190/0x4e0
[ 1.000833] ? create_dev.constprop.7+0x3a/0x53
[ 1.001611] mount_root+0x33/0x35
[ 1.002182] prepare_namespace+0x11b/0x167
[ 1.002885] kernel_init_freeable+0x1b3/0x1c5
[ 1.003634] ? rest_init+0xa0/0xa0
[ 1.004238] kernel_init+0xd/0x100
[ 1.004830] ret_from_fork+0x1c/0x38
[ 1.005892] Kernel Offset: 0x10000000 from 0xc1000000 (relocation range: 0xc0000000-0xdf5fdfff)
[ 1.007376] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
```
And there is another one at https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-bionic-upstream-systemd-ci-systemd-ci/bionic/i386/s/systemd-upstream/20190521_100015_69099@/log.gz
```
[ 41.006515] IP: rollback_registered_many+0x141/0x390
[ 41.007872] *pdpt = 000000001da0b001 *pde = 0000000000000000
[ 41.009424] Oops: 0002 [#1] SMP PTI
[ 41.010396] Modules linked in: kvm_intel kvm irqbypass crc32_pclmul pcbc aesni_intel aes_i586 crypto_simd cryptd joydev input_leds serio_raw mac_hid qemu_fw_cfg xt_TCPMSS xt_tcpudp iptable_mangle ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq psmouse pata_acpi virtio_blk virtio_net i2c_piix4 floppy
[ 41.017754] CPU: 0 PID: 85 Comm: kworker/u2:2 Not tainted 4.15.0-50-generic #54-Ubuntu
[ 41.019883] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[ 41.022385] Workqueue: netns cleanup_net
[ 41.023459] EIP: rollback_registered_many+0x141/0x390
[ 41.024825] EFLAGS: 00010202 CPU: 0
[ 41.025783] EAX: 00000008 EBX: 00000008 ECX: 00000020 EDX: 00000000
[ 41.027477] ESI: dae31ec4 EDI: 00ffffff EBP: dae31eac ESP: dae31e88
[ 41.029162] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 41.030641] CR0: 80050033 CR2: dfc746c0 CR3: 1da28000 CR4: 001406f0
[ ...
```

Changed in systemd (Ubuntu):
status: New → Confirmed
importance: Undecided → High

Revision history for this message
Dan Streetman (ddstreet) wrote :

This seems to have 'fixed' itself (unless some fix was done without mention in this bug), as the bionic systemd runs on amd64 have not had this problem since June 3, and i386 since June 1.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

I turned on bionic-amd64 and bionic-i386 yesterday. VMs no longer fail to boot but apparently something else was broken on bionic-i386 while it was off: https://github.com/systemd/systemd/issues/12891.

In https://github.com/systemd/systemd/pull/12861#issuecomment-506025351 both bionic-amd64 and bionic-s390x failed due to something weird I don't quite understand. It would be great if someone who is interested in keeping it afloat could chime in there to help debug that.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

Anyway, as I said in https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1831296/comments/3, I don't think Ubuntu CI in its current form is suitable for CI, so I'll just turn it off as soon as it starts failing again. I'm afraid I don't have time to keep an eye on it and report whatever comes up there on bugs.launchpad.net.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

By the way, judging by http://autopkgtest.ubuntu.com/running#pkg-systemd-upstream, it seems Ubuntu CI keeps running the tests for PR 12618, which was merged about a month ago.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> apparently something else was broken on bionic-i386 while it was off:
> https://github.com/systemd/systemd/issues/12891.

looks like this has been fixed already upstream

> In https://github.com/systemd/systemd/pull/12861#issuecomment-506025351 both
> bionic-amd64 and bionic-s390x failed due to something weird I don't quite understand.
> It would be great if someone who is interested in keeping it afloat could chime in
> there to help to debug that.

I'll take a look at those failures.

> it seems Ubuntu CI keeps running the tests for PR 12618

looks like someone stopped that; it's only running tests for 12888, 12897, and 12899 currently.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

> looks like this has been fixed already upstream

Yes. I merged the PR (created by @yuwata) fixing that this morning.

> looks like someone stopped that; it's only running tests for 12888, 12897, and 12899 currently.

In https://github.com/systemd/systemd/issues/12891#issuecomment-506093934 I asked Iain Lane to stop that, which I think explains why it's not there any more.

> I'll take a look at those failures.

Those failures most likely have something to do with that PR and probably should be fixed upstream. I think it would be better to somehow address https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1824864. arm64 has been off since December 2018 due to that issue.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> I think it would be better to somehow address
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1824864.
> arm64 has been off since December 2018 due to that issue.

yep, the ubuntu bionic kernel will be patched to increase the arm64 log buffer size:
https://lists.ubuntu.com/archives/kernel-team/2019-June/101613.html
however the kernel release cycle is long, so that won't happen until end of July (earliest):
https://lists.ubuntu.com/archives/kernel-sru-announce/2019-June/000151.html

but the test case in debian-experimental has an MR that should stop CI failing once it's merged:
https://salsa.debian.org/systemd-team/systemd/merge_requests/34

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

Good to know! Thank you!

Would it also be possible to mark the "upstream" test (https://salsa.debian.org/systemd-team/systemd/blob/master/debian/tests/control#L113) "flaky" or turn it off altogether? We run it on CentOS and Arch anyway, so it should be safe. On Ubuntu CI it's particularly flaky, and I don't think it makes much sense to keep running it there until things like timeouts are adjusted to the VMs on the Ubuntu CI infrastructure. I think at this point it would make sense to make sure the Ubuntu-specific tests are in good shape.
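For reference, autopkgtest has a "flaky" restriction that makes a test's failure non-fatal for the overall result, so marking it should be a one-line change in the test stanza. A minimal sketch of what that might look like in debian/tests/control (the Depends line and the other restrictions are illustrative, not copied from the actual packaging):
```
# Hypothetical stanza; the "flaky" restriction is the only relevant part.
Tests: upstream
Depends: systemd-tests, qemu-system-x86
Restrictions: needs-root, isolation-machine, allow-stderr, flaky
```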

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

In the meantime, I brought arm64 back almost as soon as https://salsa.debian.org/systemd-team/systemd/merge_requests/34 was merged. It looks promising except that the "upstream" test is flaky there as well. It'd be great to turn it off on Ubuntu CI.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Would it also be possible to mark the "upstream" test "flaky" or turn it off altogether

I'll have to defer to @xnox on that... Personally, I would prefer to leave it on and try to fix whatever is causing the test(s) to be flaky, and/or disable only individual upstream tests instead of marking all of them flaky.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

I agree that ideally the test should be fixed, but it's been flaky for a couple of years and I haven't noticed anyone willing to try to figure it out. I think it's time to admit nobody cares. Even if I'm totally wrong and someone is interested in getting it to work on Ubuntu CI, it should be possible to turn it off globally so as not to annoy contributors with known issues they shouldn't even have to care about, and to keep trying to reproduce it "locally" (as we do on CentOS CI, Travis CI, Azure Pipelines, and so on).

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Even if I'm totally wrong and someone is interested in getting it to work on Ubuntu CI

I'll work on this; I would like it to work correctly.

> it should be possible to turn it off globally so as not to annoy contributors with known issues

I agree about doing *something* for flaky tests - it makes the entire test suite unreliable if you can't trust the end result.

There is an existing upstream test blacklist:
https://salsa.debian.org/systemd-team/systemd/blob/master/debian/tests/upstream#L11

how about we keep that up to date with any upstream tests that are flaky in Ubuntu CI?
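For anyone unfamiliar with it, the gist of that blacklist is just a list of test directories the runner skips. A rough sketch of the idea (the variable name, test names, and make invocation are illustrative, not the actual debian/tests/upstream contents):
```
#!/bin/sh
# Sketch of a skip list in an upstream-test runner; illustrative only.
BLACKLIST="TEST-15-DROPIN TEST-22-TMPFILES"

for dir in test/TEST-??-*; do
    name=$(basename "$dir")
    case " $BLACKLIST " in
        *" $name "*)
            echo "skipping $name (known flaky on Ubuntu CI)"
            continue
            ;;
    esac
    make -C "$dir" clean setup run || exit 1
done
```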

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

Regarding the blacklist, we started to work on it in https://github.com/systemd/systemd/issues/11195 (which was initially about merging PRs where bugs were caught by Travis CI but ignored, because everybody was used to Ubuntu CI failing more often than not) and ended up with a list of tests I suggested, because I more or less knew what was flaky at the time and kept monitoring Ubuntu CI for a couple of days. As far as I can tell, it has been abandoned since then. If you're ready to keep the list up to date by visiting PRs where Ubuntu CI fails, I'm of course all for it.

Other than that, right now the upstream test times out on i386, which has been on for just two days: https://github.com/systemd/systemd/pull/12901#issuecomment-506646998. I'm not sure how turning off some tests will help there (because generally I don't understand why it fails).

Plus, in https://github.com/systemd/systemd/pull/12861#issuecomment-506556062 the test is completely broken on Ubuntu CI and I have no idea why.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

Just to clarify, I think it would be better to turn off the test globally because we know it's flaky and unstable, and we generally never roll out globally flaky tests, so as not to annoy contributors (I'm not sure why Ubuntu CI should be different from any other CI system we use upstream). It should then be possible to run the test over and over again somewhere to see what fails, updating the list along the way.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> right now the upstream test timeouts on i386

ah, ok - so I did look into this, or at least some failures really similar to this, a while back, for bug 1831468. As far as I could tell, the problem in those cases was that the test was starting a service and waiting for notification of its completion; but for some reason, dbus seemed to be crashing/dying/exiting, and the notification of the completed service never actually reached the test suite, so it just waited forever until timeout. I do see in the logs for the test you referenced:

```
Unable to add match type='signal',sender='org.freedesktop.DBus',path='/org/freedesktop/DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0=':1.7', failing connection: Method call timed out
Bus bus-api-system: changing state RUNNING → CLOSING
Unable to add match type='signal',sender='org.freedesktop.DBus',path='/org/freedesktop/DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0=':1.7', failing connection: Connection terminated
Bus bus-api-system: changing state CLOSING → CLOSED
systemd-rfkill.socket: Changed dead -> listening
systemd-rfkill.socket: Job 364 systemd-rfkill.socket/start finished, result=done
[ OK ] Listening on Load/Save RF …itch Status /dev/rfkill Watch.
Bus bus-api-system: changing state UNSET → OPENING
Bus bus-api-system: changing state OPENING → AUTHENTICATING
```

that's quite similar to what I saw before. I never had a chance to get farther than that, but I can try to look more into it soon. Let me know if that sounds familiar or if you have any debugging suggestions.
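One thing I'm considering (just an idea, I haven't actually tried it in the failing environment) is watching the bus inside the test image while it runs, to confirm whether dbus-daemon really is going away:
```
# Run inside the environment where the failing service runs: watch
# NameOwnerChanged on the system bus; if dbus-daemon itself dies, the
# monitor connection drops too, which is itself a useful signal.
dbus-monitor --system \
  "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged'" \
  > /tmp/nameownerchanged.log 2>&1 &

# Afterwards, check the journal for dbus restarts during the run:
journalctl -u dbus.service --no-pager -o short-monotonic
```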

> Plus, in https://github.com/systemd/systemd/pull/12861#issuecomment-506556062 the
> test is completely broken on Ubuntu CI and I have no idea why.

ugh, yeah, TEST-15 and TEST-22 aren't giving any clues at all in the logs as to why they failed.

> As far as I can tell, it has been abandoned since then. If you're ready to keep
> the list up to date by visiting PRs where Ubuntu CI fails I'm of course all for it.

Let's add TEST-15 and TEST-22 for now, and I'll see if I can both keep up with other flaky tests and also try to figure out why they're being flaky in Ubuntu CI.

Revision history for this message
Dan Streetman (ddstreet) wrote :

Opened MR to blacklist TEST-15 and TEST-22:

https://salsa.debian.org/systemd-team/systemd/merge_requests/35

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

> Opened MR to blacklist TEST-15 and TEST-22

That PR on GitHub hasn't been merged yet so I think it's too soon to turn those tests off.

> ah, ok - so I did look into this, or at least some failures really similar to this, a while back, for bug 1831468.

As far as I can tell, several tests failed there because QEMU_TIMEOUT kicked in. It's different from what is going on on i386 on GitHub where the whole upstream test fails due to the global timeout.

> Let me know if that sounds familiar or if you have any debugging suggestions

Depending on where the test is run, it could be that 300s isn't enough for it to finish. Given that i386 isn't a particularly popular platform, I assume it's run in VMs, where the upstream test in turn launches its own VMs with QEMU with no acceleration whatsoever. Could you try bumping up QEMU_TIMEOUT to see whether it helps?
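QEMU_TIMEOUT is just an environment variable the upstream test scripts read, so bumping it should be a one-line change wherever the tests are invoked. Roughly (the target names and value here are illustrative; the invocation on Ubuntu CI may differ):
```
# Give each QEMU-based test 20 minutes instead of the default before it
# is treated as hung; the variable is read by the upstream test scripts.
sudo make -C test/TEST-01-BASIC clean setup run QEMU_TIMEOUT=1200
```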

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

Regarding i386, I downloaded the log and took a look at what failed there. It turns out all the tests passed except for TEST-34-DYNAMICUSERMIGRATE, where the global timeout kicked in. Apparently, judging from https://salsa.debian.org/systemd-team/systemd/blob/master/debian/tests/upstream, on Ubuntu CI the tests are still run one by one, whereas on CentOS CI we started running them in parallel back in March (https://github.com/systemd/systemd-centos-ci/pull/78), so it's no wonder the whole test times out sporadically. I don't expect the script in the Debian package to be fixed any time soon, so could anyone either bump up the timeout or turn the test off?
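For comparison, the parallelization boils down to launching the per-test targets concurrently instead of in a loop. A rough sketch of the idea (not the actual systemd-centos-ci code; the log paths and concurrency level are made up):
```
# Run the integration tests up to four at a time; each test boots its own
# QEMU/nspawn instance, so the main thing to keep separate is the logs.
ls -d test/TEST-??-* | xargs -P 4 -I {} \
    sh -c 'make -C "{}" clean setup run > "/tmp/$(basename {}).log" 2>&1'
```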

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

By the way, @ddstreet, would it be OK to mention that you maintain Ubuntu CI at https://www.freedesktop.org/wiki/Software/systemd/autopkgtest/? Currently it's almost impossible to figure out who is responsible for it. At some point I assumed it was @xnox, but that doesn't seem to be the case, so I don't even try to cc @xnox any more.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> That PR on GitHub hasn't been merged yet so I think it's too soon to turn those tests off.

ack - I closed the MR to debian to blacklist them.

> Apparently on Ubuntu CI the tests are still run one by one

yep. I'll look at how CentOS runs them to see if we can run them in parallel on Ubuntu as well; that should significantly speed up the overall test time.

> could anyone either bump up the timeout or turn the test off?

The overall timeout is set from autopkgtest.ubuntu.com, so the test itself can't request a longer timeout IIUC, but it could be marked as "long running":
https://git.launchpad.net/autopkgtest-cloud/tree/worker-config-production/worker.conf#n20
The default timeout is 10,000s, while long-running tests get 40,000s. However, that's over 11 hours; I'd much rather work on getting test runs under the default 10,000s (~3 hours).

Let's see if we can get the 'upstream' tests parallelized, I think that would avoid ever exceeding the main test timeout, except for actual hung test cases.

> would it be OK to mention that you maintain Ubuntu CI at
> https://www.freedesktop.org/wiki/Software/systemd/autopkgtest

it looks like @laney is listed there, which is probably appropriate (he has admin access to the test systems, while I don't), but if you want to add me as a contact for systemd Ubuntu-CI that's fine with me.

Revision history for this message
Dan Streetman (ddstreet) wrote :

Looking at the latest 20 i386 tests, I see failing tests for 12888, 12884, 12912, 12918, and 12919.

PR 12884 looks like it's adding a new test, TEST-35, so I'll assume that test just needs more work.

PR 12888 failed once, but there are 3 later tests that all pass, so I assume it's ok now.

PR 12919 appears to have failed while building upstream systemd, so I assume that's a problem with the PR.

PR 12912 passed once, but then was run again and failed due to TEST-34 timing out.
PR 12918 failed also with TEST-34 timing out.

However, these aren't failing because they just didn't have enough time to run. They are failing because they run "systemd-run --wait", but the --wait never completes, even though the service finishes. This is what I was talking about before when I mentioned that dbus (bus-api-system) appears to be crashing/restarting. Note in the failing test's upstream-stdout:

```
$ grep -E '(bus-api-system|run-u7)' upstream-stdout
...
run-u7.service: Installed new job run-u7.service/start as 322
run-u7.service: Enqueued job run-u7.service/start as 322
Bus bus-api-system: changing state RUNNING → CLOSING
Bus bus-api-system: changing state CLOSING → CLOSED
Bus bus-api-system: changing state UNSET → OPENING
Bus bus-api-system: changing state OPENING → AUTHENTICATING
run-u7.service: Failed to set 'memory.min' attribute on '/system.slice/run-u7.service' to '0': No such file or directory
run-u7.service: Failed to set 'memory.swap.max' attribute on '/system.slice/run-u7.service' to 'max': No such file or directory
run-u7.service: Failed to set 'memory.oom.group' attribute on '/system.slice/run-u7.service' to '0': No such file or directory
run-u7.service: Passing 0 fds to service
run-u7.service: About to execute: /usr/bin/test -f /var/lib/zzz/test
run-u7.service: Forked /usr/bin/test as 274
run-u7.service: Changed dead -> runningFound pre-existing private StateDirectory= directory /var/lib/private/zzz, migrating to /var/lib/zzz.
run-u7.service: Job 322 run-u7.service/start finished, result=done
run-u7.service: Executing: /usr/bin/test -f /var/lib/zzz/testsystemd-journald.service: Got notification message from PID 210 (WATCHDOG=1)
run-u7.service: Child 274 belongs to run-u7.service.
run-u7.service: Main process exited, code=exited, status=0/SUCCESS
run-u7.service: Succeeded.
run-u7.service: Changed running -> dead
run-u7.service: Consumed 59ms CPU time.
run-u7.service: Collecting.
Bus bus-api-system: changing state AUTHENTICATING → HELLO
Bus bus-api-system: changing state HELLO → RUNNING
```

It looks like bus-api-system is down when the service completes, and the notification that it finished never reaches the systemd-run --wait caller (the test suite), so the test suite just hangs doing nothing until the test timeout.
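In other words, what the test does boils down to something like this (same binary and path as in the log above; the exact unit properties TEST-34 sets are omitted here). The hang isn't in the service itself but in the caller waiting for the completion notification:
```
# Start a transient unit and block until its result comes back over the
# bus; if bus-api-system reconnects in between, the completion message
# can be lost and this call sits there until the test-level timeout.
systemd-run --wait -p StateDirectory=zzz /usr/bin/test -f /var/lib/zzz/test
```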

Revision history for this message
Dan Streetman (ddstreet) wrote :

From https://github.com/systemd/systemd/issues/12268#issuecomment-506953576 it looks like TEST-30 is actually failing due to a service start timeout (of 25 seconds); this test should be added to the blacklist until we can get it passing reliably.

Revision history for this message
Evgeny Vereshchagin (evvers) wrote :

> it looks like @laney is listed there, which is probably appropriate (he has admin access to the test systems, while I don't)

My understanding is that @laney can help with the infrastructure where the tests are run, but most of what happens on Ubuntu CI has nothing to do with the infrastructure. I'd say the infrastructure is pretty stable, except that (if I recall correctly) it failed to report the statuses of PRs for about a month, which was fixed by @laney in https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1817344.

Regarding TEST-34 on i386, could you open an issue on GitHub? It could probably be kept on the list as well until it's fixed.

Revision history for this message
Dan Streetman (ddstreet) wrote :

opened https://github.com/systemd/systemd/issues/12932

so TEST-30 and TEST-34 need to go on the blacklist, unless we can figure out how to get them less flaky soon.

Revision history for this message
Dan Streetman (ddstreet) wrote :

@evvers I think Ubuntu CI is at least mostly stable for now, and hopefully we can get the tests un-blacklisted soon, but since we're doing all that work in GitHub and Debian bugs, I don't think we need to keep this bug open, so I'll close it; let me know if you'd prefer to keep it open.

Changed in systemd (Ubuntu):
status: Confirmed → Won't Fix