test results get lost and stay "in progress" forever when multiple tests finish at the exact same time

Bug #1571353 reported by Timo Jyrinki
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Auto Package Testing
Fix Released
Medium
Martin Pitt
Bileto
Invalid
Undecided
Unassigned

Bug Description

It would be nice to be able to retry also "Test in progress" autopkgtests. Every now and then an autopkgtest is somehow hanged and not disappearing from the excuses page even after extended amount of time, but it also can't be retried because the status about its failing is not reaching the right places.

(currently ubuntuone-credentials amd64 at https://requests.ci-train.ubuntu.com/static/britney/xenial/landing-050/excuses.html and last week another test on another arch)

Revision history for this message
Martin Pitt (pitti) wrote :

Sorry, but I don't think this is the right way. Many tests are legitimately "running" for a long time because the queues are just full, and I really don't want folks to make this even worse by sending even more retries.

Let's rather find out what causes those "lost" test requests. The next time you encounter one, can you please point me to it? https://requests.ci-train.ubuntu.com/static/britney/xenial/landing-050/excuses.html is fine right now, so that got sorted out already.

Changed in auto-package-testing:
status: New → Incomplete
summary: - Allow retry of "Test in progress" autopkgtests
+ test requests sometimes get lost and stay "in progress" forever
Revision history for this message
Martin Pitt (pitti) wrote : Re: test requests sometimes get lost and stay "in progress" forever
Download full text (3.7 KiB)

https://requests.ci-train.ubuntu.com/static/britney/vivid/landing-010/excuses.html has another current case:

qtmultimedia-opensource-src (5.4.1-1ubuntu19~overlay3~vivid1~test4 to 5.4.1-1ubuntu19~overlay4)
[...]
    autopkgtest for pyqt5 5.4.1+dfsg-2: amd64: Test in progress, armhf: Pass [artifacts], i386: Pass [artifacts], ppc64el: Pass [artifacts]

I checked the worker logs, and the latest request for this came in on April 13 and was run successfully:

2016-04-13 12:33:42,481 INFO:worker Received request for package pyqt5 on vivid/amd64; params: {u'ppas': [u'ci-train-ppa-service/stable-phone-overlay', u'ci-train-ppa-service/landing-010'], u'triggers': [u'qtmultimedia-opensource-src/5.4.1-1ubuntu19~overlay4']}
2016-04-13 12:33:42,710 INFO:worker Running /home/ubuntu/autopkgtest/run-from-checkout --output-dir /tmp/adt-work.2LBsuD/out --timeout-copy=6000 --setup-commands if grep -q trusty /etc/lsb-release; then apt-get install -y build-essential; fi; [ ! -d /usr/share/doc/lxd ] || $(which eatmydata || true) apt-get purge --auto-remove -y lxd; apt-get purge --auto-remove -y ubuntu-snappy-cli || true --setup-commands /home/ubuntu/autopkgtest/setup-commands/setup-testbed --setup-commands apt-key adv --keyserver keyserver.ubuntu.com --recv-key B4C396DF0DD9A0ED00B21D2885504128ECF1204C --setup-commands REL=$(sed -rn "/^(deb|deb-src) .*(ubuntu.com|ftpmaster)/ { s/^[^ ]+ +(\[.*\] *)?[^ ]* +([^ -]+) +.*$/\2/p; q }" /etc/apt/sources.list); echo "deb http://ppa.launchpad.net/ci-train-ppa-service/stable-phone-overlay/ubuntu $REL main" > /etc/apt/sources.list.d/autopkgtest-ci-train-ppa-service-stable-phone-overlay.list; echo "deb-src http://ppa.launchpad.net/ci-train-ppa-service/stable-phone-overlay/ubuntu $REL main" >> /etc/apt/sources.list.d/autopkgtest-ci-train-ppa-service-stable-phone-overlay.list; --setup-commands apt-key adv --keyserver keyserver.ubuntu.com --recv-key B4C396DF0DD9A0ED00B21D2885504128ECF1204C --setup-commands REL=$(sed -rn "/^(deb|deb-src) .*(ubuntu.com|ftpmaster)/ { s/^[^ ]+ +(\[.*\] *)?[^ ]* +([^ -]+) +.*$/\2/p; q }" /etc/apt/sources.list); echo "deb http://ppa.launchpad.net/ci-train-ppa-service/landing-010/ubuntu $REL main" > /etc/apt/sources.list.d/autopkgtest-ci-train-ppa-service-landing-010.list; echo "deb-src http://ppa.launchpad.net/ci-train-ppa-service/landing-010/ubuntu $REL main" >> /etc/apt/sources.list.d/autopkgtest-ci-train-ppa-service-landing-010.list; --apt-upgrade pyqt5 --env=ADT_TEST_TRIGGERS=qtmultimedia-opensource-src/5.4.1-1ubuntu19~overlay4 --- ssh -s /home/ubuntu/autopkgtest/ssh-setup/nova -- --flavor autopkgtest --name adt-vivid-amd64-pyqt5-20160413-123342 --image ubuntu/ubuntu-vivid-.*-amd64-server --keyname testbed-juju-prod-ues-proposed-migration-machine-3 --net-id=net_ues_proposed_migration -e 'http_proxy=http://squid.internal:3128' -e 'https_proxy=http://squid.internal:3128' -e 'no_proxy=127.0.0.1,127.0.1.1,localhost,localdomain,novalocal,internal,archive.ubuntu.com,security.ubuntu.com,ddebs.ubuntu.com,ppa.launchpad.net' --mirror=http://ftpmaster.internal/ubuntu
2016-04-13 12:40:46,289 INFO:worker adt-run exited with code 0
2016-04-13 12:40:46,315 INFO:worker Putting results into swift autopkgtest-vivid-...

Read more...

Revision history for this message
Martin Pitt (pitti) wrote :

I wanted to take a look at the britney logs for this. In https://requests.ci-train.ubuntu.com/static/britney/log_20160419_055002.txt I see

2016-04-19T05:50:17+0000 Building britney indexes...
‘/tmp/britney_data/landing-010-vivid/Hints’ -> ‘/tmp/britney_hints’
make: `/tmp/britney_data/landing-010-vivid.conf' is up to date.
2016-04-19T05:50:17+0000 /var/lib/britney/britney.py --config /tmp/britney_data/landing-010-vivid.conf --series vivid
W: [Tue Apr 19 05:50:17 2016] - WARNING: No block hints at all, not even udeb ones!
E: [Tue Apr 19 05:52:04 2016] - https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-vivid-ci-train-ppa-service-landing-010/vivid/armhf/u/unity8/20160129_071038@/result.tar is damaged, ignoring: "filename 'testinfo.json' not found"
2016-04-19T05:52:05+0000 /var/lib/bileto/britney/fetch-indexes done.

which doesn't say much. All other britney logs are similarly short.

Robert, can you please enable -v for britney again for the time being? Or do a manual britney run on this silo with -v and get me the log?

Changed in auto-package-testing:
assignee: nobody → Martin Pitt (pitti)
Revision history for this message
Martin Pitt (pitti) wrote :

Robert enabled britney verbosity for the time being, so looking at https://requests.ci-train.ubuntu.com/static/britney/log_20160419_095502.txt

The log shows

I: [Tue Apr 19 09:56:55 2016] - Read pending requested tests from /tmp/britney_data/landing-010-vivid/autopkgtest/pending.json: {'qtmultimedia-opensource-src/5.4.1-1ubuntu19~overlay4': {'pyqt5': ['amd64']}}
[...]
I: [Tue Apr 19 09:56:55 2016] - Fetched test result for pyqt5/5.4.1+dfsg-2/amd64 20160413_124046@ (triggers: ['qtmultimedia-opensource-src-gles/5.4.1-1ubuntu19~overlay4']): pass
I: [Tue Apr 19 09:56:55 2016] - -> does not match any pending request for pyqt5/amd64
I: [Tue Apr 19 09:56:55 2016] - Test pyqt5/amd64 for qtmultimedia-opensource-src/5.4.1-1ubuntu19~overlay4 is already pending, not queueing

which is actually right -- the aforementioned test log [1] is actually for -gles, not for the "main" qt.

So on one worker we got the test request for the main Qt trigger:

2016-04-13 12:33:42,481 INFO:worker Received request for package pyqt5 on vivid/amd64; params: {u'ppas': [u'ci-train-ppa-service/stable-phone-overlay', u'ci-train-ppa-service/landing-010'], u'triggers': [u'qtmultimedia-opensource-src/5.4.1-1ubuntu19~overlay4']}
[...]
2016-04-13 12:40:46,315 INFO:worker Putting results into swift autopkgtest-vivid-ci-train-ppa-service-landing-010 vivid/amd64/p/pyqt5/20160413_124046@

On another worker we got the request for the -gles trigger:

2016-04-13 12:33:44,122 INFO:worker Received request for package pyqt5 on vivid/amd64; params: {u'ppas': [u'ci-train-ppa-service/stable-phone-overlay', u'ci-train-ppa-service/landing-010'], u'triggers': [u'qtmultimedia-opensource-src-gles/5.4.1-1ubuntu19~overlay4']}
[...]
2016-04-13 12:40:46,807 INFO:worker adt-run exited with code 0
2016-04-13 12:40:46,830 INFO:worker Putting results into swift autopkgtest-vivid-ci-train-ppa-service-landing-010 vivid/amd64/p/pyqt5/20160413_124046@

which explains what happens here: both tests happened to finish in the exact same second, but are for different triggers. Thus we get only one log in swift.

We need to further disambiguate the path in swift to avoid this collision. We originally had this in the spec (https://wiki.debian.org/debci/DistributedSpec?action=diff&rev2=33&rev1=32) but later removed it as it seemed exceptionally unlikely to run the same package test twice at the same second. This is still true for Debian's test machinery and it used to be true before we moved to per-trigger tests, but with the latter it can happen.

[1] https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-vivid-ci-train-ppa-service-landing-010/vivid/amd64/p/pyqt5/20160413_124046@/log.gz

Changed in auto-package-testing:
status: Incomplete → Triaged
importance: Undecided → Medium
summary: - test requests sometimes get lost and stay "in progress" forever
+ test results get lost and stay "in progress" forever when multiple tests
+ finish at the exact same time
Revision history for this message
Martin Pitt (pitti) wrote :

Not a bileto bug. Robert, you can disable the verbosity again. Thanks!

Changed in bileto:
status: New → Invalid
Martin Pitt (pitti)
Changed in auto-package-testing:
milestone: none → ubuntu-16.10
Martin Pitt (pitti)
Changed in auto-package-testing:
status: Triaged → In Progress
Revision history for this message
Martin Pitt (pitti) wrote :

I originally was thinking about appending the worker host name or IP, or some abbreviation of that. But these would be rather long, still potentially colliding on abbreviation, and still would not help in the cloud worker case where 80 workers are running on one machine.

So I think a better and more elegant solution is to calculate a hash over the JSON test parameters (triggers, PPAs, etc.) and take the first few digits of that. Then the exact same test fired twice gets a reproducible stamp, and it encodes what we actually want (disambiguate tests with different parameters).

Revision history for this message
Martin Pitt (pitti) wrote :
Changed in auto-package-testing:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.