broken kernel causes eternal test retry loop

Bug #1630578 reported by Martin Pitt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Auto Package Testing
Fix Released
Medium
Unassigned
autopkgtest (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

The infra is currently testing linux-meta/4.4.0.41.43 which has some bug that it freezes the testbed (nothing actually runs any more and ssh hangs). The actual test timeout is being spotted correctly, but then it takes another two hours until it finally trips over some cleanup error/timeout -- this then gets counted as a tmpfail and the whole tests runs again:

⟫ tail -f /tmp/autopkgtest-work.s6p8ukov/out/log
06:30:45 DEBUG| [stdout] shm-sysv PASSED
06:30:55 DEBUG| [stdout] sigfd PASSED
06:31:05 DEBUG| [stdout] sigfpe PASSED
06:31:15 DEBUG| [stdout] sigpending PASSED
06:31:25 DEBUG| [stdout] sigq PASSED
06:31:35 DEBUG| [stdout] sigsegv PASSED
06:31:45 DEBUG| [stdout] sigsuspend PASSED
06:31:55 DEBUG| [stdout] sleep PASSED
06:32:05 DEBUG| [stdout] sock PASSED
06:32:16 DEBUG| [stdout] sockfd PASSED

autopkgtest [10:43:18]: ERROR: timed out on command "su -s /bin/bash ubuntu -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile >/dev/null 2>&1 || true; buildtree="/tmp/autopkgtest.3eTi1L/build.zNX/linux-4.4.0"; mkdir -p -m 1777 -- "/tmp/autopkgtest.3eTi1L/ubuntu-regression-suite-artifacts"; export AUTOPKGTEST_ARTIFACTS="/tmp/autopkgtest.3eTi1L/ubuntu-regression-suite-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.3eTi1L/autopkgtest_tmp"; export AUTOPKGTEST_TMP="/tmp/autopkgtest.3eTi1L/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=4; unset LANGUAGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; export 'ADT_TEST_TRIGGERS=linux-meta/4.4.0.41.43'; chmod +x /tmp/autopkgtest.3eTi1L/build.zNX/linux-4.4.0/debian/tests/ubuntu-regression-suite; touch /tmp/autopkgtest.3eTi1L/ubuntu-regression-suite-stdout /tmp/autopkgtest.3eTi1L/ubuntu-regression-suite-stderr; /tmp/autopkgtest.3eTi1L/build.zNX/linux-4.4.0/debian/tests/ubuntu-regression-suite 2> >(tee -a /tmp/autopkgtest.3eTi1L/ubuntu-regression-suite-stderr >&2) > >(tee -a /tmp/autopkgtest.3eTi1L/ubuntu-regression-suite-stdout); " (kind: test)
autopkgtest [10:43:18]: test ubuntu-regression-suite: -----------------------]
^[Unexpected cleanup error:
Traceback (most recent call last):
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 708, in mainloop
    command()
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 646, in command
    r = f(c, ce)
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 584, in cmd_copyup
    copyupdown(c, ce, True)
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 469, in copyupdown
    copyupdown_internal(ce[0], c[1:], upp)
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 570, in copyupdown_internal
    (wh, ['source', 'destination'][sdn], status))
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 84, in bomb
    raise Quit(12, progname + ": failure: %s" % m)
VirtSubproc.Quit: (12, '<VirtSubproc>: failure: copyup source failed, status 255')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 680, in error_cleanup
    cleanup()
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 671, in cleanup
    caller.hook_cleanup()
  File "/home/ubuntu/autopkgtest/virt/autopkgtest-virt-ssh", line 464, in hook_cleanup
    VirtSubproc.downtmp_remove()
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 291, in downtmp_remove
    auxverb + ['rm', '-rf', '--', downtmp])
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 144, in execute_timeout
    (out, err) = sp.communicate(instr)
  File "/usr/lib/python3.5/subprocess.py", line 1064, in communicate
    self.wait()
  File "/usr/lib/python3.5/subprocess.py", line 1658, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.5/subprocess.py", line 1608, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
  File "/home/ubuntu/autopkgtest/lib/VirtSubproc.py", line 64, in alarm_handler
    raise Timeout()
VirtSubproc.Timeout

while cleaning up because of another error:
<VirtSubproc>: failure: copyup source failed, status 255
autopkgtest [12:43:19]: ERROR: testbed failure: unexpected eof from the testbed

Tags: patch
Martin Pitt (pitti)
Changed in autopkgtest (Ubuntu):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Martin Pitt (pitti)
milestone: none → ubuntu-16.10
Revision history for this message
Martin Pitt (pitti) wrote :

 * Create a container, install ssh and your ssh key:

  lxc launch autopkgtest/ubuntu/yakkety/amd64 y1
  lxc exec y1 -- apt install -y openssh-server
  lxc file push -p --uid=1000 --gid=1000 ~/.ssh/id_rsa.pub y1/home/ubuntu/.ssh/authorized_keys

 * Get the IP from "lxc list" and verify that you can ssh in without a password.

 * Create this test:

   mkdir -p /tmp/t/debian/tests/
   echo 'Test-Command: pkill -e -STOP sshd; sleep 5' > /tmp/t/debian/tests/control

 * Run it:

   lxc exec y1 -- pkill -e -CONT sshd
   autopkgtest --timeout-test=1 /tmp/t -- ssh -l ubuntu -H $IP

   ($IP is the container's IP)

Revision history for this message
Martin Pitt (pitti) wrote :
Changed in autopkgtest (Ubuntu):
assignee: Martin Pitt (pitti) → nobody
status: In Progress → Fix Committed
assignee: nobody → Martin Pitt (pitti)
Revision history for this message
Martin Pitt (pitti) wrote :

It still seems to happen after that patch.

Changed in autopkgtest (Ubuntu):
assignee: Martin Pitt (pitti) → nobody
status: Fix Committed → In Progress
Revision history for this message
Martin Pitt (pitti) wrote :

On the infra it now looks like this:

ERROR - 02:45:43 DEBUG| [stdout]
ERROR - 03:19:00 ERROR| [stderr] Run parser regression tests ... ok
ERROR - 03:19:01 ERROR| [stderr] test_regression_testsuite (__main__.ApparmorTestsuites)
ERROR - 03:19:05 DEBUG| [stdout] preparing apparmor_2.10.95-4ubuntu5.dsc... done
ERROR - 03:19:05 DEBUG| [stdout]
ERROR - Killed
ERROR -
ERROR - Due to the above error(s), this script is unable to continue and is terminating.
ERROR -
Connection to 10.42.46.237 closed by remote host.
autopkgtest-virt-ssh [05:07:27]: ERROR: Removing temporary files on testbed timed out
autopkgtest [05:07:36]: ERROR: testbed failure: testbed auxverb failed with exit code 255

Revision history for this message
Martin Pitt (pitti) wrote :

This is harder to work around/catch, as in the new case the test does *not* time out, it just kills sshd (or something in the kernel that breaks ssh/networking). In general these are cases that we do want to treat as "tmpfail" and auto-restart, I don't want to treat an auxverb failure as failure in general.

Perhaps we need to introduce some kind of retry counter, but this would need to span at least half a day -- three tmpfails on the same worker in a row are usually a sign of a broken cloud or a broken testbed image, not a test failure. So perhaps some logic to check if other tests tmpfail on the same worker/cloud, and if not then call that test a failure.

This would all require state keeping, which we don't currently do (the only state is the AMQP queue contents).

Changed in autopkgtest (Ubuntu):
status: In Progress → Triaged
importance: High → Medium
affects: autopkgtest (Ubuntu) → auto-package-testing
Changed in auto-package-testing:
milestone: ubuntu-16.10 → none
Revision history for this message
Iain Lane (laney) wrote :

We're getting another instance of this now. In this case there's a kernel panic. To deal with that specific case, how about something like:

If it's a tmpfail, look at the console log. Remember if there's a kernel panic. If you see a kernel panic 3 times, treat that as a real failure.

In the other case, zesty/linux/i386, there seems to be some kind of problem with the testsuite OOMing and killing autopkgtest itself. Is there any way we can catch that? Look for kill -9 and check the console log? For now I blacklisted the package as it kept coming back and killing the cloud.

Revision history for this message
Martin Pitt (pitti) wrote :

> If you see a kernel panic 3 times, treat that as a real failure.

Sounds good!

> the testsuite OOMing and killing autopkgtest itself.

That is, it kills the command (i. e. debian/tests/blah) that autopkgtest runs over ssh? Shouldn't the console log also contain some trace of "OOM" which could be used as an indication that it's a real test failure?

Revision history for this message
Iain Lane (laney) wrote : Re: [Bug 1630578] Re: broken kernel causes eternal test retry loop

On Tue, Jan 17, 2017 at 07:42:34AM -0000, Martin Pitt wrote:
> > If you see a kernel panic 3 times, treat that as a real failure.
>
> Sounds good!
>
> > the testsuite OOMing and killing autopkgtest itself.
>
> That is, it kills the command (i. e. debian/tests/blah) that autopkgtest
> runs over ssh? Shouldn't the console log also contain some trace of
> "OOM" which could be used as an indication that it's a real test
> failure?

Yep, you can see things being killed by the OOM killer - do you think
that autopkgtest should mark that as a failure, or the worker? If the
worker, this could be an instance of the above too. Is there something
specific we can look for being killed?

Note that autopkgtest in this case:

> ERROR - 12:55:49 ERROR| [stderr] test_regression_testsuite (__main__.ApparmorTestsuites)
> ERROR - 12:55:54 DEBUG| [stdout] preparing apparmor_2.10.95-4ubuntu5.2.dsc... done
> ERROR - 12:55:54 DEBUG| [stdout]
> ERROR - Killed
> ERROR -
> ERROR - Due to the above error(s), this script is unable to continue and is terminating.
> ERROR -
> Connection to 10.220.44.208 closed by remote host.
> Exit request sent.
> autopkgtest [13:03:42]: ERROR: testbed failure: testbed auxverb failed with exit code 255

isn't outputting the console log. I don't think that debug_failure is
surfaced all the way up to lib/adt_testbed.py, where we see this failure
- do you have an idea about how to get it out? Add a
'hook_debug_failure'?

Cheers,

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Revision history for this message
Iain Lane (laney) wrote :

I've done the autopkgtest-cloud side of this:

  https://git.launchpad.net/~ubuntu-release/+git/autopkgtest-cloud/commit/?id=f9b3ee01e94d10d51a46702c6150e9c9618b6665

by overriding the return code to 4.

This already worked in production (was killing the bos01 cloud before)

  https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-zesty-canonical-kernel-team-unstable/zesty/ppc64el/l/linux/20170126_104055_6654d@/log.gz

Still need to do the autopkgtest side: outputting the console log in more situations.

Revision history for this message
Iain Lane (laney) wrote :

Attaching the following patch

commit 7f37957794d3b65e8d2e236d6af7dafcb2dc79e9
Author: Iain Lane <email address hidden>
Date: Fri Feb 24 12:41:32 2017 +0000

    Add a debug-fail hook and implement it for autopkgtest-virt-ssh

    At the minute, this is mainly so that the nova script can have its
    failure information (`nova console-log') propagated up to the output, so
    that in the case of kernel panics or other random failures we get useful
    output that the driver of autopkgtest (e.g. autopkgtest-cloud) can look
    at.

    https://bugs.launchpad.net/auto-package-testing/+bug/1630578

with this we should get console output after timeout. It's not as good as just receiving the console immediately would be, but nova doesn't provide for that and it's not common/serious enough to poll for IMHO.

Please review!

Revision history for this message
Iain Lane (laney) wrote :

friendly ping

apw wanted to unblacklist the kernel (suspecting the OOM bug to be fixed), so we did that and it turned out it wasn't.

Rather than re-blacklist, I cowboyed this patch live to test it on a production case and it seemed to work:

  https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-yakkety/yakkety/i386/l/linux/20170312_020902_83e4d@/log.gz

Revision history for this message
Martin Pitt (pitti) wrote :
Download full text (4.1 KiB)

Hello Iain,

sorry for the delay!

| From 7f37957794d3b65e8d2e236d6af7dafcb2dc79e9 Mon Sep 17 00:00:00 2001
| From: Iain Lane <email address hidden>
| Date: Fri, 24 Feb 2017 12:41:32 +0000
| Subject: [PATCH] Add a debug-fail hook and implement it for
| autopkgtest-virt-ssh
|
| At the minute, this is mainly so that the nova script can have its
| failure information (`nova console-log') propagated up to the output, so
| that in the case of kernel panics or other random failures we get useful
| output that the driver of autopkgtest (e.g. autopkgtest-cloud) can look
| at.
|
| https://bugs.launchpad.net/auto-package-testing/+bug/1630578
| ---
| lib/VirtSubproc.py | 6 ++++++
| lib/adt_testbed.py | 1 +
| virt/autopkgtest-virt-ssh | 36 ++++++++++++++++++++++++++++--------
| 3 files changed, 35 insertions(+), 8 deletions(-)
|
| diff --git a/lib/VirtSubproc.py b/lib/VirtSubproc.py
| index d76a861..e4845fa 100644
| --- a/lib/VirtSubproc.py
| +++ b/lib/VirtSubproc.py
| @@ -699,6 +699,12 @@ def prepare():
| os.kill(os.getpid(), sig)
| sethandlers(handler)
|
| +def cmd_log_debug_fail(c, ce):
| + cmdnumargs(c, ce)
| + try:
| + adtlog.info(caller.hook_debug_fail())
| + except AttributeError:
| + pass
|

Needs two blank lines between functions (tests/pycodestyle will fail
otherwise). Can you please rename this s/log/auxverb/ to more clearly say what
it's for?

| --- a/virt/autopkgtest-virt-ssh
| +++ b/virt/autopkgtest-virt-ssh
| @@ -163,7 +163,7 @@ def parse_args():
| capabilities += args.capability
|
|
| -def execute_setup_script(command, fail_ok=False):
| +def execute_setup_script(command, fail_ok=False, print_stderr=True):
| '''Run the --setup-script, if given.
|
| Arguments passed after -- to the main program are passed verbatim to the
| @@ -173,10 +173,13 @@ def execute_setup_script(command, fail_ok=False):
|
| :param command: Command to execute. The command must match a function in
| the ssh script
| - :param fail_ok: If True, failures will not cause bombing, and this function
| - returns the exit code instead.
| + :param fail_ok: If True, failures will not cause bombing
| + :return: A tuple (stdout, stderr, return code). stdout and stderr may be
| + None, for example if the script fails and fail_ok is True.

Nitpick: Other functions like VirtSubproc.execute_timeout() return (rc, out, err)
so for consistency it would be nice to use the same order.

| '''
| global sshconfig, args
| + out = None
| + err = None
|
| if args.setup_script:
| fpath = args.setup_script
| @@ -194,14 +197,23 @@ def execute_setup_script(command, fail_ok=False):
| cmd += sshconfig['extraopts'].split(' ')
|
| adtlog.debug('Executing setup script: %s' % ' '.join(cmd))
| - (status, out, err) = VirtSubproc.execute_timeout(
| - None, 1800, cmd, stdout=subprocess.PIPE)
| + (status, o, e) = VirtSubproc.execute_timeout(
| + None,
| + 1800,
| + cmd,
| + stdout=subprocess.PIPE,
| + stderr=subprocess....

Read more...

Revision history for this message
Iain Lane (laney) wrote :

Thanks for the review!

On Mon, Mar 13, 2017 at 10:50:20AM -0000, Martin Pitt wrote:
> Hello Iain,
>

To rearrange slightly

> | + if print_stderr:
> | + # Keep outputting the error on stderr as well as capturing it
> | + sys.stderr.write(e)
>
> This will cause delayed stderr messages when things are hanging and autopkgtest
> runs interactively, but setting up a complex "tee" pipelining here isn't worth
> the effort, so good enough.

Yeah, I know. I had tried some crazy experiments with tee() and splice()

  https://gist.github.com/iainlane/8b96022e3b98901b25cc5bcab6b43e02

...quite fun to work out, but in the end they didn't work because I
can't use asyncio from a non-main thread apparently (at least IIRC that
was the problem). I guess you could also do some refactoring and add an
async method to VirtSubproc that gives file handles back to read from,
but I figured just printing it explicitly would be acceptable.

> | + out = o | + err = e
>
> Why the extra o and e varibles? This looks confusing. Just directly use
> "(status, out, err) = " as before?

Yeah, just a mistake that I left in from the tee stuff - fixed.

> | def hook_open():
> | host_setup('open')
> | @@ -440,7 +460,7 @@ def hook_wait_reboot():
> | global sshcmd
> |
> | if args.setup_script:
> | - rc = execute_setup_script('wait-reboot', fail_ok=True)
> | + (out, err, rc) = execute_setup_script('wait-reboot', fail_ok=True)
>
> Nitpick: Use "rc = execute..()[0]" or use "_" as variable name for unused
> components, i. e. "(rc, _, _) = execute..()".

Sure.

>
> Did you run the tests/autopkgtest SshRunner{No,With}Script tests with
> that?

Yep, they worked. For some reason I couldn't run them with
AUTOPKGTEST_TEST_LXD set to a zesty container - some seemingly unrelated
problems with apt downloading that also happen on master, but xenial
works fine.

New patch attached. I didn't run this version end to end on an example
yet - just triggered, will reply again if it's broken.

Cheers,

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Revision history for this message
Martin Pitt (pitti) wrote :

Thanks! I pushed your patch to the autopkgtest git. Not sure if there is still an infra change to do (for the "Auto Package Testing" task).

Changed in autopkgtest (Ubuntu):
status: New → Fix Committed
tags: added: patch
Revision history for this message
Iain Lane (laney) wrote :

Nope, thank you - that's all done now AFAICS. Thanks for the reviews! :)

Changed in auto-package-testing:
status: Triaged → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package autopkgtest - 5.0

---------------
autopkgtest (5.0) unstable; urgency=medium

  [ Martin Pitt ]
  * Drop obsolete adt-* CLI
    autopkgtest 4.0 with its "autopkgtest" program has been around for a
    year, Debian's and Ubuntu's CI moved to that, and the Debian 9.0 release
    is behind us. Drop "adt-run" and the corresponding adt-virt-* and
    adt-build-* command line interfaces now.
    Add Breaks: for debci for versions that still relied on adt-*.
  * lxc, qemu, ssh: Shorten path of shared directory.
    This helps to reduce socket path lengths in tests.
    (See Debian #860554, LP: #1680577)
  * Drop obsolete ubuntu-touch-session setup script.
    This hasn't been maintained/tested for ages, and with the demise of
    Ubuntu Touch this is now completely obsolete. (Closes: #835836)
  * Avoid bare "except:" clauses
  * tests/autopkgtest_args: pyflakes/pycodestyle fixes
  * tests: Don't run LXC tests with --ephemeral.
    It's not the default and ephemeral containers tend to fail to start too
    often.
  * tests/testpkg: Update Standards-Version and debhelper level.
    Priority "extra" is deprecated, so move to "optional".
  * README.package-tests.rst: Clarify role of control file, add simple
    examples (Closes: #870654)
  * README.package-tests.rst: Update Testsuite: documentation.
    All supported Debian releases now have a new enough dpkg, and Ubuntu
    14.04 will not get new tests any more. (Closes: #876008)
  * Run apt-ftparchive for local debs on testbed instead of host.
    This is slightly more correct as it will produce the format that the
    testbed's apt expects (although that doesn't matter much). It also
    allows autopkgtest to be run on non-Debian hosts.
    As apt-utils is not essential, install it if necessary.
  * README.package-tests.rst: Clarify Testsuite: values
    `autopkgtest` is not any more the only defined value for Testsuite:, as
    the paragraph below already explains.
  * setup-commands/setup-testbed: Look at /etc/os-release for determining
    distribution name. Stop hardcoding "debian|ubuntu|kali" and get this
    information from os-release's ID field. (Closes: #868678)
  * debian/rules: Drop obsolete pysupport override.
    Not necessary any more since dh compat level 9.
  * Bump Standards-Version to 4.1.0. (No changes necessary.)

  [ Stéphane Graber ]
  * debian/tests/lxd: Update to support LXD network/storage API

  [ Steve Langasek ]
  * Fix release name parsing with arch-qualified sources.list
    (Closes: #870393)

  [ Antonio Terceiro ]
  * autopkgtest-build-lxc: Correctly clean cache for Debian containers

 -- Martin Pitt <email address hidden> Sun, 17 Sep 2017 16:50:21 +0200

Changed in autopkgtest (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.