Ubuntu
linux package

testsuite fails under qemu (SIGILL) works fine on real hw [missing getrandom 384 syscall]

Bug #1707409 reported by Gianfranco Costamagna on 2017-07-29

This bug affects 1 person

Affects	Status	Importance	Assigned to
launchpad-buildd	New	Undecided	Unassigned
linux (Ubuntu)	Invalid	High	Unassigned
qemu (Ubuntu)	Invalid	High	Unassigned

Bug Description

Hello, after spending a lot of time debugging the notmuch failure on armhf, we came to a conclusion:

it started to fail when the infra moved from a 3.2 kernel to a 4.2 one and into a qemu/kvm environment.

the latest successful build is here created on 2016-03-13 https://launchpad.net/ubuntu/+source/notmuch/0.21-3ubuntu2/+build/9344826

and the first bad is this one: Started on 2016-08-31 https://launchpad.net/ubuntu/+source/notmuch/0.22.1-2ubuntu1/+build/10600002

Kernel version: Linux kishi10 3.2.0-98-highbank #138-Ubuntu SMP PREEMPT Mon Jan 11 13:24:41 UTC 2016 armv7l
Kernel version: Linux bos01-arm64-024 4.2.0-42-generic #49-Ubuntu SMP Tue Jun 28 21:24:20 UTC 2016 aarch64

so, in the first case armhf was run on top of an armv7 kernel, and in the other case on top of an arm64 one.
This might not even be a regression in qemu/kvm, but rather a change in the buildd system that exposed a new bug.

A xenial build failed as well, so I presume this is not a toolchain regression (also because it works fine on real HW), but a qemu/linux bug.

I ran the test under strace/valgrind. I can't do much more testing, but I hope the logs can be useful for you:
https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/locutusofborg-ppa/+build/13169431
https://launchpadlibrarian.net/331134898/buildlog_ubuntu-artful-armhf.notmuch_0.25-2ubuntu1_BUILDING.txt.gz

You can see the strace/valgrind outputs between "BEGIN" and "END" keywords

Gianfranco Costamagna (costamagnagianfranco) wrote on 2017-07-29:

I'm assigning launchpad, maybe somebody can try notmuch/armhf with an updated qemu or a downgraded kernel :)

Changed in linux (Ubuntu):
importance:	Undecided → High
Changed in qemu (Ubuntu):
importance:	Undecided → High
description:	updated

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2017-07-29: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1707409

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete
tags:	added: precise

Mattia Rizzolo (mapreri) wrote on 2017-07-29: Re: testsuite fails under qemu (SIGILL) works fine on real hw

Yes, the nature of the bug makes it impossible to do as requested by the bot…

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Gianfranco Costamagna (costamagnagianfranco) on 2017-07-29

tags:

added: bot-stop-nagging xenial
removed: precise

Colin Watson (cjwatson) wrote on 2017-07-31:

I think it's very unlikely indeed that this is a Launchpad bug, and we're not here to go on fishing expeditions for you testing random things. qemu/kernel developers are generally better placed to be able to bisect this sort of thing. Of course you can reopen this if there turns out to be some evidence that this is in fact a Launchpad bug.

affects:	launchpad → launchpad-buildd
Changed in launchpad-buildd:
status:	New → Invalid

Christian Ehrhardt  (paelzer) wrote on 2017-08-08:

TL;DR:
- you can use the pbuilder + static qemu setup to debug
- qemu/libvirt throw no error
- the tests do not "consider" the unsupported syscall
- I got just the same issues on Artful, but with more context pointing to that missing syscall
- You can use the setup described above (or sbuild) to debug further as there are 1-2 issues which seem to have other reasons e.g. "Xapian exception: read only files"
- the testcases might need several fixes, but one of them is surely to test not only if gdb is installed, but if it is working.

---

Hi Gianfranco,
Lacking a working arm system atm to test any further I tried this in qemu static.
Might be an odd setup, but I had it around from verifying another bug.

pbuilder-dist artful armhf create
pbuilder-dist artful armhf login
apt install ubuntu-dev-tools vim-nox fakeroot
pull-lp-source notmuch
# enable deb-src in /etc/apt/sources.list
apt update
apt-get build-dep notmuch
cd notmuch
debuild -uc -us

(tried the same in Trusty with the old version of notmuch)
Both had a few testcase errors - but not exactly the same ones you had.

But after the build, running dh_auto_test manually in that dir shows more output that can help.

With that approach I think I saw some issues that might be related, as a local dh_auto_test gives you more details.
- Xapian exception: read only files - that could be your DB errors

I quite often saw tests skipped because gdb was missing, on an older build environment that was around.
So for a test I installed gdb and reran the tests and found what might be related.
Gdb in the virtual environment (at least my qemu-static-armhf one) could not work due to an unsupported syscall.
The retval of gdb in that was 255, and that reminded me of your log:
 FAIL   success exit with --keep when add_message returns READ_ONLY_DATABASE
	--- T070-insert.33.expected	2016-08-31 07:10:21.960346786 +0000
	+++ T070-insert.33.output	2016-08-31 07:10:21.960346786 +0000
	@@ -1 +1 @@
	-0
	+255
While I saw:
 FAIL   success exit with --keep when add_message returns READ_ONLY_DATABASE
        exit code 255, expected 0 gdb --batch-silent --return-child-result -ex 'set args insert --keep < /usr/notmuch-0.25/test/tmp.T070-insert/mail/msg-018' -x index-file-READ_ONLY_DATABASE.gdb notmuch
qemu: Unsupported syscall: 26

Not exactly the same. The question is what is throwing the 255 in your case, but as I said, my setup seems insufficient to find that out. Maybe the "full" qemu running arm on arm supports that system call but fails differently?

I wondered, in your buildlog [1] I see shell syntax errors like:
./T380-atomicity.sh: line 79: ((: i < : syntax error: operand expected (error token is "< ")
But when running locally (before installing gdb) I saw:
missing prerequisites: gdb(1)
SKIP all tests in T380-atomicity
With gdb installed I get exactly your error on the "./T380-atomicity.sh: line 79:" case.
I also saw that at least back in Yakkety gdb was a build dep but with various arch restrictions:
Build-Depends: gdb-minimal, gdb [!s390x !ia64 !armel !ppc64el !mips !mipsel !mips64el]
Build-Conflicts: ruby1.8, gdb-minimal, gdb [s390x ia64 armel ppc64el mips mipsel mips64el]

Note - the same on a Trusty pbuilder target instead of Artful ran all but one test.
So maybe it is neither LP, nor qemu but some of the dependencies that broke this?

The build log of yours that I compared against was actually from Yakkety, so I finally picked that release to check whether this is fully reproducible.
There gdb could be around by default (just as it was in trusty for me) due to dependencies.
I streamlined the repro to pull in fewer extra deps:
pbuilder-dist yakkety armhf create
pbuilder-dist yakkety armhf login
# enable deb-src in /etc/apt/sources.list
apt update
apt-get source notmuch
apt-get build-dep notmuch
apt install fakeroot devscripts
cd notmuch
debuild -uc -us

And here things were confirmed: it was the unsupported system call, and on yakkety the output matches your case:
T060-count: Testing "notmuch count" for messages and threads
 FAIL   error message from query_search_messages
        --- T060-count.14.EXPECTED      2017-08-08 09:01:36.000000000 +0000
        +++ T060-count.14.OUTPUT.clean  2017-08-08 09:01:36.000000000 +0000
        @@ -1,3 +1 @@
        -notmuch count: A Xapian exception occurred
        -A Xapian exception occurred performing query
        -Query string was: *
        +qemu: Unsupported syscall: 26

T070-insert: Testing "notmuch insert"
 FAIL   error exit when add_message returns OUT_OF_MEMORY
        --- T070-insert.26.expected     2017-08-08 09:01:46.000000000 +0000
        +++ T070-insert.26.output       2017-08-08 09:01:46.000000000 +0000
        @@ -1 +1 @@
        -1
        +255
qemu: Unsupported syscall: 26
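
To see at a glance which syscalls a log complains about, the messages can be tallied. This is only a sketch: the helper name tally_unsupported_syscalls is made up for illustration, and the sample input is copied from the excerpts above.

```shell
# Hypothetical helper: read a build log on stdin and print how often each
# "qemu: Unsupported syscall: N" message occurs, ordered by syscall number.
tally_unsupported_syscalls() {
    sed -n 's/.*qemu: Unsupported syscall: \([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "syscall %s: %s\n", $2, $1 }'
}

# Example input: lines as they appear in the excerpts above.
tally_unsupported_syscalls <<'EOF'
        +qemu: Unsupported syscall: 26
        +255
qemu: Unsupported syscall: 26
EOF
```

With a real log, the same function could be fed from something like `zcat buildlog.txt.gz | tally_unsupported_syscalls`, where the file name is a placeholder.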

It might also be interesting that, despite showing the same issues as your build log, my run did not hit the final shell errors you had. The build continues despite the failed tests:

FATAL: ./T380-atomicity.sh: interrupted by signal -128
test/Makefile.local:64: recipe for target 'test' failed
make[2]: *** [test] Error 124
make[2]: Leaving directory '/root/notmuch-0.22.1'
dh_auto_test: make -j1 test VERBOSE=1 returned exit code 2
make[1]: Leaving directory '/root/notmuch-0.22.1'
 fakeroot debian/rules binary

And with that finally I found:
override_dh_auto_test:
ifeq ($(DEB_HOST_ARCH),armhf)
        TERM=vt100 dh_auto_test || true
Which means it is not meant/expected to work properly on armhf.

The difference that has to be found is why/how it breaks out of that || true.
OTOH that override is no longer there in later versions like Artful, so the fix might be to actually fix that for virtual environments that are armhf, but do not support that syscall.

Now that this is known, it might also be reproducible via sbuild, which I usually prefer, but I rarely have cross compile setups and struggled with it when trying for this case.

I lack the arm HW to check any further on an arm64 host in particular, but as outlined above I don't think it is needed.
IMHO the testcases might need a few fixes, one of them is to test not only if gdb is installed, but if it is working.
You can use the setup described above to debug further - is that ok for you?
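
The "installed vs. working" check for gdb could look roughly like the sketch below. It assumes that running a trivial program under gdb is a good enough probe; the function name gdb_is_usable is hypothetical and this is not how the notmuch testsuite currently does it. The gdb options are the ones already used by the failing test above.

```shell
# Hypothetical probe: succeed only if gdb exists AND can run a trivial child.
gdb_is_usable() {
    # Step 1: is there a gdb binary on PATH at all?
    command -v gdb >/dev/null 2>&1 || return 1
    # Step 2: can it actually start and trace a process?
    # --return-child-result makes gdb exit with the child's status,
    # so a working gdb running /bin/true should exit 0.
    gdb --batch-silent --return-child-result -ex run /bin/true \
        </dev/null >/dev/null 2>&1
}

if gdb_is_usable; then
    echo "gdb is usable, gdb-based tests can run"
else
    echo "gdb is missing or not working, gdb-based tests should be skipped"
fi
```

In an environment like the qemu-static one described here, where gdb exits with 255, the probe would fail and the tests could be skipped instead of reported as failures.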

[1]: https://launchpadlibrarian.net/281860948/buildlog_ubuntu-yakkety-armhf.notmuch_0.22.1-2ubuntu1_BUILDING.txt.gz

Changed in qemu (Ubuntu):
status:	New → Incomplete

Gianfranco Costamagna (costamagnagianfranco) wrote on 2017-08-08:

>And with that finally I found:
>override_dh_auto_test:
>ifeq ($(DEB_HOST_ARCH),armhf)
> TERM=vt100 dh_auto_test || true
>Which means it is not meant/expected to work properly on armhf.

Sure, I put that there *because* of this bug :) It is quite the opposite: since qemu or something else throws an illegal instruction, I had to disable the testsuite.

Please use the Debian package when doing things; the Ubuntu one has gdb missing and other hacks (and please use the latest version).

(I'm trying with the same setup as you, but I don't think I have the knowledge to trace down this bug further)

Gianfranco Costamagna (costamagnagianfranco) wrote on 2017-08-08:

+qemu: Unsupported syscall: 384
+qemu: Unsupported syscall: 26
+exit status: 255

I would say that missing syscall 384 and 26 are the culprit?

Changed in qemu (Ubuntu):
status:	Incomplete → Confirmed
Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

Gianfranco Costamagna (costamagnagianfranco) wrote on 2017-08-08:

well, some tests are failing just because 384 is not implemented on artful
T150-tagging: Testing "notmuch tag"
FAIL Xapian exception: read only files
--- T150-tagging.24.expected 2017-08-08 10:28:25.000000000 +0000
+++ T150-tagging.24.output 2017-08-08 10:28:25.000000000 +0000
@@ -1 +1 @@
-A Xapian exception occurred opening database
+

qemu: Unsupported syscall: 384

so, 26 is not a problem

384 is getrandom?
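
For reference, both numbers can be mapped to names with a small lookup. This is a sketch that assumes the 32-bit ARM EABI syscall table, where 26 is ptrace and 384 is getrandom; the function name arm_syscall_name is made up for illustration, and other architectures use different numbers.

```shell
# Hypothetical helper: map a few 32-bit ARM (EABI) syscall numbers to names.
# Only the two numbers seen in this report are listed; this is not a full table.
arm_syscall_name() {
    case "$1" in
        26)  echo "ptrace" ;;
        384) echo "getrandom" ;;
        *)   echo "unknown" ;;
    esac
}

for nr in 26 384; do
    echo "syscall $nr: $(arm_syscall_name "$nr")"
done
```

Read this way, 26 would be the call gdb depends on, which fits the gdb-based tests being the ones that print "Unsupported syscall: 26".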

summary:

- testsuite fails under qemu (SIGILL) works fine on real hw
+ testsuite fails under qemu (SIGILL) works fine on real hw [missing
+ getrandom 384 syscall]

Christian Ehrhardt  (paelzer) wrote on 2017-08-09:

I agree on 384 being getrandom and there are other cases like [1].
I don't know details but would consider the pure lack of the system call support a feature request to upstream qemu.
We could add a qemu upstream task here for that FR - in the worst case we are told why we are wrong - opinions?

For the case itself, I'd recommend bringing back this section for now:
ifeq ($(DEB_HOST_ARCH),armhf)
TERM=vt100 dh_auto_test || true

Which maybe was there for similar reasons.

[1]: https://users.rust-lang.org/t/missing-system-calls-when-running-tests-under-qemu-arm-at-travisci/5013

Gianfranco Costamagna (costamagnagianfranco) wrote on 2017-08-09:

#10

>I agree on 384 being getrandom and there are other cases like [1].

I did the same research :)

>I don't know details but would consider the pure lack of the system call support a feature request to upstream qemu.
>We could add a qemu upstream task here for that FR - in the worst case we are told why we are wrong - opinions?

yes please!

>On the case itself for now I'd recommend to get back the section:
>ifeq ($(DEB_HOST_ARCH),armhf)
> TERM=vt100 dh_auto_test || true
>
>Which maybe was there for similar reasons.

I think this used to hang the builders, IIRC, but seems to pass now
https://launchpadlibrarian.net/332584378/buildlog_ubuntu-artful-armhf.notmuch_0.25-4ubuntu1_BUILDING.txt.gz

uploaded in artful

Riku Voipio (riku-voipio) wrote on 2017-08-10:

#11

There are two issues being mixed up here

1) launchpad buildd changes.

The notmuch build system appears to be confused by the new environment. It appears Ubuntu has chosen the "armhf chroot on arm64 machine" approach, which would mean qemu system call emulation is not involved. If it is, it is a buildd configuration error.

My suspicion is that the notmuch testsuite gets confused in the armhf-on-arm64 setup. This setup is a bit shoddy and launchpad should really run the armhf builders with an armhf kernel, which kvm on an arm64 host can easily do.

2) The "unsupported syscall" error.

The pbuilder-based cross-build uses qemu linux-user, so the build env is not equivalent to the one on launchpad.

syscall 384 is getrandom, which qemu does support. Your qemu may be too old.

Gianfranco Costamagna (costamagnagianfranco) on 2017-08-10

Changed in launchpad-buildd:
status:	Invalid → New
Changed in linux (Ubuntu):
status:	Incomplete → Invalid

Gianfranco Costamagna (costamagnagianfranco) wrote on 2017-08-10:

#12

Well, LP builders have qemu 2.5, so this might be true for the syscall.

However, I appreciate the first point, this is in-line with my expectations and might be solvable by launchpad buildd team admins.

Can you please have a deeper look now?

thanks!

Colin Watson (cjwatson) wrote on 2017-09-06:

#13

qemu isn't involved. We intentionally run armhf builds on an arm64 kernel (with an appropriate personality set) because this allows us to make denser use of our build resources; I don't expect this to change.
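
A quick way to see this kind of setup from inside a build chroot is to compare what the kernel reports with the userland word size. The sketch below assumes that an arm64 kernel reports `armv8l` to processes running with a 32-bit personality and `aarch64` otherwise; describe_env is a made-up helper name, and the classification is deliberately coarse.

```shell
# Hypothetical helper: classify an environment from `uname -m` output ($1)
# and the userland word size as printed by `getconf LONG_BIT` ($2).
describe_env() {
    machine=$1
    bits=$2
    case "$machine/$bits" in
        armv8l/32)  echo "32-bit userland, arm64 kernel with 32-bit personality" ;;
        aarch64/32) echo "32-bit userland, arm64 kernel without 32-bit personality" ;;
        aarch64/64) echo "64-bit userland, arm64 kernel" ;;
        armv7l/32)  echo "32-bit userland, kernel reports armv7l" ;;
        *)          echo "unclassified: machine=$machine bits=$bits" ;;
    esac
}

# Classify the environment this script runs in.
describe_env "$(uname -m)" "$(getconf LONG_BIT)"
```

A testsuite or tool that branches on `uname -m` could behave differently between the old armv7l builders and this setup, which is one thing worth checking.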

Gianfranco Costamagna (costamagnagianfranco) wrote on 2017-09-06:

#14

Interesting, so somehow the system is triggering an illegal arm64 instruction? Could it be that gdb needs somebody to tell it to use the armhf platform?

Christian Ehrhardt  (paelzer) wrote on 2018-06-27:

#15

Setting the qemu task to invalid per Colin's explanation that the LP case doesn't use it.

Changed in qemu (Ubuntu):
status:	Confirmed → Invalid
