devices on devel-proposed/ubuntu do not boot with systemd 227-2ubuntu1

Bug #1512323 reported by Jean-Baptiste Lallement
This bug affects 12 people
Affects                 Status        Importance  Assigned to     Milestone
Canonical System Image  Fix Released  Critical    Steve Langasek
android (Ubuntu)        Triaged       High        Unassigned
systemd (Ubuntu)        Fix Released  Critical    Steve Langasek

Bug Description

Last known good build: mako devel-proposed/ubuntu r336

Affects builds from devel-proposed/ubuntu starting from first Xenial build.

Test Case:
Boot in fastboot mode and flash with:
$ ubuntu-device-flash -v touch --channel=ubuntu-touch/devel-proposed/ubuntu --bootstrap
or
upgrade from 336 to latest devel-proposed image

Actual Result
During a successful flashing operation from fastboot, the device boots once into recovery, then reboots and displays a rotating Ubuntu logo while it is being flashed.
With this issue, on the second stage of the flash the phone is stuck on the vendor's logo (Google or bq) and the rotating Ubuntu logo is never displayed.

Same problem on krillin devel-proposed/krillin.en 235

Workaround:
Boot into recovery, mount the system partition and downgrade systemd to 225-1ubuntu9

summary: - Cannot filash mako with devel-proposed/ubuntu - stuck on google logo
+ Cannot flash mako with devel-proposed/ubuntu - hangs on google logo
+ during flashing process
description: updated
Dave Morley (davmor2) wrote : Re: Cannot flash mako with devel-proposed/ubuntu - hangs on google logo during flashing process

Confirming this issue, on mako on devel-proposed.

Changed in canonical-devices-system-image:
status: New → Confirmed
Changed in canonical-devices-system-image:
importance: Undecided → Critical
summary: - Cannot flash mako with devel-proposed/ubuntu - hangs on google logo
- during flashing process
+ Cannot flash devices with devel-proposed/ubuntu - hangs on
+ manufacturer's logo during flashing process
description: updated
Barry Warsaw (barry) wrote : Re: Cannot flash devices with devel-proposed/ubuntu - hangs on manufacturer's logo during flashing process

So here's what happens for me on my bq.

$ ubuntu-device-flash -v touch --channel=ubuntu-touch/devel-proposed/ubuntu --bootstrap

Then I see (this is the second invocation so it uses the cached files):

% ubuntu-device-flash -v touch --channel=ubuntu-touch/devel-proposed/ubuntu --bootstrap
2015/11/10 16:30:21 Expecting the device to be in the bootloader... waiting
2015/11/10 16:30:21 Device is |krillin|
2015/11/10 16:30:22 Flashing version 367 from ubuntu-touch/devel-proposed/ubuntu channel and server https://system-image.ubuntu.com to device krillin
Failed to enter Recovery

This happens while the screen is stuck on the non-spinning Ubuntu logo.

After a while, the boot seems to time out and it reboots successfully, but what's on the device appears to be the original image:

% adb shell
phablet@ubuntu-phablet:~$ system-image-cli --version
system-image-cli 3.0.2
phablet@ubuntu-phablet:~$ system-image-cli --info
current build number: 170
device name: krillin
channel: ubuntu-touch/rc-proposed/bq-aquaris.en
last update: 2015-11-10 16:17:07
version version: 170
version ubuntu: 20151110
version device: 20150821-736d127
version custom: 20151110--36-43-vivid

Barry Warsaw (barry) wrote :

Not that I expected anything, but /var/log/system-image/client.log looks fine. I wonder what a switch-channel operation in system-image-cli will do...

Jean-Baptiste Lallement (jibel) wrote :

@Barry, you have to flash with an adb-enabled recovery image, otherwise the system cannot enter recovery to apply the image.

The adb-enabled recovery image is here:
http://people.canonical.com/~jhm/barajas/recovery.img

and the command to flash from fastboot is:
ubuntu-device-flash touch --bootstrap --channel ubuntu-touch/devel-proposed/krillin.en --recovery-image recovery.img

Alternatively you can flash a mako which doesn't require a specific recovery image.

Barry Warsaw (barry) wrote :

system-image-cli -vv --switch ubuntu-touch/devel-proposed/ubuntu

On reboot I did see the spinning Ubuntu logo for quite a long time, but eventually I got the bq screen, where it now seems to be hung.

Barry Warsaw (barry) wrote :

With the help of folks on IRC, I've unbricked my phone. However you say:

> Last known good build: mako devel-proposed/ubuntu r336

but flashing to r336 does not give me functional wifi. I cannot connect to my SSID, either through the UI or the command line.

Jean-Baptiste Lallement (jibel) wrote :

"Last known good build" refers to the problem described in this report (flashing a device with devel-proposed fails), not the general status of devel-proposed, which is far from good.
So yes, wifi might be broken on 336, and so are scopes, several apps and other components, but with 337 (the first xenial build) and higher the device cannot even be flashed.

I just tried with build devel-proposed/ubuntu 348 on mako, and the flashing process still hangs on the Google logo.

Barry Warsaw (barry) wrote :

Through a painful process of bisecting, I've found that r356 is the problematic release. r355 boots my phone fine, but r356 hangs at the vendor logo. In all cases I did a completely clean reflash. So now I'll pick apart the r355-356 delta and see what might have changed, and also verify that flashing to r355 and upgrading to r356 demonstrates the problem.

Barry Warsaw (barry) wrote :

Here's an ls -lR of the system/ directory from the 355-356 delta: http://paste.ubuntu.com/13231503/

Here's a cat of the `removed` file: http://paste.ubuntu.com/13231514/

Steve Langasek (vorlon) wrote :

Barry, can you get a package-wise description of the delta between these images, please? It should be sufficient to compare the contents of /var/lib/dpkg/status (not with diff directly, which would cope badly with reordered contents, but with dpkg -l --admindir).

Timo Jyrinki (timo-jyrinki) wrote :

336 is fine on mako including wifi, so I was able to test some things (wily only of course).

Jean-Baptiste Lallement (jibel) wrote :

Here is a diff between packages on devel-proposed/ubuntu/mako 336 and devel-proposed/ubuntu/mako 337

Jean-Baptiste Lallement (jibel) wrote :

On the broken device there are 3 crash files:
systemd-logind
systemd-udevd
urfkilld

Jean-Baptiste Lallement (jibel) wrote :

On devel-proposed/ubuntu mako 337 the system boots successfully after downgrading the following packages from 227-2ubuntu1 to 225-1ubuntu9

libpam-systemd:armhf
libsystemd0:armhf
libudev1:armhf
systemd
udev

Changed in systemd (Ubuntu):
importance: Undecided → Critical
Changed in canonical-devices-system-image:
assignee: nobody → Steve Langasek (vorlon)
summary: - Cannot flash devices with devel-proposed/ubuntu - hangs on
- manufacturer's logo during flashing process
+ devices on devel-proposed/ubuntu do not boot with systemd 227-2ubuntu1
Steve Langasek (vorlon) wrote :

Martin, it looks like we have a critical regression in systemd for the phone in xenial (which of course didn't stop in xenial-proposed now that we have no automated phone boot tests). Please look at this ASAP.

Changed in systemd (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
Jean-Baptiste Lallement (jibel) wrote :
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Steve Langasek (vorlon) wrote :

The stack trace points at the new hashtable functionality within systemd.

The crash shows udevd is crashing with SIGBUS.

The new siphash24.c code is full of places where a uint8_t pointer is dereferenced as if it pointed to a uint64_t value, with the result passed to le64toh(), which is defined as:

static inline uint64_t le64toh(le64_t value) { return bswap_64_on_be((uint64_t __force)value); }

There is nothing in this code that guarantees 64-bit alignment of the source pointer. Dereferencing an unaligned pointer as a 64-bit int is non-portable, and it's precisely a SIGBUS that is raised in the case of unaligned access.
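
For illustration, here is a minimal sketch of one portable way to read a little-endian 64-bit value from a buffer with no alignment guarantee. The helper name read_le64_unaligned is hypothetical and is not taken from systemd or from the patch for this bug; it only shows the general technique of assembling the value from single-byte loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not systemd code: read a little-endian 64-bit
 * value from a pointer that may have any alignment. */
uint64_t read_le64_unaligned(const void *p) {
        const uint8_t *b = p;
        uint64_t v = 0;

        assert(p != NULL);

        /* Only single-byte loads are issued, so no 64-bit alignment is
         * required, and the result is the same on little- and
         * big-endian hosts. */
        for (int i = 7; i >= 0; i--)
                v = (v << 8) | b[i];

        return v;
}
```

A pointer into the middle of a string, such as one returned by basename(), can be passed to a helper like this without risking SIGBUS.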

Barry Warsaw (barry) wrote :

Oh wow, Launchpad just stopped sending me emails on this issue and I got sidetracked on some changes to system-image that I thought might help debug the problem. At least it looks like y'all have some good leads on the root cause though.

Steve Langasek (vorlon) wrote :

In the failing case, the argument to le64toh() is a result returned by basename(). Yeah, no guarantee of alignment >= 1 byte when taking a pointer into the middle of a filename.

Steve Langasek (vorlon) wrote :

Well, no guarantee of alignment > 1 byte. There is a guarantee of 1 byte alignment.

Martin Pitt (pitti)
Changed in systemd (Ubuntu):
status: Confirmed → In Progress
Steve Langasek (vorlon) wrote :

I've assembled a test case for systemd, but I can't get it to fail on the armhf porter box or under qemu. Architecture documentation suggests that unaligned 64-bit reads/writes with ldrd/strd are allowed on ARMv7. But an unaligned access is still exactly what SIGBUS is supposed to represent, which leaves it unclear what's happening here.

The crash has been reported to errors.u.c from a variety of devices, running a variety of kernels; including mako, flo, and some devices not running phone kernels.

Attaching the disassembly of siphash24_compress from the udevd in the archive, for reference.

Steve Langasek (vorlon) wrote :

And here is a disassembly of a locally-built siphash24_compress, built using the same toolchain as was used for building systemd, which I have been unable to get to crash with unaligned input.

The code is identical with only differences in the addresses, except for this rather surprising bit at the end:

 nop
-andeq r12, r1, r10, lsr #16
-muleq r1, r4, r9
-andeq r7, r1, lr, lsr #19
+andeq r2, r0, r10, ror #25
+ ; <UNDEFINED> instruction: 0x000017b0
+ldrdeq r1, [r0], -r6
 End of assembler dump.
 (gdb)

Not sure what to make of that. But the added ldrdeq is in the working code, not the code that has crashes reported against it; so it doesn't seem to be relevant.

Steve Langasek (vorlon) wrote :

Investigation shows that the test case is triggering kernel fix-ups for unaligned access, as shown by the incrementing counters in /proc/cpu/alignment on cady. I'm investigating to see if the phone kernels have a different default behavior for alignments (SIGBUS vs. fixup).

So I have a valid test case, it just unfortunately won't block the build with a test failure on our buildds because the kernel is fixing up the unaligned access for us. On other architectures we could build-depend on and use prctl to force SIGBUS to be raised, but ARM doesn't support prctl --unaligned.
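
For reference, a sketch of what the prctl approach could look like in a test on an architecture that implements it. PR_SET_UNALIGN and PR_UNALIGN_SIGBUS are the constants from <sys/prctl.h> on Linux; the wrapper name request_sigbus_on_unaligned is made up for illustration. On architectures without support (ARM included, as noted above) the call is expected to fail, typically with EINVAL.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Hypothetical wrapper: ask the kernel to deliver SIGBUS on unaligned
 * user accesses for this process. Returns 0 on success, or the errno
 * value if the architecture does not support the request. */
int request_sigbus_on_unaligned(void) {
        if (prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS, 0, 0, 0) == 0)
                return 0;

        return errno;
}
```

A test would call this first and treat a non-zero return as "cannot enforce here" rather than as a failure.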

I have also verified that booting the emulator shows /proc/cpu/alignment configured for signal (4) instead of fixup (2). This despite the fact that CONFIG_ALIGNMENT_TRAP=y is part of the common configuration for Ubuntu ARM kernels (config/config.common.ubuntu), and is definitely configured that way at build and runtime in the linux-goldfish package in the archive.

That means that *something* in the phone stack is changing the value of /proc/cpu/alignment post-boot. But I don't know what this "something" is; there are no matches for /proc/cpu/alignment in /etc on the generic image, nor in /android. Google points at it possibly being an android-inflicted problem. (https://groups.google.com/forum/#!topic/android-kernel/5vl47DgDz7E)

Now, raising signals on unaligned access is a sensible default behavior. But we currently have this exactly backwards, because we are fixing up those unaligned accesses on our build and test infrastructure but have them causing software failures in production! Someone from the phone team should investigate where this wrong setting is coming from and fix it.
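
To make the check above repeatable, here is a rough sketch that extracts the mode from /proc/cpu/alignment. It assumes the file contains a line of the form "User faults:<tab>2 (fixup)", with bit values 1 (warn), 2 (fixup) and 4 (signal) as described in the ARM kernel documentation; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Bit values assumed for the "User faults" mode on ARM. */
enum { UM_WARN = 1, UM_FIXUP = 2, UM_SIGNAL = 4 };

/* Hypothetical helper: find the "User faults:" line in the text of
 * /proc/cpu/alignment and return its numeric mode, or -1 if the line
 * is missing or malformed. */
int parse_user_faults_mode(const char *text) {
        static const char key[] = "User faults:";
        const char *p = strstr(text, key);
        int mode;

        if (!p || sscanf(p + sizeof(key) - 1, "%d", &mode) != 1)
                return -1;

        return mode;
}

/* Read the live setting. Returns -1 where the file does not exist,
 * for example on non-ARM machines. */
int read_alignment_mode(void) {
        char buf[512];
        size_t n;
        FILE *f = fopen("/proc/cpu/alignment", "r");

        if (!f)
                return -1;

        n = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[n] = '\0';

        return parse_user_faults_mode(buf);
}
```

With something like this, a mode with UM_SIGNAL set would correspond to the behavior seen in the emulator, and UM_FIXUP to the behavior on the porter box.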

Changed in systemd (Ubuntu):
assignee: Martin Pitt (pitti) → Steve Langasek (vorlon)
status: In Progress → Fix Committed
Steve Langasek (vorlon) wrote :

Bug #1516331 opened against launchpad, to get the buildd behavior changed to catch issues like this in the future.

Steve Langasek (vorlon)
Changed in android (Ubuntu):
status: New → Triaged
importance: Undecided → High
Martin Pitt (pitti) wrote :

Thanks Steve for pointing out /proc/cpu/alignment! Yesterday I tried to reproduce this on an armhf box without success, but with "echo 4 > /proc/cpu/alignment" this reproduces perfectly well. I'll forward your patch upstream.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 227-2ubuntu2

---------------
systemd (227-2ubuntu2) xenial; urgency=medium

  * debian/patches/siphash24-unaligned.patch: fix siphash24
    implementation to handle unaligned inputs. Closes LP: #1512323.

 -- Steve Langasek <email address hidden> Sat, 14 Nov 2015 23:22:39 +0000

Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released
Martin Pitt (pitti) wrote :

I cleaned this up a bit and forwarded to https://github.com/systemd/systemd/pull/1911 .

Without this patch, about a third of the unit tests fail with SIGBUS. With this fix, only ./test-dhcp{,6}-client still SIGBUS, apparently not due to siphash24. I'll take a look at this as well.

Martin Pitt (pitti) wrote :

It's still in siphash, due to a similar problem with the out argument:

0x2a02a9e8 in siphash24_finalize (out=0x2a065229 "", state=0xbefff970) at src/basic/siphash24.c:182
182 *(le64_t*)out = htole64(state->v0 ^ state->v1 ^ state->v2 ^ state->v3);

I'll follow up on the upstream PR.

Martin Pitt (pitti) wrote :

I fixed the unaligned out parameter now in the upstream PR. 228 is around the corner, so the simplest way would be to let this land upstream and get fixed through a new upstream version, but if this turns out to cause further blocking crashes I can also cherry-pick it.

Steve Langasek (vorlon) wrote : Re: [Bug 1512323] Re: devices on devel-proposed/ubuntu do not boot with systemd 227-2ubuntu1

On Mon, Nov 16, 2015 at 08:03:57AM -0000, Martin Pitt wrote:
> It's still in siphash, due to a similar problem with the out argument:

> 0x2a02a9e8 in siphash24_finalize (out=0x2a065229 "", state=0xbefff970) at src/basic/siphash24.c:182
> 182 *(le64_t*)out = htole64(state->v0 ^ state->v1 ^ state->v2 ^ state->v3);

> I'll follow up on the upstream PR.

That is a bug in siphash24, but the only place where an unaligned 'out'
argument is passed to siphash24_finalize() in practice is the test case, so
that's not urgent to fix.

But certainly, siphash24_finalize() needs to be fixed either by changing the
argument to uint64_t* or by handling unaligned buffers.
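
As a sketch of the second option (handling unaligned buffers), the store could be done byte by byte. The helper name write_le64_unaligned is hypothetical and this is not the upstream change; it only illustrates the technique.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the upstream fix: store v in little-endian
 * byte order into an 8-byte buffer with arbitrary alignment. */
void write_le64_unaligned(uint8_t *out, uint64_t v) {
        assert(out != NULL);

        /* Single-byte stores only, so the alignment of 'out' does not
         * matter and no byte swap is needed on any host. */
        for (int i = 0; i < 8; i++) {
                out[i] = (uint8_t)(v & 0xff);
                v >>= 8;
        }
}
```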

Martin Pitt (pitti) wrote :

> the only place where an unaligned 'out' argument is passed to siphash24_finalize() in practice is the test case

No, also in networkd. Anyway, fixed upstream in https://github.com/systemd/systemd/commit/dbe81cbd2a9 . A more robust fix for the main issue now landed as well (avoiding the malloc, which wasn't checked and could potentially fail). Upstream master now works fine on both x86 and ARM.

Steve Langasek (vorlon) wrote :

On Mon, Nov 16, 2015 at 03:49:15PM -0000, Martin Pitt wrote:
> > the only place where an unaligned 'out' argument is passed to
> siphash24_finalize() in practice is the test case

> No, also in networkd.

The code I see here in src/libsystemd-network/sd-dhcp-server.c is:

                        uint64_t hash;
                        siphash24_finalize((uint8_t*)&hash, &state);

which has no alignment issue.

Martin Pitt (pitti) wrote :

No, this was more subtle: it's the setting of duid->en.id in https://github.com/systemd/systemd/commit/dbe81cbd2a9#diff-893ccaa839a00a7a16a80dbc02631270L54 . This was caught by the two DHCP test cases. That struct uses __attribute__((__packed__)) unions with a uint32_t preceding the "id" field, and has an odd address, so is always unaligned.
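
To illustrate the effect, here is a simplified packed layout. It is not the systemd definition: the struct and field names are hypothetical, and the attribute syntax assumes GCC or Clang. It shows how a packed field that follows smaller members ends up at an offset that is not a multiple of 8, and how memcpy avoids the misaligned 64-bit access.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: a packed layout in which 'id' follows a 16-bit
 * and a 32-bit field, so it sits at offset 6 within the struct. */
struct example_duid {
        uint16_t type;
        struct {
                uint32_t pen;
                uint8_t id[8];
        } __attribute__((__packed__)) en;
} __attribute__((__packed__));

/* Copy a 64-bit value into the packed field without ever forming a
 * uint64_t pointer to it. */
void example_duid_set_id(struct example_duid *d, uint64_t v) {
        memcpy(d->en.id, &v, sizeof(v));
}

/* Read the value back the same way. */
uint64_t example_duid_get_id(const struct example_duid *d) {
        uint64_t v;

        memcpy(&v, d->en.id, sizeof(v));
        return v;
}
```

Casting d->en.id to a 64-bit pointer and writing through it would be the pattern that traps when the kernel is configured to signal on unaligned access.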

Changed in canonical-devices-system-image:
status: Confirmed → Fix Committed
status: Fix Committed → Fix Released