Some restarts fail due to missing base devices

Bug #1623383 reported by Stefan Bader
Affects: systemd (Ubuntu)
Status: Won't Fix
Importance: Critical
Assigned to: Dimitri John Ledkov
Milestone: (none)

Bug Description

Arch: s390x
Release: Yakkety / 16.10

This happens on some (but not all) system starts with Yakkety. It did not happen in Xenial (which uses the same 4.4 kernel version the Yakkety systems were running when the problem was first observed). The system (an LPAR) on which this was first seen was an upgrade from Xenial, but it has since been freshly installed with Yakkety. The same behaviour is seen on a zVM guest running Yakkety.

The attached syslog shows a failed boot, followed by one that did work. Note the "Found device .*(sclp|encc00).*" messages in the good boot. They are missing in the bad attempt, and as a result networking and the console are unusable. Also note that those boots used 4.8 kernels, but we saw this with 4.4 kernels, too.
This might be a systemd problem or race; I filed it against udev for now, as that better matches the symptom of basic devices not being found.

Tags: s390x yakkety
Revision history for this message
Stefan Bader (smb) wrote :
affects: ubuntu → udev (Ubuntu)
Frank Heimes (fheimes)
tags: added: s390x
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in udev (Ubuntu):
status: New → Confirmed
Revision history for this message
Brad Figg (brad-figg) wrote :

This issue is now blocking kernel regression testing for these s390 variants.

Changed in udev (Ubuntu):
importance: Undecided → Critical
Revision history for this message
Stefan Bader (smb) wrote :

For testing I enabled "debug" on the kernel command line, since that has some influence on how systemd acts. However, I found both the LPAR and the zVM guest unreachable this morning, which is a sign they did not survive being rebooted.
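
For reference, a minimal sketch of how such a flag is added on s390x (assuming the default zipl bootloader; the existing parameters are a placeholder):

# append "debug" to the parameters= line of the active section in /etc/zipl.conf, e.g.
#   parameters = "<existing parameters> debug"
sudo vi /etc/zipl.conf
sudo zipl     # reinstall the boot loader so the new parameters take effect
sudo reboot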

Revision history for this message
Stefan Bader (smb) wrote :

This is the journal from a failed boot of a zVM guest (compressed, as it would otherwise be 13M). One thing that should perhaps be noted: we have some zfcp disks enabled on the LPAR and the zVM guest. At least for the zVM guest, that is the same configuration as for the guest running Xenial.

Revision history for this message
Stefan Bader (smb) wrote :

enc600 is the qeth net device of the zVM guest. There seem to be issues with it visible in the log.

Revision history for this message
Stefan Bader (smb) wrote :
tags: added: xenial
tags: added: yakkety
removed: xenial
Changed in udev (Ubuntu):
assignee: nobody → Dimitri John Ledkov (xnox)
Revision history for this message
Stefan Bader (smb) wrote :

Spending more time looking at the two logs, I saw that in the failed-boot case
systemd-modules-load.service never stops. The last message there is that the ib_iser module (the second of two) got loaded, and that is all.
When the boot succeeds, the systemd init process receives a SIGCHLD from systemd-modules-load.service.
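
A quick way to confirm a stuck oneshot from a shell, if the system is still reachable (a sketch, not taken from the logs above):

systemctl list-jobs                              # a stuck oneshot shows up as a "start ... running" job
systemctl status systemd-modules-load.service    # "activating (start)" instead of "active (exited)"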

Revision history for this message
Stefan Bader (smb) wrote :

I took a dump of another run that ended in the failed state. While not all messages that are in the journal show up in the dmesg log from a dump, there does not seem to be a message about systemd noticing the exit of systemd-modules-load; however, I also cannot see the process listed in the ps output in crash. The ib_iser module is loaded.
So, together with the previous info, this feels more like a communication issue between the systemd main process and the loader process. And one that does not happen every time.
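
The checks in crash look roughly like this (a sketch; the vmcore and debug-vmlinux paths are placeholders for the real ones):

crash /usr/lib/debug/boot/vmlinux-<version> /var/crash/<dump>
crash> ps | grep modules-load    # the loader process is no longer listed
crash> mod | grep ib_iser        # ...yet the module did get loaded
crash> log | grep systemd        # kernel messages captured in the dump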

Revision history for this message
Stefan Bader (smb) wrote :

Still happened, even without any modules getting loaded by that service. Maybe the specific service is a red herring and it is something else which just shows up there because the service is in the critical path. To make things clearer, here are the messages related to it from boot -1 (a failed one) and boot 0 (a successful one); a sketch of how module loading was disabled follows the logs:

ubuntu@s2lp4:~$ journalctl -b-1|grep modules-load
Sep 22 08:13:54 s2lp4 systemd[582]: systemd-modules-load.service: Executing: /lib/systemd/systemd-modules-load
Sep 22 08:13:54 s2lp4 systemd-modules-load[582]: apply: /etc/modules-load.d/modules.conf
Sep 22 13:15:44 s2lp4 systemd[1]: systemd-modules-load.service: Failed to send unit remove signal for systemd-modules-load.service: Transport endpoint is not connected
Sep 22 13:15:44 s2lp4 systemd[1]: systemd-modules-load.service: Changed dead -> exited
Sep 22 13:15:45 s2lp4 systemd[1]: systemd-modules-load.service: Failed to send unit change signal for systemd-modules-load.service: Transport endpoint is not connected

ubuntu@s2lp4:~$ journalctl -b0|grep modules-load
Sep 22 23:44:44 s2lp4 systemd[1613]: systemd-modules-load.service: Executing: /lib/systemd/systemd-modules-load
Sep 22 23:44:44 s2lp4 systemd-modules-load[1613]: apply: /etc/modules-load.d/modules.conf
Sep 22 23:44:48 s2lp4 systemd[1]: Got cgroup empty notification for: /system.slice/systemd-modules-load.service
Sep 22 23:44:48 s2lp4 systemd[1]: systemd-modules-load.service: cgroup is empty
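
For reference, a sketch of one way the module loading could have been disabled (an assumption on my part; systemd-modules-load only reads *.conf files, so renaming the file is enough):

sudo mv /etc/modules-load.d/modules.conf /etc/modules-load.d/modules.conf.disabled
sudo reboot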

Revision history for this message
Stefan Bader (smb) wrote :

Adding the setup of the zVM guest that shows the problem, in case that helps to re-create it:

Memory: 4G, 8 vCPUs

Device Subchan. DevType CU Type Use PIM PAM POM CHPIDs
----------------------------------------------------------------------
0.0.e002 0.0.0000 1732/03 1731/03 yes 80 80 ff 20000000 00000000
0.0.e102 0.0.0001 1732/03 1731/03 yes 80 80 ff 21000000 00000000
0.0.0200 0.0.0002 3390/0c 3990/e9 yes 80 80 ff ff000000 00000000
0.0.0201 0.0.0003 3390/0c 3990/e9 yes 80 80 ff ff000000 00000000
0.0.0009 0.0.0004 0000/00 3215/00 yes 80 80 ff ff000000 00000000
0.0.000c 0.0.0005 0000/00 2540/00 80 80 ff ff000000 00000000
0.0.000d 0.0.0006 0000/00 2540/00 80 80 ff ff000000 00000000
0.0.000e 0.0.0007 0000/00 1403/00 80 80 ff ff000000 00000000
0.0.0600 0.0.0008 1732/01 1731/01 yes 80 80 ff 00000000 00000000
0.0.0601 0.0.0009 1732/01 1731/01 yes 80 80 ff 00000000 00000000
0.0.0602 0.0.000a 1732/01 1731/01 yes 80 80 ff 00000000 00000000

0.0.e002 and e102 are two zfcp adapters leading to one disk via two paths each. The resulting four paths are grouped by multipath and make up an LVM2 VG, but nothing is mounted from there by default.

lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdd 8:48 0 10G 0 disk
`-36005076306ffd6b60000000000002a03 254:1 0 10G 0 mpath
dasda 94:0 0 6.9G 0 disk
`-dasda1 94:1 0 6.9G 0 part /
sdb 8:16 0 10G 0 disk
`-36005076306ffd6b60000000000002a03 254:1 0 10G 0 mpath
dasdb 94:4 0 20.6G 0 disk
`-dasdb1 94:5 0 20.6G 0 part
  `-datavg-home 254:0 0 20.6G 0 lvm /home
sdc 8:32 0 10G 0 disk
`-36005076306ffd6b60000000000002a03 254:1 0 10G 0 mpath
sda 8:0 0 10G 0 disk
`-36005076306ffd6b60000000000002a03 254:1 0 10G 0 mpath

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Sep 22 13:15:45 s2lp4 systemd[1]: systemd-modules-load.service: Failed to send unit change signal for systemd-modules-load.service: Transport endpoint is not connected

This pretty much means that pid1 stopped listening on /run/systemd/private. It does not receive the exited notification, does not mark systemd-modules-load as started, and everything fails from then on.

The output of lvs & vgs would be helpful too.

Revision history for this message
Stefan Bader (smb) wrote :

vgs:
  VG #PV #LV #SN Attr VSize VFree
  datavg 1 1 0 wz--n- 20.63g 0
  zfcpvg 1 0 0 wz--n- 10.00g 10.00g

lvs:
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
  home datavg -wi-ao---- 20.63g

Revision history for this message
Stefan Bader (smb) wrote :

Changing the type of systemd-modules-load from oneshot to simple might work around the problem (a drop-in sketch is at the end of this comment). At least this output looks like it would have been a hang:

* systemd-modules-load.service - Load Kernel Modules
   Loaded: loaded (/lib/systemd/system/systemd-modules-load.service; static; vendor preset: enabled)
   Active: active (exited) since Fri 2016-09-23 09:57:49 BST; 5h 12min ago
     Docs: man:systemd-modules-load.service(8)
           man:modules-load.d(5)
 Main PID: 546 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/systemd-modules-load.service

Sep 23 09:57:57 kernel02 systemd[1]: systemd-modules-load.service: Failed to send unit remove signal for systemd-modules-load.service: Transport endpoint is not connected
Sep 23 09:57:57 kernel02 systemd[1]: systemd-modules-load.service: Changed dead -> exited

I can ssh into the system, and netstat -an shows something listening on /run/systemd/private:

netstat -an|grep private
unix 2 [ ACC ] STREAM LISTENING 31143 /run/user/1000/systemd/private
unix 2 [ ACC ] STREAM LISTENING 9536 /run/systemd/private
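
For anyone who wants to try the oneshot-to-simple change, a minimal drop-in sketch (my assumption of how to apply it, not a tested fix):

# /etc/systemd/system/systemd-modules-load.service.d/override.conf
[Service]
Type=simple

# then reload and reboot:
sudo systemctl daemon-reload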

Revision history for this message
Stefan Bader (smb) wrote :

Is that normal? However, I also see those disconnects in the journal of a good boot...

journalctl -b0|grep private
Sep 23 09:57:51 kernel02 systemd-tmpfiles[1936]: Entry "/tmp/systemd-private-4f4f7561b59a4b06a720204adeaff628-*" does not match any include prefix, skipping.
Sep 23 09:57:51 kernel02 systemd-tmpfiles[1936]: Entry "/tmp/systemd-private-4f4f7561b59a4b06a720204adeaff628-*/tmp" does not match any include prefix, skipping.
Sep 23 09:57:51 kernel02 systemd-tmpfiles[1936]: Entry "/var/tmp/systemd-private-4f4f7561b59a4b06a720204adeaff628-*" does not match any include prefix, skipping.
Sep 23 09:57:51 kernel02 systemd-tmpfiles[1936]: Entry "/var/tmp/systemd-private-4f4f7561b59a4b06a720204adeaff628-*/tmp" does not match any include prefix, skipping.
Sep 23 09:57:54 kernel02 systemd[1]: Accepted new private connection.
Sep 23 09:57:54 kernel02 systemd[1]: Accepted new private connection.
Sep 23 09:57:54 kernel02 systemd[1]: Got disconnect on private connection.
Sep 23 09:57:54 kernel02 systemd[1]: Accepted new private connection.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

#15 -> those things look fine.

affects: udev (Ubuntu) → systemd (Ubuntu)
Revision history for this message
Frank Heimes (fheimes) wrote :

Looks like this can happen on zVM guests and KVM VMs, too.
(could be somehow related to zfs ...)

Revision history for this message
Frank Heimes (fheimes) wrote :

After further investigation the issue could be related to udev.

Would it be possible to start the systems with kernel parameters udev.log_priority and rd.udev.log_priority set to debug and share again the logs from a failed system?
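
That would be something like this on the kernel command line (again via /etc/zipl.conf plus a zipl run on s390x, as with the earlier "debug" flag):

udev.log_priority=debug rd.udev.log_priority=debug

# after a failed boot, with the persistent journal enabled:
journalctl -b -1 > failed-boot.log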

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Hello Frank, I have set udev.log_priority and rd.udev.log_priority to debug on kernel03 with smb's help. Let's wait for it to happen again.

Revision history for this message
Ian Thompson (ithompson) wrote :

I have the same problem with two Dell PowerEdge FC630 servers.
Ubuntu 16.04.2 LTS / 4.4.0-66-generic

Shutdown/reboot stalls at different points each attempt.

An IBM System x3650 M4 in the same rack, at the same software level, reboots cleanly every time.

Revision history for this message
Dimitri John Ledkov (xnox) wrote : Re: [Bug 1623383] Re: Some restarts fail due to missing base devices

On 13 March 2017 at 16:58, Ian Thompson <email address hidden> wrote:
> I have the same problem with two Dell PowerEdge FC630.
> Ubuntu 16.04.2 LTS/ 4.4.0-66-generic
>
> Shutdown/reboot stalls at different points each attempt.
>
> An IBM System x3650 M4 in the same rack, at the same software level,
> reboots cleanly every time.
>

Could you please open a new bug report using the ubuntu-bug tool and
include the systemd journal covering the failed boots?
Just in case this is not the same issue, as so far we have only
reproduced this on the s390x architecture.
Please enable the persistent journal first by doing $ sudo mkdir -p /var/log/journal

--
Regards,

Dimitri.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Node s2lp6g003 (s390x.zKVM) booted to emergency mode. Please find the journalctl log attached.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

journalctl log for kernel02 (s390x.zVM), which drops to emergency mode as well

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Hi, we’re trying to explore different approaches to investigate, resolve and/or work around this issue, but regrettably there do not appear to be any simple solutions.

Would it be possible to record the failure rate during the testing for the next SRU cycle? Perhaps something simple like “5 failures in 50 redeployments over a 2 week period”?

Many thanks.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Here are the numbers from the beginning of this SRU cycle until today (since Jul. 3).

We have 6 jobs for each node, which requires at least 6 re-deployments (note that for s390x it is just a reboot, since we do not re-deploy these nodes).

The failure count here is how many times we needed to reboot a node manually.
The total job count on Jenkins is larger than the sum of failures + 6, as one failure may cause all the other jobs to fail and need to be restarted.

s2lp4 (Ubuntu on LPAR) - 1 failure, job count: 17
kernel02 (zVM) - 2 failures, job count: 16
s2lp6g003 (zKVM) - 6 failures, job count: 9
kernel03 (zKVM) - 7 failures, job count: 21

kernel03 is still running, so its failure and job counts will change.

Revision history for this message
Frank Heimes (fheimes) wrote :

Okay, two suggestions:
- Update zKVM (which is based on IBM KVM on z) to the latest and last release, 1.1.2 FP4.
  1.1.2 FP4 is the only one that is still supported. I can help with that.
- It might also be worth trying to limit the devices on the LPAR with the help of cio_ignore
  and see if that makes a difference, in case of a potential race condition (a sketch follows).
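
Using the device numbers from the zVM table above purely as an illustration (adjust to the LPAR's real configuration; condev and ipldev are kernel keywords for the console and IPL devices), such a cio_ignore line could look like:

cio_ignore=all,!condev,!ipldev,!0.0.0200-0.0.0201,!0.0.0600-0.0.0602,!0.0.e002,!0.0.e102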

Dan Streetman (ddstreet)
Changed in systemd (Ubuntu):
status: Confirmed → Won't Fix