Shutdown hangs in md kworker after "Reached target Shutdown."

Bug #1587142 reported by Benoît Thébaudeau
This bug affects 19 people

Affects: systemd (Ubuntu)
Status: Confirmed
Importance: Critical
Assigned to: Dimitri John Ledkov

Bug Description

I'm booting a fully patched 16.04 from an Intel Rapid Storage Technology enterprise RAID1 volume (ThinkServer TS140 with two SATA ST1000NM0033-9ZM drives, ext4 root partition, no LVM, UEFI mode).

If the RAID volume is recovering or resyncing for whatever reason, then `sudo systemctl reboot` and `sudo systemctl poweroff` work fine (I had to `sudo systemctl --now disable lvm2-lvmetad lvm2-lvmpolld lvm2-monitor` in order to consistently get that). However, once the recovery/resync is complete and clean, the reboot and poweroff commands above hang forever after "Reached target Shutdown.". Note that issuing `sudo swapoff -a` beforehand (suggested in bug #1464917) does not help.
[EDIT]Actually, the shutdown also hangs from time to time during a resync. But I've never seen it succeed once the resync is complete.[/EDIT]

Then, if the server has been forcibly restarted with the power button, the Intel Matrix Storage Manager indicates a "Normal" status for the RAID1 volume, but Ubuntu then resyncs the volume anyway:

[ 1.223649] md: bind<sda>
[ 1.228426] md: bind<sdb>
[ 1.230030] md: bind<sdb>
[ 1.230738] md: bind<sda>
[ 1.232985] usbcore: registered new interface driver usbhid
[ 1.233494] usbhid: USB HID core driver
[ 1.234022] md: raid1 personality registered for level 1
[ 1.234876] md/raid1:md126: not clean -- starting background reconstruction
[ 1.234956] input: CHESEN USB Keyboard as /devices/pci0000:00/0000:00:14.0/usb3/3-10/3-10:1.0/0003:0A81:0101.0001/input/input5
[ 1.236273] md/raid1:md126: active with 2 out of 2 mirrors
[ 1.236797] md126: detected capacity change from 0 to 1000202043392
[ 1.246271] md: md126 switched to read-write mode.
[ 1.246834] md: resync of RAID array md126
[ 1.247325] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 1.247503] md126: p1 p2 p3 p4
[ 1.248269] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[ 1.248774] md: using 128k window, over a total of 976759940k.

Note that the pain of a resync upon every (re)boot cannot even be slightly relieved by write-intent bitmaps, because mdadm does not support them for IMSM containers:

$ sudo mdadm --grow --bitmap=internal /dev/md126
mdadm: Cannot add bitmaps to sub-arrays yet

I also get this in syslog during boot when the individual drives are detected, but this seems to be harmless:

May 30 17:26:07 wssrv1 systemd-udevd[608]: Process '/sbin/mdadm --incremental /dev/sdb --offroot' failed with exit code 1.
May 30 17:26:07 wssrv1 systemd-udevd[608]: Process '/lib/udev/hdparm' failed with exit code 1.

May 30 17:26:07 wssrv1 systemd-udevd[606]: Process '/sbin/mdadm --incremental /dev/sda --offroot' failed with exit code 1.
May 30 17:26:07 wssrv1 systemd-udevd[606]: Process '/lib/udev/hdparm' failed with exit code 1.

During a resync, `sudo sh -c 'echo idle > /sys/block/md126/md/sync_action'` does stop it as expected, but it restarts immediately even though nothing seems to have triggered it:

May 30 18:17:02 wssrv1 kernel: [ 3106.826710] md: md126: resync interrupted.
May 30 18:17:02 wssrv1 kernel: [ 3106.836320] md: checkpointing resync of md126.
May 30 18:17:02 wssrv1 kernel: [ 3106.836623] md: resync of RAID array md126
May 30 18:17:02 wssrv1 kernel: [ 3106.836625] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
May 30 18:17:02 wssrv1 kernel: [ 3106.836626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
May 30 18:17:02 wssrv1 kernel: [ 3106.836627] md: using 128k window, over a total of 976759940k.
May 30 18:17:02 wssrv1 kernel: [ 3106.836628] md: resuming resync of md126 from checkpoint.
May 30 18:17:02 wssrv1 mdadm[982]: RebuildStarted event detected on md device /dev/md/Volume0

I attach screenshots of the hanging shutdown log, taken after a `sudo sh -c 'echo 8 > /proc/sys/kernel/printk'`. The second screenshot shows that the kernel has deadlocked in md_write_start(). Note that `sudo systemctl start debug-shell` is unusable on this machine at this point because Ctrl+Alt+F9 brings up tty9 without any working keyboard.
[EDIT]But I can still switch back to tty1.[/EDIT]

I have also tried with much lower values for vm.dirty_background_ratio and vm.dirty_ratio, but to no avail.
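
For reference, a sketch of how those were lowered; the values here are only examples of "much lower" settings, not necessarily the exact ones used:

```
# Example values only, to illustrate "much lower" dirty page writeback thresholds.
sudo sysctl -w vm.dirty_background_ratio=1
sudo sysctl -w vm.dirty_ratio=5
```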

Linux 4.6.0-040600-generic_4.6.0-040600.201605151930_amd64 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/ did not help either.

More information below:

$ lsb_release -rd
Description: Ubuntu 16.04 LTS
Release: 16.04

$ uname -a
Linux wssrv1 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:46 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ apt-cache policy systemd
systemd:
  Installed: 229-4ubuntu6
  Candidate: 229-4ubuntu6
  Version table:
 *** 229-4ubuntu6 500
        500 http://fr.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     229-4ubuntu4 500
        500 http://fr.archive.ubuntu.com/ubuntu xenial/main amd64 Packages

$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md126 : active raid1 sda[1] sdb[0]
      976759808 blocks super external:/md127/0 [2/2] [UU]
      [>....................] resync = 3.3% (32651584/976759940) finish=85.9min speed=183164K/sec

md127 : inactive sdb[1](S) sda[0](S)
      5288 blocks super external:imsm

unused devices: <none>

$ sudo mdadm -D /dev/md127
/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 2

Working Devices : 2

           UUID : e9bb2216:cb1bbc0f:96943390:bb65943c
  Member Arrays : /dev/md/Volume0

    Number   Major   Minor   RaidDevice

       0       8        0        -        /dev/sda
       1       8       16        -        /dev/sdb

$ sudo mdadm -D /dev/md126
/dev/md126:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 976759808 (931.51 GiB 1000.20 GB)
  Used Dev Size : 976759940 (931.51 GiB 1000.20 GB)
   Raid Devices : 2
  Total Devices : 2

          State : clean, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

  Resync Status : 5% complete

           UUID : 3d724b1d:ac75cddb:600ac81a:ccdc2090
    Number   Major   Minor   RaidDevice   State
       1       8        0        0        active sync   /dev/sda
       0       8       16        1        active sync   /dev/sdb

$ sudo mdadm -E /dev/sda
/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 92b6e3e4
         Family : 92b6e3e4
     Generation : 00000075
     Attributes : All supported
           UUID : e9bb2216:cb1bbc0f:96943390:bb65943c
       Checksum : 5ad6e3c8 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk00 Serial : Z1W50P5E
          State : active
             Id : 00000000
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)

[Volume0]:
           UUID : 3d724b1d:ac75cddb:600ac81a:ccdc2090
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [UU]
    Failed disk : none
      This Slot : 0
     Array Size : 1953519616 (931.51 GiB 1000.20 GB)
   Per Dev Size : 1953519880 (931.51 GiB 1000.20 GB)
  Sector Offset : 0
    Num Stripes : 7630936
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : repair
      Map State : normal <-- normal
     Checkpoint : 201165 (512)
    Dirty State : dirty

  Disk01 Serial : Z1W519DN
          State : active
             Id : 00000001
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)

$ sudo mdadm -E /dev/sdb
/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 92b6e3e4
         Family : 92b6e3e4
     Generation : 00000075
     Attributes : All supported
           UUID : e9bb2216:cb1bbc0f:96943390:bb65943c
       Checksum : 5ad6e3c8 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : Z1W519DN
          State : active
             Id : 00000001
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)

[Volume0]:
           UUID : 3d724b1d:ac75cddb:600ac81a:ccdc2090
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [UU]
    Failed disk : none
      This Slot : 1
     Array Size : 1953519616 (931.51 GiB 1000.20 GB)
   Per Dev Size : 1953519880 (931.51 GiB 1000.20 GB)
  Sector Offset : 0
    Num Stripes : 7630936
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : repair
      Map State : normal <-- normal
     Checkpoint : 201165 (512)
    Dirty State : dirty

  Disk00 Serial : Z1W50P5E
          State : active
             Id : 00000000
    Usable Size : 1953519880 (931.51 GiB 1000.20 GB)

Tags: xenial
Revision history for this message
Benoît Thébaudeau (btheb) wrote :

The shutdown now also hangs during a resync, so it behaves inconsistently. I have updated the subject and the description to reflect this.

summary: - Reboot hangs once RAID1 resynced
+ Shutdown hangs in md kworker after "Reached target Shutdown."
description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Changed in systemd (Ubuntu):
importance: Undecided → High
Revision history for this message
Sergio Callegari (callegar) wrote :

Possibly also related to bug #1320402 (as the unnecessary resync probably follows the incorrect stopping of the array).

Revision history for this message
Benoît Thébaudeau (btheb) wrote :

I still get the same issues with a fully updated Ubuntu Server 16.04.1 LTS (Linux 4.4.0-45-generic).

The patch from https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1320402/comments/13 does not seem to have any effect on these issues.

I have also tried the latest mainline kernel build from http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D , as suggested in https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1320402 :
Linux wssrv1 4.8.4-040804-generic #201610220733 SMP Sat Oct 22 11:35:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
With or without the patch indicated above, this kernel seems to fix the spurious verify issue (so far...), but not the hanging reboot and poweroff. This is an unmaintained kernel intended for debug purposes and not for production anyway.

I attach a hanging reboot log with Linux 4.8.4-040804-generic and the patch indicated above, with dynamic debug enabled in drivers/md/*. It hangs when remounting '/' read-only, still in the md kworker.

Then, I tried to install upstart-sysv to see if getting rid of systemd helps, and it does. With upstart-sysv, all the issues described here seem to be fixed with Linux 4.4.0-45-generic and no patches. This seems to confirm that these issues are caused by systemd. However, I don't know how reliable and well tested upstart-sysv is on Ubuntu Server 16.04.1 LTS, so this might be an issue for a production server. I would prefer to keep working with systemd.

Revision history for this message
Benoît Thébaudeau (btheb) wrote :

With upstart-sysv, I have now observed a spurious RAID1 verify following a reboot after an unrelated apt upgrade (libc stuff). It's the only spurious RAID1 verify that I have observed so far with upstart-sysv. So upstart-sysv is not perfect, but still better than systemd-sysv regarding these issues.

Also, after one reboot, I got my ext4 rootfs mounted ro for no apparent reason (no errors in the logs). Everything came back to normal after another reboot.

I think I will try Debian, CentOS, and maybe Windows.

Revision history for this message
Matt Wear (wearmg) wrote :

I have a similar problem with a Supermicro X10DRH system board. If I have the Intel SATA controller set to AHCI and install a fresh instance of 16.04, the server works without issue. If I have the Intel SATA controller in RSTe RAID mode with a 2-drive RAID1 configuration, the server hangs on reboot at the "Reached target Shutdown" message. Are there any logs that I can provide to assist in troubleshooting?
Thanks,
-Matt

Revision history for this message
Michael Cain (cainmp) wrote :

Confirming this same issue for me on a clean install of Ubuntu 16.04.2 x64 server on an HP 8300 Elite SFF PC. Anything I can collect to troubleshoot?

Changed in systemd (Ubuntu):
importance: High → Critical
assignee: nobody → Dimitri John Ledkov (xnox)
Revision history for this message
Jens (hoerbie) wrote :

Same problem with a Supermicro X11SSH-F board with Intel C236 chipset and two SATA disks.

Installing the server with BIOS SATA mode "RAID" and a RAID1 configured in the Intel RAID BIOS (Ctrl+I), the server doesn't restart or shut down, neither on a fresh Ubuntu install nor on an updated Ubuntu 16.04.2; it hangs at "Reached target Shutdown".

Installing the server with BIOS SATA mode "AHCI", i.e. without configuring anything in the Intel RAID BIOS, and instead configuring a RAID1 directly in Ubuntu as mdraid, restart and shutdown work fine as expected, both on a fresh install and with all updates to 16.04.2.
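
For illustration, creating such a native md RAID1 directly in Ubuntu might look roughly like this (a sketch only; /dev/sdX and /dev/sdY are placeholder device names, not taken from this report):

```
# Sketch only: build a native (non-IMSM) md RAID1 from two placeholder disks.
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX /dev/sdY
# Record the array and refresh the initramfs so it is assembled at boot.
sudo sh -c 'mdadm --detail --scan >> /etc/mdadm/mdadm.conf'
sudo update-initramfs -u
```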

May I ask if this bug will be solved in the future, and if there is a timeline? We planned to buy a lot of these servers, install Ubuntu, and put them in a data centre miles away, but we can't wait another year.

If someone in Germany is really able and willing to solve this bug, I would be willing to lend our server for some weeks.

Regards, Jens

Revision history for this message
Jim Nijkamp (jim-pa1jim) wrote :

Confirm on this bug with hardware:

HP Z230, Intel Raid bios, Ubuntu 16.04.2.

Revision history for this message
Roman Ledovskiy (roman.l) wrote :

Confirm this on:
Supermicro X10DAi board and ASUS ESC4000 G3 server
Ubuntu 16.04.2 LTS

Revision history for this message
Adam Lansky (ad4mcz) wrote :

Confirm on HPE ML10 gen9.
Ubuntu 16.04.2 LTS
Intel SATA RAID - RAID 1. Intel SSD.

Revision history for this message
max moro (mm101) wrote :

confirmed on
HP ProLiant ML10 Gen9, Xeon E3-1225 v5
Intel Raid Bios set to RAID1 (4 drives, 2xraid1)

Shutdown/reboot hangs (kworkers), and the RAID is rebuilt on every "reset".

Linux w11 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description: Ubuntu 16.04.3 LTS
Release: 16.04

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md124 : active raid1 sdc[1] sdd[0]
      2930264064 blocks super external:/md125/0 [2/2] [UU]
      [=====>...............] resync = 27.1% (795895744/2930264196) finish=190.3min speed=186838K/sec

md125 : inactive sdd[1](S) sdc[0](S)
      4776 blocks super external:imsm

md126 : active raid1 sda[1] sdb[0]
      976759808 blocks super external:/md127/0 [2/2] [UU]
      [===============>.....] resync = 75.0% (733104704/976759940) finish=28.2min speed=143955K/sec

md127 : inactive sda[1](S) sdb[0](S)
      5288 blocks super external:imsm

unused devices: <none>

Revision history for this message
Dr. Thomas Orgis (drthor) wrote :

This very much looks like this ancient issue in RedHat:

https://bugzilla.redhat.com/show_bug.cgi?id=752593

It got fixed at some point in 2012 …

https://bugzilla.redhat.com/show_bug.cgi?id=785739

Is Ubuntu still running mdmon from inside the root fs and killing it before r/o remount?

I am now pondering what to do with a set of file servers running 16.04 and root on an IMSM RAID. I got a number of systems with similar hardware running CentOS 7 just fine with no issue on reboot.

I see elaborate handling of mdmon in systemd units on CentOS … including the --offroot command line parameter that I don't see documented anywhere. KillMode=none might also play a role in the mdmon@.service file.
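
For illustration, such a unit looks roughly like the one shipped with upstream mdadm; this is only a sketch, not the exact CentOS file, with the directive relevant to this bug commented:

```
# Sketch only, modelled on the unit shipped with upstream mdadm
# (not copied from CentOS).
[Unit]
Description=MD Metadata Monitor on /dev/%I
DefaultDependencies=no

[Service]
ExecStart=/sbin/mdmon --foreground %I
# The important part: never let systemd kill mdmon when the unit is stopped,
# so it can still mark the external-metadata array clean at the end of shutdown.
KillMode=none
```

As I understand it, the --offroot flag mentioned above just prefixes argv[0] with "@", which is the marker systemd-shutdown uses to spare root-storage daemons.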

Revision history for this message
Dr. Thomas Orgis (drthor) wrote :

So I got a simple fix for being able to reboot again:

```
update-rc.d mdadm disable
```

With that, the mdmon instance from the initrd persists (with @sbin/mdmon as argv[0]) and is not killed by systemd, so systemd's reboot does not hang. But: since mdmon is now not killed at all, the array always gets a fresh resync on each boot.
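
A quick way to check for that surviving instance (a sketch):

```
# List running mdmon processes; an argv[0] of "@sbin/mdmon" indicates the copy
# started from the initramfs, which systemd-shutdown is supposed to spare.
ps axwwo pid,args | grep '[m]dmon'
```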

We really, really need to implement the return-to-initrd practice to be able to bring the system down in the proper reverse order of bringing it up.

(So the initrd filesystem has to be kept in memory the whole time … wasting some memory with today's huge initrds. It would also be nice if mdmon were just part of the kernel, as it should be.)

Revision history for this message
max moro (mm101) wrote :

... interesting, but the pain is the new resync ... (writing 4(8) TB on each "hanging" reboot, ehm ...)

I used SysRq to sync before power-off:

echo 1 > /proc/sys/kernel/sysrq

ALT + "DRUCK" (the German "Print Screen" key) + R E I S U O

as a workaround. As I have to wait for the sync, I can't report results yet.
Of course that's no fun for remote machines.

P.S. The servers were on CentOS before; I did not have any problems there?!

m.

Revision history for this message
Dr. Thomas Orgis (drthor) wrote :

Well, of course the effects are a pain. The sync is only 256 GB of SSDs in my case.

The main point is that the fix has been known for many years and should hopefully be quickly adaptable to Ubuntu. Or not … if it is really necessary to switch to returning to the initrd like Fedora does.

I wonder if 17.04 has this fixed. Can someone comment on that? Is the proper initrd buildup/teardown implemented there? I cannot test this on my servers as they are needed in production.

Revision history for this message
Dr. Thomas Orgis (drthor) wrote :

Is anyone working on this? I see that the bug is assigned, but apart from that only messages from affected users.

Revision history for this message
max moro (mm101) wrote :

Shutting down with the "R E I S U O" workaround was OK. The RAID was not rebuilt.
m.

Revision history for this message
max moro (mm101) wrote :

A BIOS reconfiguration from "RAID" to "AHCI" did not change anything.
m.

Revision history for this message
max moro (mm101) wrote :

As I don't want to rebuild again right now, did anyone try e.g.

chmod +x ~/myshutdown.sh
sudo ~/myshutdown.sh

# with ~/myshutdown.sh consisting of:

echo r > /proc/sysrq-trigger
echo e > /proc/sysrq-trigger
echo s > /proc/sysrq-trigger
echo i > /proc/sysrq-trigger
echo u > /proc/sysrq-trigger
echo o > /proc/sysrq-trigger

Revision history for this message
Dr. Thomas Orgis (drthor) wrote :

Can we get any reaction from Ubuntu on this? Is the needed reworking of the initrd about to happen for the LTS or is this a WONTFIX and root-on-Intel-Matrix-RAID is simply not supported? It is clear what has to happen to make things work again with systemd and mdadm/mdmon. Will it happen?

Revision history for this message
Marek (cmarcox) wrote :

I would just like to inform you that this bug has appeared since the latest CentOS 7.4 update; 7.3 was OK.
https://bugs.centos.org/view.php?id=13916

Maybe it should be reported to the systemd issue tracker?

Revision history for this message
Dr. Thomas Orgis (drthor) wrote :

Oh … this is _another_ bug. We are dealing here with the situation where mdmon controlling a rootfs on RAID is not handled at all by the Ubuntu initrd, while this new CentOS issue is a bug in said handling in the CentOS initrd …

The reference is valuable nevertheless … not least because I have systems about to be upgraded to CentOS 7.4 that might get hit by this bug!

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Hello,

as part of https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1722491 there is currently an mdadm update available from xenial-proposed and zesty-proposed that might resolve this issue.

To test that solution please perform the following:

1) Install mdadm from xenial-proposed/zesty-proposed
   - See https://wiki.ubuntu.com/Testing/EnableProposed
   - Or download & install packages from
xenial https://launchpad.net/ubuntu/+source/mdadm/3.4-4ubuntu0.1/+build/13596415
zesty https://launchpad.net/ubuntu/+source/mdadm/3.3-2ubuntu7.5/+build/13596431

2) $ sudo apt install dracut-core

3) $ sudo systemctl enable mdadm-shutdown.service

4) $ sudo systemctl start mdadm-shutdown.service

After this, the expectation is for shutdowns/reboots to perform a clean shutdown, maintaining the RAID array in a synced state, such that it comes up clean.
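
Put together, the test sequence is roughly as follows (a sketch only, assuming the proposed pocket has already been enabled as described in the wiki page above):

```
# Sketch: assumes xenial-proposed is already enabled (see step 1).
sudo apt update
sudo apt install -t xenial-proposed mdadm     # updated mdadm from proposed
sudo apt install dracut-core                  # step 2
sudo systemctl enable mdadm-shutdown.service  # step 3
sudo systemctl start mdadm-shutdown.service   # step 4
```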

Please let me know if above resolves shutdown/reboot issues for you.

Regards,

Dimitri.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Just a quick clarification: the first boot/shutdown will still not be clean, but subsequent ones (those booted with the updated mdadm package) should be clean.

Revision history for this message
John Center (john-center) wrote :

I added a comment about this new fix on https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1608495. One thing I wanted to add here: for the first time, shutdown completed successfully. In fact, I was a little startled when it happened. :-) It powered off at the end of shutdown instead of hanging at the "Reached target Shutdown" message.

One question: Does this mean that Ubuntu 18.04 will use mdadm by default during installation instead of dmraid? I really hope so. It's a real pain to do it manually.

Revision history for this message
Jens (hoerbie) wrote :

I followed the steps from #27 and everything works fine now on 16.04.3 with a Supermicro X11SSH-F board with the Intel C236 chipset's RST and two SATA disks as a RAID 1.

I can also confirm #28: the first reboot still needed a hard reset and a rebuild of the RAID 1, but after that, every (>10) shutdown and reboot was OK.

And the solution is still alive after an update to the newest 4.4.0-116 kernel today.

Sorry for the testing delay, but after missing a solution for too long, we installed our servers a different way; only this week did I get a new testing machine, to use until 18.04 is out.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

john-center - I believe mdadm is used by default, instead of dmraid. However, I'm not sure if we correctly activate Intel RAIDs with mdadm. I have not done an install with either 16.04 or 18.04, or at least started the install up to the partitioning screen. My expectation is that if one has RAID arrays pre-configured in the BIOS (Ctrl+I on boot), they should show up as assembled and be offered for autopartitioning, with both 16.04 and 18.04.

Revision history for this message
John Center (john-center) wrote : Re: [Bug 1587142] Re: Shutdown hangs in md kworker after "Reached target Shutdown."

Hi Dimitri,

I know with 16.04 it wasn’t used by default. I had to do a lot of manipulation to set up the raid array before I could do the install. It used dmraid once it detected the imsm raid. I removed dmraid completely, then installed mdadm, assembled the array & ran the installation program again. I was hoping that mdadm would just do it all instead.

    -John

Revision history for this message
Joshua Diamant (joshdi) wrote :

I am having this issue using IMSM / VROC 6.2 on Ubuntu 18.04 LTS, Kernel 5.3.0-28, mdadm - v4.1-rc1 - 2018-03-22.

I am running a bcache cache device on top of one of the RAID 1 IMSM (VROC) arrays. At the very minimum, the device needs to resync after every reboot (it is not shutting down cleanly).

What other information is required to help us debug this issue?
