The created partitions can be seen in the system only partly when partitioning 256 FCP devices with multipathing

Bug #1571707 reported by bugproxy
This bug affects 1 person
Affects                    Status         Importance   Assigned to   Milestone
Ubuntu on IBM z Systems    Fix Released   Medium       Unassigned
multipath-tools (Ubuntu)   Fix Released   Undecided    Unassigned

Bug Description

== Comment: #0 - IRINA YASHINA - 2016-04-15 07:14:54 ==
During limit testing, the script from limit.tar.gz was used to partition 256 FCP devices;
one partition per device should be created ("./limit.sh fcp_partition").

Only some of the created partitions are visible in the system (e.g. 89 instead of 256),
both in /proc/partitions and in /dev/disk/by-path.

For example, in /dev/disk/by-path for LUN 40244000 there is no partition sddq1:

# ll /dev/disk/by-path | grep 40244000
lrwxrwxrwx 1 root root 10 Apr 14 22:03 ccw-0.0.1800-fc-0x5005076303135401-lun-0x4024400000000000 -> ../../sddq
lrwxrwxrwx 1 root root 10 Apr 14 20:58 ccw-0.0.1800-fc-0x5005076303535401-lun-0x4024400000000000 -> ../../sdnv
lrwxrwxrwx 1 root root 10 Apr 14 20:58 ccw-0.0.1840-fc-0x5005076303135401-lun-0x4024400000000000 -> ../../sdxo
lrwxrwxrwx 1 root root 11 Apr 14 20:58 ccw-0.0.1840-fc-0x5005076303535401-lun-0x4024400000000000 -> ../../sdahh

It also can't be seen in the output of "cat /proc/partitions":
 . . .
  71 128 1048576 sddq
  71 144 1048576 sddr
  71 160 1048576 sdds
  71 161 1047552 sdds1
  71 176 1048576 sddt
  71 192 1048576 sddu
  71 193 1047552 sddu1
 . . .

But fdisk shows that the partition sddq1 was created:

root@s35lp12:~/limit# fdisk -l | grep /sddq
Disk /dev/sddq: 1 GiB, 1073741824 bytes, 2097152 sectors
/dev/sddq1 2048 2097151 2095104 1023M 83 Linux

The outputs of the commands 'cat /proc/partitions' and 'fdisk -l' are attached.

The number of partitions visible in the system after partitioning 256 FCP devices differs from run to run.

When partitioning 30 FCP devices, all the created partitions are reflected in the system correctly.
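
The mismatch described above (whole disks listed without their first partition) can be enumerated mechanically. The following is only a minimal sketch, not part of limit.sh: the function name disks_without_partitions is hypothetical, and it assumes SCSI disk names of the form sdX with partitions named sdXN, as in the excerpt above.

```shell
# Hypothetical helper (not part of limit.sh): print whole disks that have
# no partition entry in a /proc/partitions-style listing.
disks_without_partitions() {
    awk '
        # Column 4 is the device name; whole SCSI disks look like "sddq".
        $4 ~ /^sd[a-z]+$/ { disks[$4] = 1 }
        # Partitions look like "sdds1"; strip the number to get the parent disk.
        $4 ~ /^sd[a-z]+[0-9]+$/ {
            parent = $4
            sub(/[0-9]+$/, "", parent)
            has_part[parent] = 1
        }
        END { for (d in disks) if (!(d in has_part)) print d }
    ' "${1:-/proc/partitions}" | sort
}
```

Called without arguments it reads /proc/partitions; a saved copy can be passed as the first argument. It only reports what the kernel knows, so 'fdisk -l' is still needed to confirm that a partition table exists on the reported disks.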

== Comment: #3 - IRINA YASHINA - 2016-04-18 04:03:40 ==
Additional info:

Investigation shows that the script hangs for 256 FCP disks.
The incorrect and varying number of devices in /proc/partitions and /dev/disk/by-path
is a result of this hang.
The system can be repaired by using partx and kpartx.

The problem is connected with the parallel execution of commands in the script.
When parallel execution is removed from the script for partitioning FCP disks,
there is no hang and all the data in /proc/partitions and /dev/disk/by-path are correct.
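
A rough sketch of the serial repair path mentioned above, under stated assumptions: the report names partx and kpartx but not the options used, so 'partx -u' (update the kernel's view of a disk) and 'kpartx -a' (add partition mappings for a multipath map) are guesses at a plausible invocation, and all function names are hypothetical. Setting DRY_RUN=1 only prints the commands.

```shell
# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ask the kernel to pick up the partition table of one SCSI disk (assumed flag: -u).
rescan_disk() { run partx -u "/dev/$1"; }

# Create partition mappings for one multipath map (assumed flag: -a).
rescan_map() { run kpartx -a "/dev/mapper/$1"; }

# Process disks strictly one after another and wait for udev between
# steps, instead of starting all rescans in parallel.
rescan_all_serial() {
    for disk in "$@"; do
        rescan_disk "$disk"
        run udevadm settle
    done
}
```

For example, 'DRY_RUN=1 rescan_all_serial sddq sddr' prints the commands that would be run, one disk at a time. Without DRY_RUN the commands need root privileges.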

bugproxy (bugproxy) wrote : Archive used for the Limit Testing

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-140368 severity-medium targetmilestone-inin1604
bugproxy (bugproxy) wrote : Output of the command 'cat /proc/partitions'

Default Comment by Bridge

bugproxy (bugproxy) wrote : Output of the command 'fdisk -l'

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Kevin W. Rudd (kevinr)
affects: ubuntu → multipath-tools (Ubuntu)
Dimitri John Ledkov (xnox) wrote :

When things hang, are there any errors in dmesg, e.g. udev missing events and/or the kernel failing to re-read partition tables? If fdisk claims to have created the partition, yet the kernel has no knowledge of it, one of these conditions may have occurred. Also, do the "created by fdisk yet missing" partitions show up after a reboot and activation of those devices?

bugproxy (bugproxy)
tags: added: targetmilestone-inin16041
removed: targetmilestone-inin1604
bugproxy (bugproxy) wrote : dmesg when hanging

------- Comment on attachment From <email address hidden> 2016-04-22 11:29 EDT-------

I didn't see any errors in dmesg; please see the attachment.

After a reboot and setting FCP online ('./limit.sh fcp_online'), some "created by fdisk yet missing" partitions
show up, but others disappear.
The total number of partitions visible in /proc/partitions and in /dev/disk/by-path
decreased from 86 (before reboot) to 11 (after reboot).
'fdisk -l' shows 256 partitions.

Please note that without parallel execution there is no hang and all the data are valid.

dann frazier (dannf)
Changed in ubuntu-z-systems:
importance: Undecided → Medium
Dimitri John Ledkov (xnox) wrote :

We have experienced races in bringing up devices and partitions before.

During work on automated preseed, I'm not convinced that echoing into /sys/bus/ccw is a sufficiently race-free action. Have you considered tweaking the script to use the chzdev tooling instead of echoing things into sysfs and/or using chccwdev?

And if using parallel mode, do use the active configuration only, i.e. the chzdev -a -e options.
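
As an illustration of that suggestion, here is a small sketch that derives a chzdev call from a /dev/disk/by-path name like the ones in the bug description. It assumes the zfcp-lun device ID has the form bus-id:WWPN:LUN and that 'chzdev -e -a zfcp-lun ID' is an acceptable invocation; both function names are hypothetical, and the command is only printed, not executed.

```shell
# Hypothetical helper: convert a /dev/disk/by-path name into the
# <bus-id>:<wwpn>:<lun> form assumed for zfcp-lun device IDs.
bypath_to_zfcp_lun() {
    printf '%s\n' "$1" |
        sed -n 's/^ccw-\([0-9a-f.]*\)-fc-\(0x[0-9a-f]*\)-lun-\(0x[0-9a-f]*\)$/\1:\2:\3/p'
}

# Print (not run) the chzdev call that would enable one LUN in the
# active configuration only (-a), leaving the persistent one untouched.
print_chzdev_enable() {
    id=$(bypath_to_zfcp_lun "$1")
    if [ -n "$id" ]; then
        echo "chzdev -e -a zfcp-lun $id"
    fi
}
```

Names that do not match the ccw-...-fc-...-lun-... pattern produce no output, so partition links or other by-path entries are skipped silently.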

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-05-23 10:45 EDT-------
I tried to use 'chzdev -a' option instead of 'chccwdev' in the script.
This doesn't solve the problem with parallel execution,
the script hangs for 256 FCP disks.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Triaged
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-08-26 16:43 EDT-------
This bug has been quiet for too long. Do you have any status or updates to share?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-10-27 13:28 EDT-------
(In reply to comment #13)
> I tried to use 'chzdev -a' option instead of 'chccwdev' in the script.
> This doesn't solve the problem with parallel execution,
> the script hangs for 256 FCP disks.

Hi Canonical.

Are there any further recommendations or questions related to the last update regarding using chzdev instead?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-13 09:29 EDT-------
New Target 17.04...after discussion with Canonical

tags: added: targetmilestone-inin1704
removed: targetmilestone-inin16041
Dimitri John Ledkov (xnox) wrote :

hws, no work has been done on this ticket, and there is neither a target nor a priority for resolving this issue in 17.04. The problem in question is most likely an upstream bug deeper in the stack, e.g. in the kernel, in multipath-tools, or in a combination thereof. I do hope it is an architecture-independent issue and not a bug inside e.g. the zfcp drivers. Please remove any targets. The bug is also not driven by any end-customer systems or software, but was discovered via testing for the sake of testing, and thus shouldn't be a priority for your teams to fix either.

Changed in multipath-tools (Ubuntu):
milestone: none → later
Szymon Scholz (quomoow)
Changed in ubuntu-z-systems:
status: Triaged → Fix Released
Changed in multipath-tools (Ubuntu):
status: New → Fix Released
Changed in ubuntu-z-systems:
status: Fix Released → Triaged
Changed in multipath-tools (Ubuntu):
status: Fix Released → New
Changed in multipath-tools (Ubuntu):
status: New → Incomplete
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → Incomplete
Changed in multipath-tools (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → nobody
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-17 04:03 EDT-------
Canonical, can you please provide an update for this LP and advise how to proceed here? Many thanks in advance.

Frank Heimes (fheimes) wrote :

Whilst the symptoms are clearly demonstrated, the root cause has not been triaged.

This may still be a regression in the kernel & userspace stacks, or not.

It is not clear how this can be bisected. All Ubuntu releases so far have been affected, and internally we do not have access to this many devices accessible by a single system, neither on z nor on other architectures.

It would be useful to request IBM QA to rerun this particular testcase on 18.04 LTS to check if things are still just as broken, or have become better or worse. This will at least give us the trend line for this stress test.

Christian Ehrhardt  (paelzer) wrote :

Hi Frank,
do you think a test like [1] would help?
It is not (z)FCP, but you could at least throw a nearly unlimited number of disks at multipath and check whether it is an issue.

[1]: https://git.launchpad.net/ubuntu/+source/multipath-tools/tree/debian/tests/tgtbasedmpaths?h=applied/ubuntu/devel

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-08 11:49 EDT-------
(In reply to comment #24)
> It would be useful to request IBM QA to rerun this particular testcase on
> 18.04 LTS to check if things are still just as broken, or have become better
> or worse. This will at least give us the trend line for this stress test.

Retested on Ubuntu 16.04.3, kernel 4.4.0-96-generic #119-Ubuntu SMP Tue Sep 12 14:59:56 UTC 2017
-> no problem found with 1024 LUNs, 2 paths per LUN.

Retested on Ubuntu 18.04, kernel 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:14:23 UTC 2018
-> no problem found with 1024 LUNs, 2 paths per LUN.

I suggest closing this bug.

Changed in multipath-tools (Ubuntu):
status: Incomplete → Fix Released
Changed in ubuntu-z-systems:
status: Incomplete → Fix Released
Frank Heimes (fheimes) wrote :

Many thx Thorsten!

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-06-11 06:33 EDT-------
IBM bugzilla status closed; Fixed and verified.
