multipath shows '#:#:#:#' for iscsi device after error injection

Bug #1815599 reported by bugproxy
Affects                   Status   Importance  Assigned to  Milestone
Ubuntu on IBM z Systems   Invalid  Medium      bugproxy
multipath-tools (Ubuntu)  Invalid  High        Unassigned

Bug Description

Problem Description:
After error injection (resetting one node of the storage), 1 of the 4 LUNs shows '#:#:#:#' for half of its paths

---uname output---
root@ilzlnx4:~# uname -a
Linux ilzlnx4 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:43:05 UTC 2018 s390x s390x s390x GNU/Linux

Machine Type = s390x

---iSCSI initiator---
root@ilzlnx4:~# dpkg -l | grep iscsi
ii open-iscsi 2.0.874-5ubuntu2.6 s390x iSCSI initiator tools

---Debugger---
A debugger is not configured

---Steps to Reproduce---
1 Map 4 LUNs via open-iscsi from SVC (see the command sketch after this list)
2 Run I/O on these LUNs
3 Run the storage-side error injection 'node reset' on the SVC (started at about 2019/02/11 05:14)
4 Half of one LUN's paths show '#:#:#:#' and never recover without manual intervention
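
A minimal sketch of how steps 1 and 2 might be driven on the initiator side; the portal address and target IQN are hypothetical placeholders, and the WWID in the fio call is one of those from the multipath -ll output below:

# Discover and log in to the SVC target (portal and IQN are placeholders)
iscsiadm -m discovery -t sendtargets -p 192.0.2.10:3260
iscsiadm -m node -T iqn.1986-03.com.ibm:2145.example.node1 -p 192.0.2.10:3260 --login

# Drive I/O against one of the resulting multipath devices
fio --name=mpio-io --filename=/dev/mapper/3600507638085814a9800000000000009 \
    --rw=randrw --bs=4k --ioengine=libaio --direct=1 --iodepth=16 \
    --runtime=600 --time_based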

[2019/02/11 05:53:13] INFO send: multipath -ll | cat
[2019/02/11 05:53:29] INFO
3600507638085814a980000000000000a dm-3 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 4:0:0:3 sdr 65:16 active ready running
| `- 6:0:0:3 sdu 65:64 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:3 sdh 8:112 active ready running
  `- 2:0:0:3 sdl 8:176 active ready running
3600507638085814a9800000000000009 dm-4 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:1 sdf 8:80 active ready running
| `- 2:0:0:1 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
  |- #:#:#:# sdo 8:224 active faulty running
  `- #:#:#:# sds 65:32 active faulty running
3600507638085814a9800000000000008 dm-2 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 4:0:0:2 sdq 65:0 active ready running
| `- 6:0:0:2 sdt 65:48 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 2:0:0:2 sdk 8:160 active ready running
  `- 1:0:0:2 sdg 8:96 active ready running
3600507638085814a9800000000000006 dm-5 IBM,2145
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 4:0:0:0 sdm 8:192 active ready running
| `- 6:0:0:0 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:0 sde 8:64 active ready running
  `- 2:0:0:0 sdi 8:128 active ready running

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-175431 severity-medium targetmilestone-inin---
bugproxy (bugproxy)
tags: added: targetmilestone-inin18041
removed: targetmilestone-inin---
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1815599/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
Frank Heimes (fheimes) wrote :

I'm not sure what the exact problem is that's reported here.
Is it the string "#:#:#:#" itself that replaces the disk ID in case of an error with the disk/path?
And if so - which problem is it causing then?

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-02-13 02:58 EDT-------
These affected paths cannot recover by themselves for a long time; if we then lose the other 2 healthy paths, data will be lost

Frank Heimes (fheimes)
affects: ubuntu → multipath-tools (Ubuntu)
tags: added: rls-dd-incoming
Changed in multipath-tools (Ubuntu):
importance: Undecided → High
Revision history for this message
Frank Heimes (fheimes) wrote :

Some configurations can run into this and similar issues; please have a look here:
http://public.dhe.ibm.com/software/dw/linux390/lvc/zFCP_Best_Practices-BB-Webcast_201805.pdf
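
As an illustration of the kind of tunables such guides cover, the SCSI-layer failover behaviour can be adjusted in /etc/multipath.conf; the values below are hypothetical examples, not recommendations taken from the linked document:

# /etc/multipath.conf - illustrative sketch only, values are placeholders
defaults {
    # fail I/O on a broken path quickly so multipath can switch over
    fast_io_fail_tmo 5
    # keep the SCSI device around long enough for the path to return
    dev_loss_tmo 600
    # keep queueing I/O while no path is available
    no_path_retry queue
}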

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

@JFH - As just discussed, you mentioned that you have already gone through the existing tunables with Heinz Werner. Thanks for linking these here again to make sure that Shixm knows.

Somewhat reminds me of bug 1540407 - but those changes have been in Ubuntu since 16.04.
Same for the even older bug 1374999.
Your kernel and open-iscsi versions indicate that you are on Bionic - is that correct?
Multipath tools should be on

Your repro of:
  > Run storage side error inject 'node reset' for SVC
isn't clear to me; I neither have a storage server on which I am allowed to inject errors, nor the tools/UI to control it.

Instead I have tried the repro steps that are available to me, as in:
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1540407/comments/7
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1540407/comments/8

But all of them worked, see below.
Note that I never reached the loss of the path info in the faulty state ('#:#:#:#') - even in the faulty state it knew the path info.

@Shixm - since you have a setup that can reproduce this, can you try whether any of the later releases (Cosmic/Disco) already resolve the issue that you are seeing? Then we could try to hunt down which changes might have resolved it for you instead of assuming this would need a totally new change.

@Shixm - Any way to reproduce this without 'node reset' for SVC?

Finally this might as well need subject matter expertise - can we make sure that IBMs zfcp experts (Devs and maybe Thorsten who drove the old bugs) are subscribed on the mirrored bug 175431?
@JFH - do you think you can check that with the IBM team?

------------
Test results when retrying to trigger the issue:

Approach #1 gives me this (which isn't exactly the same state)
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:0:1073954852 sdb 8:16 active faulty offline
  `- 0:0:1:1073954852 sdj 8:144 active ready running
If I add it back after this, it works just fine again:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:1:1073954852 sdj 8:144 active ready running
  `- 0:0:0:1073954852 sdb 8:16 active ready running
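
For reference, a minimal sketch of one generic way to force a single path into a faulty/offline state and back from the initiator side; this is an assumption about the kind of manipulation used, not necessarily the exact steps from the linked comments (sdb is the path that shows up as faulty above):

# force one path device offline
echo offline > /sys/block/sdb/device/state

# observe 'multipath -ll' reporting that path as faulty/offline

# bring the path back and let multipathd pick it up again
echo running > /sys/block/sdb/device/state
multipath -ll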

The second approach (disable, sleep, enable of the adapter); first, check the zfcp config:
lszdev -t zfcp-host
DEVICE TYPE zfcp
  Description : SCSI-over-Fibre Channel (FCP) devices and SCSI devices
  Modules : zfcp
  Active : yes
  Persistent : yes

  ATTRIBUTE ACTIVE PERSISTENT
  allow_lun_scan "1" "1"
  datarouter "1" -
  dbflevel "3" -
  dbfsize "4" -
  dif "0" -
  no_auto_port_rescan "0" -
  port_scan_backoff "500" -
  port_scan_ratelimit "60000" -
  queue_depth "32" -

Init...
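
For context, a minimal sketch of how the disable/sleep/enable of an FCP adapter is typically done on s390x; the device bus-ID 0.0.1800 is a hypothetical placeholder:

# disable the FCP adapter (bus-ID is a placeholder)
chccwdev -d 0.0.1800
sleep 60
# re-enable it and check that the paths come back
chccwdev -e 0.0.1800
multipath -ll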


Changed in multipath-tools (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The loss of that path info suggests that multipath lost its knowledge about that path.
I found the very old bug 1032550 - the assumption there was that it would eventually be a kernel bug due to sysfs losing the information multipath might need. But back then the actions on the bug stopped (I have no insight why, as I wasn't around at the time).

Even more so than already before I'd be happy to have the IBM zfcp Devs subscribed as they surely know better about that part of the multipath/zfcp stack.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Triaged
assignee: nobody → Canonical Server Team (canonical-server)
importance: Undecided → Medium
Revision history for this message
Frank Heimes (fheimes) wrote :

I'll ask to get the SMEs on zFCP pulled into this topic, too - so that this can be looked at from different angles ...

Revision history for this message
Frank Heimes (fheimes) wrote :

Changing to Incomplete until feedback from IBM SMEs ...

Changed in ubuntu-z-systems:
status: Triaged → Incomplete
assignee: Canonical Server Team (canonical-server) → bugproxy (bugproxy)
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-07-15 05:42 EDT-------
The problem could not be reproduced within our last cycle of testing on Ubuntu 18.04.2 with 4.15.0-51-generic.
This problem will be closed. A new bugzilla will be opened if the problem comes up again. Thx

------- Comment From <email address hidden> 2019-07-15 05:43 EDT-------
IBM bugzilla status -> closed, not reproducible with Ubuntu 18.04.2

Frank Heimes (fheimes)
Changed in multipath-tools (Ubuntu):
status: Incomplete → Invalid
Changed in ubuntu-z-systems:
status: Incomplete → Invalid