Multipath devices not removed with high load
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
multipath-tools (Ubuntu) | Fix Released | Undecided | Unassigned | |
Focal | Fix Released | Medium | Jorge Merlino | |
Bug Description
[Impact]
On a server with a high volume of multipath volume creation and teardown, a race condition can occur that leaves behind a multipath volume that should have been removed. The leftover volume has either no devices or only failed or unknown devices.
In particular, this occurs when a multipath device is removed during, or immediately before, the call to check_path(). A missing multipath device will cause update_
If the path is up, this state will cause reinstate_path() to be called, which will also fail. This will trigger a reload, restoring the recently removed device in an invalid state.
The command "multipathd show config" fails with "timeout receiving packet".
The command "multipath -ll" shows errors and names/paths as '#' (unexpected):
Sep 28 20:31:20 | 65:208: cannot find block device
Sep 28 20:31:20 | 65:208: Empty device name
Sep 28 20:31:20 | 65:208: Empty device name
Sep 28 20:31:20 | 65:240: cannot find block device
Sep 28 20:31:20 | 65:240: Empty device name
Sep 28 20:31:20 | 65:240: Empty device name
Sep 28 20:31:20 | 65:224: cannot find block device
Sep 28 20:31:20 | 65:224: Empty device name
Sep 28 20:31:20 | 65:224: Empty device name
Sep 28 20:31:20 | 66:0: cannot find block device
Sep 28 20:31:20 | 66:0: Empty device name
Sep 28 20:31:20 | 66:0: Empty device name
3600a098038
size=19G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy=
| |- #:#:#:# - #:# failed undef unknown
| `- #:#:#:# - #:# failed undef unknown
`-+- policy=
|- #:#:#:# - #:# failed undef unknown
`- #:#:#:# - #:# failed undef unknown
[Test Plan]
This can be reproduced by one of our customers on a Kubernetes installation using NetApp Trident for storage orchestration.
Continuously add and remove all the individual paths for storage devices in
order to create and remove multipath devices managed by multipathd, while
checking the output of `multipath -ll`.
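The checking step can be partly automated. The following is a minimal Python sketch, not part of the original test plan, that scans `multipath -ll` output for the two symptoms shown in [Impact]: the `cannot find block device` / `Empty device name` messages, and path lines whose identifiers print as `#:#:#:#`. The names `find_broken_paths` and `check_once` and both regular expressions are assumptions for illustration, derived only from the sample output in this report.

```python
import re
import subprocess
from typing import List

# Path lines in the broken state show '#' placeholders instead of the
# H:C:T:L address and major:minor numbers, e.g.
# "| |- #:#:#:# - #:# failed undef unknown".
BROKEN_PATH_RE = re.compile(r"#:#:#:#\s+\S+\s+#:#\s")

# Error messages printed ahead of the map listing, e.g.
# "Sep 28 20:31:20 | 65:208: cannot find block device".
ERROR_RE = re.compile(
    r"\|\s*\d+:\d+: (?:cannot find block device|Empty device name)\s*$"
)


def find_broken_paths(output: str) -> List[str]:
    """Return the lines of `multipath -ll` output that match the symptoms."""
    return [
        line
        for line in output.splitlines()
        if BROKEN_PATH_RE.search(line) or ERROR_RE.search(line)
    ]


def check_once() -> List[str]:
    """Run `multipath -ll` (usually needs root) and return suspicious lines."""
    result = subprocess.run(
        ["multipath", "-ll"], capture_output=True, text=True, check=False
    )
    # Assumption: the error messages may arrive on either stream, so scan both.
    return find_broken_paths(result.stdout + result.stderr)
```

Calling `check_once()` in a loop while paths are being added and removed gives a quick signal: an empty list means none of these symptoms were seen in that sample.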
[Where problems could occur]
We are merging two patches here (both done by the same person on the same day).
- One just changes the return value of some functions from a boolean to symbolic codes to differentiate errors.
- The other one uses these new error codes to change the behavior of the check_path() function.
These changes mostly affect the error paths so this should not break anything
when the code works without errors, but theoretically might also affect the
normal path of periodically checking path status -- fortunately this runs very
often, thus regressions should be easy and quick to spot during tests.
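To illustrate why the first patch is a prerequisite for the second, here is a minimal Python sketch of the idea. The real code is C inside multipath-tools; every name below (`MapStatus`, `query_map`, `handle_path_check`) and every returned action string is hypothetical and does not reflect the actual upstream identifiers, signatures, or behavior. The point is only that a boolean result cannot distinguish "the map no longer exists" from "the query failed", whereas symbolic codes let the caller avoid reinstating a path on a map that has been removed.

```python
from enum import Enum
from typing import Dict


class MapStatus(Enum):
    """Hypothetical symbolic result codes replacing a plain boolean."""
    OK = 0
    NOT_FOUND = 1  # the multipath device was removed
    ERROR = 2      # the query itself failed


def query_map(maps: Dict[str, str], name: str) -> MapStatus:
    """Stand-in for a device-mapper status lookup (illustrative only)."""
    if name not in maps:
        return MapStatus.NOT_FOUND
    if maps[name] == "unreadable":
        return MapStatus.ERROR
    return MapStatus.OK


def handle_path_check(maps: Dict[str, str], name: str) -> str:
    """Pick an action for the path checker based on the symbolic code."""
    status = query_map(maps, name)
    if status is MapStatus.NOT_FOUND:
        # Map is gone: do not reinstate the path or trigger a reload,
        # which is what resurrects the removed device in this bug.
        return "drop-map"
    if status is MapStatus.ERROR:
        # Placeholder action; the real patch may handle this differently.
        return "retry-later"
    return "reinstate-path"
```

For example, `handle_path_check({"mpatha": "active"}, "gone")` returns `"drop-map"`, while the same call with `"mpatha"` returns `"reinstate-path"`.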
There are 2 additional upstream patches for the lines changed by these patches,
which are NULL pointer checks, and are introduced here too.
[Other Info]
Most patches only need to be applied to Focal as they're already included in Jammy.
One of the additional upstream patches needs to go to Jammy and Lunar, which is
done in bug 2042366 in order to get specific/individual testing on the releases.
Changed in multipath-tools (Ubuntu): | |
status: | New → Fix Released |
description: | updated |
Uploading debdiff for Focal. This is based on the current package in -proposed.