Multipath devices not removed with high load

Bug #2039719 reported by Jorge Merlino
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
multipath-tools (Ubuntu)
Fix Released
Undecided
Unassigned
Focal
Fix Released
Medium
Jorge Merlino

Bug Description

[Impact]

In a server with high volume of multipath volume creation and teardown it can occur a race condition that keeps a multipath volume that should have been removed with no devices or with failed or unknown devices.

In particular this occurs when a multipath device is removed during, or immediately before the call to check_path(). A missing multipath device will cause update_multipath_strings() to fail, setting pp->dmstate to PSTATE_UNDEF.

If the path is up, this state will cause reinstate_path() to be called, which will also fail. This will trigger a reload, restoring the recently removed device in an invalid state.

The command "multipathd show config" fails with "timeout receiving packet".

The command "multipath -ll" shows errors and names/paths as '#' (unexpected):

    Sep 28 20:31:20 | 65:208: cannot find block device
    Sep 28 20:31:20 | 65:208: Empty device name
    Sep 28 20:31:20 | 65:208: Empty device name
    Sep 28 20:31:20 | 65:240: cannot find block device
    Sep 28 20:31:20 | 65:240: Empty device name
    Sep 28 20:31:20 | 65:240: Empty device name
    Sep 28 20:31:20 | 65:224: cannot find block device
    Sep 28 20:31:20 | 65:224: Empty device name
    Sep 28 20:31:20 | 65:224: Empty device name
    Sep 28 20:31:20 | 66:0: cannot find block device
    Sep 28 20:31:20 | 66:0: Empty device name
    Sep 28 20:31:20 | 66:0: Empty device name

    3600a098038305768462b564d48336636 dm-38 ##,##
    size=19G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- #:#:#:# - #:# failed undef unknown
    | `- #:#:#:# - #:# failed undef unknown
    `-+- policy='service-time 0' prio=0 status=enabled
    |- #:#:#:# - #:# failed undef unknown
    `- #:#:#:# - #:# failed undef unknown

[Test Plan]

This can be reproduced by one of our customers on a Kubernetes installation using Netapp Trident for storage orchestration.

Continuously add and remove all the individual paths for storage devices in
order to create and remove multipath devices managed by multipathd, while
checking the output of `multipath -ll`.

[Where problems could occur]

We are merging two patches here (both done by the same person on the same day).
- One just changes the return value of some functions from a boolean to symbolic codes to differentiate errors.
- The other one uses this new error codes to change the behavior of the check_path() function.

These changes mostly affect the error paths so this should not break anything
when the code works without errors, but theoretically might also affect the
normal path of periodically checking path status -- fortunaly this runs very
often, thus regressions should be easy and quick to spot during tests.

There are 2 additional upstream patches for the lines changed by these patches,
which are NULL pointer checks, and are introduced here too.

[Other Info]

Most patches only need to be applied to Focal as they're included Jammy.
One of the additional upstream patches needs to go to Jammy and Lunar, which is
done in bug 2042366 in order to get specific/individual testing on the releases.

Changed in multipath-tools (Ubuntu):
status: New → Fix Released
Revision history for this message
Jorge Merlino (jorge-merlino) wrote :

Uploading debdiff for Focal. This is based over the current package in -proposed.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Subscribed ~se-sponsors.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):

Hi Jorge,

Thanks for the nice SRU template and debdiff!

Patch 1 has one minor nit; no worries.
Patch 2 has an important question and probably needs a change.

It turns out that both patches have fixes upstream which should be applied,
they fix crashes introduced by these code changes.

I tried to reproduce the issue with GDB (time multipathd and map removal)
based on the SRU impact description/patch 2, but did not make it yet.

Therefore, we should probably have the original reporter verify these
new changes still work correctly (added missing hunk in patch 2, and
added fixup commits in patches 3-4).

Please find the debdiff attached, and test packages in ppa:mfo/test.

Thanks,
Mauricio

...

Patch 1:

Changed 'Origin:' keyword to backport; there are context lines changes to upstream:

 @ libmultipath/devmapper.h

 Upstream:
 ...
  void dm_init(int verbosity);
  int dm_prereq(unsigned int *v);
  void skip_libmp_dm_init(void);

 Patch:
 ...
  void dm_init(int verbosity);
  void libmp_dm_init(void);
  void libmp_udev_set_sync_support(int on);

 @ libmultipath/structs_vec.c

 Upstream:
 ...
  static void enter_recovery_mode(struct multipath *mpp)

 Patch:
 ...
  void enter_recovery_mode(struct multipath *mpp)

...

Patch 2:

No backport description or explanation of why changes to reinstate_path() were dropped.

Is that an oversight? The changes related to the different return type are included
(e.g., remove reinstate_path() from 2 `if` checks and call it without return checks).

Apparently, it looks like it; I re-introduced the hunk/backport usage of add_active.

...

Fixups:

These 2 patches have fixes upstream:

 $ git log --oneline -1 e282f54a5d57e
 e282f54a5d57 multipathd fix NULL dereference in check_path

 $ git describe --contains e282f54a5d57e
 0.8.6~1^2~17

 $ git log --oneline -1 7439a41fa12dd
 7439a41fa12d dm_get_map: fix segfault when can't found target

 $ git describe --contains 7439a41fa12dd
 0.9.0^2~8

They're both related to iSCSI, which IIRC can be used by Openstack Cinder volumes (Ubuntu is very popular for Openstack deployments) in multipath mode.

The second fixup should be included in Lunar [update: and Jammy!]:

 $ rmadison -a source multipath-tools
 ...
  multipath-tools | 0.8.3-1ubuntu2 | focal | source
  multipath-tools | 0.8.3-1ubuntu2.1 | focal-security | source
  multipath-tools | 0.8.3-1ubuntu2.1 | focal-updates | source
  multipath-tools | 0.8.3-1ubuntu2.2 | focal-proposed | source
  multipath-tools | 0.8.8-1ubuntu1 | jammy | source
  multipath-tools | 0.8.8-1ubuntu1.22.04.1 | jammy-security | source
  multipath-tools | 0.8.8-1ubuntu1.22.04.1 | jammy-updates | source
  multipath-tools | 0.8.8-1ubuntu1.22.04.3 | jammy-proposed | source
  multipath-tools | 0.8.8-1ubuntu2 | lunar | source
  multipath-tools | 0.8.8-1ubuntu2.2 | lunar-proposed | source
  multipath-tools | 0.9.4-5ubuntu3 | mantic | source
  multipath-tools | 0.9.4-5ubuntu3 | noble | source

Changed in multipath-tools (Ubuntu Focal):
status: New → Incomplete
importance: Undecided → Medium
assignee: nobody → Jorge Merlino (jorge-merlino)
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Revision history for this message
Jorge Merlino (jorge-merlino) wrote :

Hi Mauricio,

Thank you for working on this SRU.

Regarding the changes to reinstate_path(), I dropped them because in this patch they were only cosmetic changes just removing the return value. They did not merge cleanly because the function has another parameter in the original Focal version. So, to minimize the impact of the patch, I skipped that part as the return value is being ignored now so does not change anything if it is there or not. Sorry for not making that clear.

About the second patch you mention (7439a41fa12dd), as it is not present in any version before Mantic, I understand we should backport it to Jammy also right? You only mentioned Lunar.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Jorge,

Thanks for clarifying!

The return value is a functional change, and might generate compiler warnings depending on gcc version (orr even errors if built with `gcc -Werror`, which is not the case here -- the build works), so since it's a simple backport to have the function patched too, let's keep this hunk.

Yes, you're right; I missed Jammy, thanks!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Jorge, I found a synthetic reproducer for commit 7439a41fa12dd, and will report a LP bug to SRU it to Lunar and Jammy.

description: updated
description: updated
description: updated
description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Status update.

Jorge (who's at a company internal event in another timezone)
has passed on to me the information that the original reporter
successfully verified the test packages in ppa:mfo/test, with
confidence that it addressed the issue.

More details about the testing/setup will be provided later,
but I'm happy to upload this for now, based on his feedback.

Uploaded to Focal.

Changed in multipath-tools (Ubuntu Focal):
status: Incomplete → In Progress
Revision history for this message
Jorge Merlino (jorge-merlino) wrote :

Updating with more information about testing. This is the information provided by the original reporter for this issue:

From a high level we ultimately ran tests that were periodically (approx every 5 mins) mounting volumes by creating new multipath entries and /dev/dm-X entries then these would get torn down after use (involved importing ~28GB of data for each volume).

Along side those tests we were running scripts that were checking for entries on the output of "multipath -ll" that aligned with the format:

3600a098038305768315d5650546f5472 dm-55 ##,##
size=28G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- #:#:#:# - #:# failed undef unknown
| `- #:#:#:# - #:# failed undef unknown
`-+- policy='service-time 0' prio=0 status=enabled
  |- #:#:#:# - #:# failed undef unknown
  `- #:#:#:# - #:# failed undef unknown

We effectively performed A,B,A testing in that we first confirmed we were seeing the problem multipath entries in multipath-tools_0.8.3-1ubuntu2.1 & kpartx_0.8.3-1ubuntu2.1 then we cleaned out all error entries then upgraded to the new versions (multipath-tools_0.8.3-1ubuntu2.3 & kpartx_0.8.3-1ubuntu2.3) and allowed them to run for roughly 5 days and confirmed we were no longer seeing the error entries. Once that ran for a bit we then reverted back to the original multipath-tools and confirmed we were seeing error entries again.

Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Jorge, or anyone else affected,

Accepted multipath-tools into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/multipath-tools/0.8.3-1ubuntu2.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in multipath-tools (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Robie Basak (racb) wrote :

The description of lp2039719-2-multipathd-fix-check_path-errors-with-removed-map.patch says:

> Looking through the current kernel code, I can't see any reason why a reinstate would fail, where a reload would help.

Could you please confirm that the relevant kernel code isn't different in the Focal release kernel? Since the kernel ABI is stable, but inferring the ABI from a snapshot of the code doesn't necessarily follow. This seems unlikely so I won't block accepting this to proposed on this though.

> Continuously add and remove all the individual paths for storage devices in order to create and remove multipath devices managed by multipathd...

Unless I misunderstand, could you please clarify the steps you are using to do this?

Minor observation: debian/patches/lp2042366-dm_get_map-fix-segfault-when-can-t-found-target.patch is different in Focal, but only in metadata.

Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (multipath-tools/0.8.3-1ubuntu2.3)

All autopkgtests for the newly accepted multipath-tools (0.8.3-1ubuntu2.3) for focal have finished running.
The following regressions have been reported in tests triggered by the package:

ubuntu-image/1.11+20.04ubuntu1 (amd64, ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/focal/update_excuses.html#multipath-tools

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The autopkgtests issues are unrelated to these changes, and also failed with the migration-reference/0 trigger.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Uploaded a fix for these autopkgtests issues (bug 2044508):
> ubuntu-image/1.11+20.04ubuntu1 (amd64, ppc64el)

Revision history for this message
Jorge Merlino (jorge-merlino) wrote (last edit ):

Hi Robbie / SRU team,

I've compared the reinstate_path function on drivers/md/dm-mpath.c between the focal kernel and the last 5.7 kernel (that was the last released kernel on July 2020) and they are almost the same. There are two additional lines in 5.7 but they can't make the function fail. In fact, that kernel function always returns 0, so the only failure path can be on userspace. On userspace the function dm_reinstate_path can have non zero return values (mainly via the dm_message function) but this code is also not modified in the current patch. Thus, I think patch's original author thoughts should still be valid on focal.

Regarding the testing steps, there is an extended comment (#9) where I explain the testing procedure with some more detail. Is that enough or you still have doubts?

Thanks
Jorge

Revision history for this message
Jorge Merlino (jorge-merlino) wrote :

Testing has been completed using package multipath-tools_0.8.3-1ubuntu2.3_amd64.deb. The testing procedure was the same indicated on comment #9

tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package multipath-tools - 0.8.3-1ubuntu2.3

---------------
multipath-tools (0.8.3-1ubuntu2.3) focal; urgency=medium

  [ Jorge Merlino ]
  * Multipath devices not removed with high load (LP: #2039719)
    - d/p/lp2039719-1-libmultipath-make-dm_get_map-status-return-codes-sym.patch:
      change the return value of some functions from a boolean to symbolic codes
      to differentiate errors
    - d/p/lp2039719-2-multipathd-fix-check_path-errors-with-removed-map.patch:
      use the new error codes to change the error processing on the
      check_path() procedure

  [ Mauricio Faria de Oliveira ]
  * Add upstream fixes:
    - d/p/lp2039719-3-multipathd-fix-NULL-dereference-in-check_path.patch
    - d/p/lp2042366-dm_get_map-fix-segfault-when-can-t-found-target.patch
      (LP: #2042366)

 -- Jorge Merlino <email address hidden> Tue, 07 Nov 2023 18:20:27 -0300

Changed in multipath-tools (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Update Released

The verification of the Stable Release Update for multipath-tools has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.