[SRU] mgr can be very slow in a large ceph cluster

Bug #1906496 reported by dongdong tao
This bug affects 1 person
Affects               Status        Importance  Assigned to            Milestone
Ubuntu Cloud Archive  Fix Released  Undecided   Unassigned
  Queens              Fix Released  High        Ponnuvel Palaniyappan
  Stein               Fix Released  High        Ponnuvel Palaniyappan
  Train               Fix Released  Undecided   Unassigned
  Ussuri              Fix Released  Undecided   Unassigned
ceph (Ubuntu)         Fix Released  High        Ponnuvel Palaniyappan
  Bionic              Fix Released  High        Ponnuvel Palaniyappan
  Focal               Fix Released  Undecided   Unassigned
  Groovy              Fix Released  Undecided   Unassigned
  Hirsute             Fix Released  High        Ponnuvel Palaniyappan

Bug Description

[Impact]
Ceph upstream implemented a new feature [1] that checks for and reports long network ping times between OSDs. It introduced a side effect: ceph-mgr can become very slow because, for some tasks, it has to dump all of the new OSD network ping stats [2]. The effect is worst in clusters with a large number of OSDs.

These OSD network ping stats do not need to be exposed to the Python mgr modules, so dumping them only makes the mgr do more work than necessary. The extra work can make the mgr slow or even hang, and can keep the CPU usage of the mgr process constantly high. The fix is to disable the ping time dump for the mgr Python modules.

In clusters with many OSDs, this resulted in ceph-mgr not responding to commands and/or hanging, after which it had to be restarted.
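To illustrate why skipping the ping stats helps, here is a toy model, not the Ceph code. It assumes each OSD reports one ping entry per heartbeat peer; the names `dump_osd_stats`, `with_net`, `peers_per_osd` and the field names are hypothetical and chosen only for this sketch.

```python
import json


def dump_osd_stats(num_osds, peers_per_osd=10, with_net=True):
    """Build a toy per-OSD stats dump (hypothetical layout, not Ceph's).

    Assumption: every OSD carries one ping-time record per heartbeat peer,
    so the network part of the dump scales with num_osds * peers_per_osd.
    """
    peers = min(peers_per_osd, max(num_osds - 1, 0))
    stats = []
    for osd in range(num_osds):
        entry = {"osd": osd, "kb_used": 0}
        if with_net:
            # One record per peer; this is the part the fix stops dumping
            # for the mgr Python modules.
            entry["network_ping_times"] = [
                {"peer": (osd + k) % num_osds, "last_ms": 0.0}
                for k in range(1, peers + 1)
            ]
        stats.append(entry)
    return stats


def count_ping_entries(stats):
    """Count the ping-time records contained in a dump."""
    return sum(len(e.get("network_ping_times", [])) for e in stats)


if __name__ == "__main__":
    for n in (10, 100, 600):
        full = dump_osd_stats(n, with_net=True)
        slim = dump_osd_stats(n, with_net=False)
        print(n, count_ping_entries(full),
              len(json.dumps(full)), len(json.dumps(slim)))
```

In this model the dump without network stats stays at one small record per OSD, while the full dump grows with the number of (OSD, peer) pairs. Real clusters will differ in the exact numbers.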

[0] is the upstream bug. The fix was backported to Nautilus but rejected for Luminous and Mimic because they have reached EOL upstream. I would still like to backport it to these two releases in Ubuntu/UCA.

The main fix from upstream is [3]. I also found an improvement commit [4] that was submitted later in another PR.

[Test Case]
Deploy a Ceph cluster (Luminous 12.2.13 or Mimic 13.2.9) with a large number of Ceph OSDs (600+). During normal operation of the cluster, the problem would manifest because ceph-mgr dumps the network ping stats regularly. It is relatively hard to reproduce, as ceph-mgr may not always get overloaded and thus may not hang.

A simpler version is to deploy a Ceph cluster with as many OSDs as the hardware/system setup allows (not necessarily 600+) and drive I/O on the cluster for some time (say, 60 mins). Various queries can then be sent to the manager to verify that it still reports and doesn't get stuck.
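The query step of the simplified test could be scripted roughly as follows. This is a minimal sketch, not part of the SRU: the `probe` helper, the 30-second timeout and the choice of queries are assumptions. `ceph pg dump` is the command noted in this report as being truncated in sosreports; `ceph osd df` is assumed here to be another query that depends on the mgr.

```python
import subprocess
import time


def probe(cmd, timeout_s=30.0):
    """Run one command and return (responded, elapsed_seconds).

    A timeout, a non-zero exit status or a missing binary all count as
    "did not respond".
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        ok = False
    return ok, time.monotonic() - start


# Assumed set of queries; adjust to whatever the deployment uses.
QUERIES = [
    ["ceph", "pg", "dump"],
    ["ceph", "osd", "df"],
]

if __name__ == "__main__":
    for cmd in QUERIES:
        ok, elapsed = probe(cmd)
        state = "ok" if ok else "NO RESPONSE"
        print("%-20s %-12s %.1fs" % (" ".join(cmd), state, elapsed))
```

Running this periodically while I/O is being driven gives a simple record of whether the mgr kept answering within the chosen timeout.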

[Regression Potential]
The fix has been accepted upstream (the changes here are in "sync" with upstream to the extent that these old releases match the latest source code) and has been confirmed to work, so the risk is minimal.

At worst, this could affect modules that consume the stats from ceph-mgr (such as prometheus or other monitoring scripts/tools) and make them less useful. It still shouldn't cause any problems for the operation of the cluster itself.

[Other Info]
- In addition to the fix from [3], another commit [4] is also cherry-picked and backported here; it was also accepted upstream.

- Since ceph-mgr hangs when affected, this also impacts sosreport collection: commands time out because the mgr doesn't respond, so info gets truncated or is not collected. This fix should help avoid that problem in sosreports.

[0] https://tracker.ceph.com/issues/43364
[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f

dongdong tao (taodd)
description: updated
summary: - mgr can be very slow within a large ceph cluster
+ mgr can be very slow in a large ceph cluster
Changed in ceph (Ubuntu):
assignee: nobody → Ponnuvel Palaniyappan (pponnuvel)
tags: added: sts
Dan Hill (hillpd) wrote : Re: mgr can be very slow in a large ceph cluster

Just a quick note:

This bug is causing sosreport to time out commands. This can truncate important items like `ceph pg dump` on larger clusters.

Changed in ceph (Ubuntu):
status: New → Incomplete
status: Incomplete → New
status: New → Confirmed
Ponnuvel Palaniyappan (pponnuvel) wrote :

Attaching debdiff for Bionic (Ceph 12.2.13).

Ponnuvel Palaniyappan (pponnuvel) wrote :

Attaching diff for Bionic (Ceph 12.2.13) - same as before but easier to read.

Mauricio Faria de Oliveira (mfo) wrote :

Hi Pon,

Thanks for the backport!

Not sure this is a requirement from the Openstack/Ceph team,
but I imagine that individual commits (3) should go each in
their own .patch files, as usually done with non-cloud SRUs.

I guess this only takes some light changes to your backport.
Happy to help if needed!

cheers,
Mauricio

Changed in ceph (Ubuntu):
status: Confirmed → Won't Fix
status: Won't Fix → In Progress
Ponnuvel Palaniyappan (pponnuvel) wrote :

debdiff for 13.2.9

Ponnuvel Palaniyappan (pponnuvel) wrote :

Patch file for 13.2.9

Mauricio Faria de Oliveira (mfo) wrote :

Hi Pon,

Thanks for clarifying offline that the Openstack/Ceph team is okay with more than one commit per patch file.

So, on the debdiff 'noise', I've previously seen and usually removed that stuff; this seems to be due to the packaging not cleaning up files generated at build time, I guess.

You can use the handy `filterdiff` tool to keep only the files required, and get a cleaner debdiff:

$ cat debdiff-ceph-13.2.9 | filterdiff -i 'ceph-13.2.9/debian/changelog' -i 'ceph-13.2.9/debian/patches/*'

This gives us just the required changed files for the usual pattern of adding a .patch file. :)

Hope this helps,
Mauricio

summary: - mgr can be very slow in a large ceph cluster
+ [SRU] mgr can be very slow in a large ceph cluster
description: updated
Mathew Hodson (mhodson)
Changed in ceph (Ubuntu):
importance: Undecided → High
Changed in ceph (Ubuntu Bionic):
importance: Undecided → Medium
Changed in ceph (Ubuntu):
status: In Progress → Fix Released
Changed in ceph (Ubuntu Groovy):
status: New → Fix Released
Changed in ceph (Ubuntu Focal):
status: New → Fix Released
no longer affects: cloud-archive/victoria
Changed in cloud-archive:
status: New → Fix Released
Changed in ceph (Ubuntu Bionic):
importance: Medium → High
status: New → Triaged
Corey Bryant (corey.bryant) wrote :

Thanks for the patches Ponnuvel.

New versions of ceph with the sponsored changes to fix this bug have been uploaded to stein-staging and the bionic unapproved queue.

@Ponnuvel, would you mind helping to test this once it is available in proposed? I think the [Test Case] section above needs updating. I think it can be something simpler to verify that the fix is working as designed.

Thanks,
Corey

Ponnuvel Palaniyappan (pponnuvel) wrote :

@Corey, yes, I am happy to do the SRU verification when the packages are available. I've updated the [Test case] section to note a simplified, functional test.

description: updated
description: updated
description: updated
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello dongdong, or anyone else affected,

Accepted ceph into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed
Robie Basak (racb) wrote :

Hello dongdong, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.13-0ubuntu0.18.04.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ceph (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
Ponnuvel Palaniyappan (pponnuvel) wrote :

SRU tests:
Deployed a cluster with many OSDs with the new packages; I/O was driven from a VM (both reads & writes). Enabled a number of mgr modules, too. Under load, the cluster was functioning and the mgr was still responding. Attaching some relevant info on the tests here.

Ponnuvel Palaniyappan (pponnuvel) wrote :

Stein UCA 13.2.9-0ubuntu0.19.04.1~cloud3

Ponnuvel Palaniyappan (pponnuvel) wrote :

Bionic 12.2.13-0ubuntu0.18.04.6

tags: added: verification-needed-done verification-stein-done
removed: verification-needed-bionic verification-stein-needed
Corey Bryant (corey.bryant) wrote :

Hello dongdong, or anyone else affected,

Accepted ceph into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Changed in ceph (Ubuntu Bionic):
assignee: nobody → Ponnuvel Palaniyappan (pponnuvel)
description: updated
Ponnuvel Palaniyappan (pponnuvel) wrote :

Queens verification of 12.2.13-0ubuntu0.18.04.6~cloud0

Ran same tests on Queens - ceph mgr was functional and responsive under cluster load.

tags: added: verification-done verification-queens-done
removed: verification-needed verification-queens-needed
Mathew Hodson (mhodson)
tags: added: verification-bionic-done
removed: verification-needed-done
tags: added: verification-done-bionic
removed: verification-bionic-done
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 12.2.13-0ubuntu0.18.04.6

---------------
ceph (12.2.13-0ubuntu0.18.04.6) bionic; urgency=medium

  * d/p/bug1906496.patch: disable network stats in
    dump_osd_stats (LP: #1906496)

 -- Ponnuvel Palaniyappan <email address hidden> Mon, 07 Dec 2020 18:15:24 +0000

Changed in ceph (Ubuntu Bionic):
status: Fix Committed → Fix Released
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 13.2.9-0ubuntu0.19.04.1~cloud3
---------------

 ceph (13.2.9-0ubuntu0.19.04.1~cloud3) bionic-stein; urgency=medium
 .
   * d/p/bug1906496.patch: disable network stats in
     dump_osd_stats (LP: #1906496)

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 12.2.13-0ubuntu0.18.04.6~cloud0
---------------

 ceph (12.2.13-0ubuntu0.18.04.6~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 ceph (12.2.13-0ubuntu0.18.04.6) bionic; urgency=medium
 .
   * d/p/bug1906496.patch: disable network stats in
     dump_osd_stats (LP: #1906496)
