cluster log slow request spam

Bug #1909162 reported by Dan Hill
This bug affects 1 person
Affects               Status        Importance  Assigned to  Milestone
Ubuntu Cloud Archive  Fix Released  High        Unassigned
  Train               Fix Released  High        gerald.yang
  Ussuri              Fix Released  High        gerald.yang
ceph (Ubuntu)         Fix Released  High        gerald.yang
  Focal               Fix Released  High        gerald.yang
  Groovy              Fix Released  High        gerald.yang
  Hirsute             Fix Released  High        gerald.yang

Bug Description

[Impact]

A recent change (issue#43975 [0]) was made to slow request logging to include detail on each operation in the cluster logs. With this change, detail for every slow request is always sent to the monitors and added to the cluster logs.

This does not scale. Large, high-throughput clusters can overwhelm their monitors with spurious logs in the event of a performance issue. Disrupting the monitors can then cause further instability in the cluster.

This SRU reverts that change, so OSDs no longer send detail for every slow request they are processing to the cluster log.

The slow request clog change was added in nautilus (14.2.10) and octopus (15.2.0).

[Test Case]

Stress the cluster with a benchmarking tool to generate slow requests and observe the cluster logs.
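
The test case could be scripted roughly as follows. This is a minimal sketch, not a definitive procedure: the `ceph config set` and `rados bench` commands are the ones used in the verification comments on this bug, while the `run` helper, the `DRY_RUN` variable, the pool name `bench_pool`, the `slow request ` search string, and the log path `/var/log/ceph/ceph.log` are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of the test case. DRY_RUN=1 (the default here) only prints the
# commands; set DRY_RUN=0 on a disposable test cluster to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command, and run it unless in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

reproduce_slow_requests() {
    pool="${1:-bench_pool}"   # assumed pool name; the pool must already exist
    # Lower the complaint threshold so ordinary writes are reported as slow.
    run sudo ceph config set osd osd_op_complaint_time 0.1
    # Generate write load for 30 seconds.
    run sudo rados bench -p "$pool" 30 write --no-cleanup
    # Count candidate detail lines in the cluster log on a monitor host
    # (assumed default path and assumed search string).
    run sudo grep -c -F "slow request " /var/log/ceph/ceph.log
}

reproduce_slow_requests bench_pool
```

With the fix applied, the final count is expected to be zero even though the OSD logs still report slow requests.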

[Where problems could occur]

The cluster logs contain detailed debug information on slow requests that is useful for smaller, low-throughput clusters. While these logs are not used by ceph, they may be used by the cluster administrators (for monitoring or alerts). Changing this logging behavior may be unexpected.

[Other Info]

The intent is to re-enable this feature behind a configurable setting, but the solution must be discussed upstream.

The same slow request detail is still available in each OSD's own log by raising the "debug osd" log level to 20.
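
As a sketch, the level can be raised for all OSDs with the same `ceph config set` command that appears in the verification comments on this bug. The `run` helper and `DRY_RUN` variable are illustrative assumptions, and reverting with `ceph config rm` assumes a release that has the centralized configuration database.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command, and run it unless in dry-run mode.
run() { echo "+ $*"; [ "$DRY_RUN" = "1" ] || "$@"; }

# Raise (or change) the OSD debug level for every OSD.
set_osd_debug() {
    run sudo ceph config set osd debug_osd "$1"
}

# Drop the override so OSDs fall back to their default debug level.
reset_osd_debug() {
    run sudo ceph config rm osd debug_osd
}

set_osd_debug 20
```

Level 20 is very verbose, so it is worth calling `reset_osd_debug` once the slow request detail has been collected.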

[0] https://tracker.ceph.com/issues/43975

Dan Hill (hillpd)
tags: added: seg sts
Dan Hill (hillpd)
Changed in ceph (Ubuntu Hirsute):
status: New → In Progress
importance: Undecided → High
Changed in ceph (Ubuntu Groovy):
importance: Undecided → High
Changed in ceph (Ubuntu Focal):
importance: Undecided → High
Changed in cloud-archive:
importance: Undecided → High
Changed in ceph (Ubuntu Groovy):
status: New → In Progress
Changed in ceph (Ubuntu Focal):
status: New → In Progress
Changed in cloud-archive:
status: New → In Progress
Changed in ceph (Ubuntu Focal):
assignee: nobody → gerald.yang (gerald-yang-tw)
Changed in ceph (Ubuntu Groovy):
assignee: nobody → gerald.yang (gerald-yang-tw)
Changed in ceph (Ubuntu Hirsute):
assignee: nobody → gerald.yang (gerald-yang-tw)
tags: added: sts-sru-needed
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "0001-Remove-logging-every-slow-request-details-to-monito.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
gerald.yang (gerald-yang-tw) wrote :

focal patch

gerald.yang (gerald-yang-tw) wrote :

groovy patch

gerald.yang (gerald-yang-tw) wrote :

hirsute patch

gerald.yang (gerald-yang-tw) wrote :

train patch

gerald.yang (gerald-yang-tw) wrote :

ussuri patch

gerald.yang (gerald-yang-tw) wrote :

Opened an upstream Ceph tracker issue for further discussion:
https://tracker.ceph.com/issues/48909

Chris MacNaughton (chris.macnaughton) wrote :

I'm working on this bug for Focal and Groovy (on Octopus) and James Page is going to pick this patch on top of a new Ceph Pacific release. All three of these should be uploaded fairly soon.

James Page (james-page) wrote :

Hirsute is currently blocked as I'm working on an interim release ready for Ceph Pacific - I'm working through a number of 32 bit related issues on armhf which is taking some time due to build durations on this architecture.

I have included the patch for this issue in this work:

https://code.launchpad.net/~ubuntu-server-dev/ubuntu/+source/ceph/+git/ceph/+ref/ubuntu/pacific-snapshot

commit:

https://git.launchpad.net/~ubuntu-server-dev/ubuntu/+source/ceph/commit/?id=295acf66219bcfd7a2059b5d47a6b5120d23db2e

It would be good if we could move forward with the SRUs for focal and groovy prior to this landing in the development release.

Łukasz Zemczak (sil2100) wrote :

Hey James! As we already discussed on IRC: I think it's fine to move ahead on the SRU front without this being released for devel (hirsute) yet, as long as the fix is present and staged for a near-future release (and gets there before hirsute is out). So please proceed, but also keep working on the hirsute part in the meantime.

Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Dan, or anyone else affected,

Accepted ceph into groovy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/15.2.8-0ubuntu0.20.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-groovy to verification-done-groovy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-groovy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ceph (Ubuntu Groovy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-groovy
Changed in ceph (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed-focal
Łukasz Zemczak (sil2100) wrote :

Hello Dan, or anyone else affected,

Accepted ceph into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/15.2.8-0ubuntu0.20.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Chris MacNaughton (chris.macnaughton) wrote :

Hello Dan, or anyone else affected,

Accepted ceph into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
gerald.yang (gerald-yang-tw) wrote :

The ceph package in ussuri-proposed fixes this issue

Detailed steps:
1. deploy bionic+luminous
2. add cloud-archive:train
3. upgrade ceph to nautilus
4. add cloud-archive:ussuri
5. upgrade ceph to octopus
6. set osd_op_complaint_time to 0.1 to generate slow requests
7. run rados bench
8. check that there are slow requests in the OSD logs
9. check that there are NO slow request details in ceph.log

tags: added: verification-ussuri-done
removed: verification-ussuri-needed
gerald.yang (gerald-yang-tw) wrote :

The ceph package in focal-proposed fixes this issue

Detailed steps:
1. deploy focal+octopus
2. add focal-proposed
3. upgrade to 15.2.8-0ubuntu0.20.04.1
4. set osd_op_complaint_time to 0.1 to generate slow requests
5. run rados bench
6. check that there are slow requests in the OSD logs
7. check that there are NO slow request details in ceph.log

tags: added: verification-done-focal
removed: verification-needed-focal
gerald.yang (gerald-yang-tw) wrote :

Correction to comment #15: detailed step 4 should read:
4. add cloud-archive:ussuri-proposed

gerald.yang (gerald-yang-tw) wrote :

The ceph package in groovy-proposed fixes this issue

Detailed steps:
1. deploy groovy+octopus
2. add groovy-proposed
3. upgrade to 15.2.8-0ubuntu0.20.10.1
4. set osd_op_complaint_time to 0.1 to generate slow requests
5. run rados bench
6. check that there are slow requests in the OSD logs
7. check that there are NO slow request details in ceph.log

tags: added: verification-done verification-done-groovy
removed: verification-needed verification-needed-groovy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 15.2.8-0ubuntu0.20.10.1

---------------
ceph (15.2.8-0ubuntu0.20.10.1) groovy; urgency=medium

  [ Chris MacNaughton ]
  * New upstream point release (LP: #1912355):
    - d/cephadm.install, d/librgw-dev.install, d/librgw2.install: Upstream
      point release removes files that were being installed.
    - d/rules: Remove installation of /etc/sudoers.d/cephadm as it is
      removed upstream.
  * d/p/disable-log-slow-requests.patch: Remove logging every slow request
    details to monitors LP: #1909162).

  [ Ponnuvel Palaniyappan ]
  * d/p/bug1911900-fix-scrub-blocking-balancer.patch:
    Prevent scrub from stopping balancer (LP: #1911900)

 -- Ponnuvel Palaniyappan <email address hidden> Thu, 04 Feb 2021 11:18:13 +0000

Changed in ceph (Ubuntu Groovy):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 15.2.8-0ubuntu0.20.04.1

---------------
ceph (15.2.8-0ubuntu0.20.04.1) focal; urgency=medium

  [ Chris MacNaughton ]
  * New upstream point release (LP: #1912355):
    - d/rules,cephadm.install,librgw-dev.install,librgw2.install: Drop files
      no longer included in point release.
  * d/p/disable-log-slow-requests.patch: Remove logging every slow request
    details to monitors LP: #1909162).

  [ Ponnuvel Palaniyappan ]
  * d/p/bug1911900-fix-scrub-blocking-balancer.patch:
    Prevent scrub from stopping balancer (LP: #1911900)

 -- Ponnuvel Palaniyappan <email address hidden> Thu, 04 Feb 2021 11:28:51 +0000

Changed in ceph (Ubuntu Focal):
status: Fix Committed → Fix Released
Chris MacNaughton (chris.macnaughton) wrote :

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package ceph - 15.2.8-0ubuntu0.20.04.1~cloud0
---------------

 ceph (15.2.8-0ubuntu0.20.04.1~cloud0) bionic-ussuri; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive.
 .
 ceph (15.2.8-0ubuntu0.20.04.1) focal; urgency=medium
 .
   [ Chris MacNaughton ]
   * New upstream point release (LP: #1912355):
     - d/rules,cephadm.install,librgw-dev.install,librgw2.install: Drop files
       no longer included in point release.
   * d/p/disable-log-slow-requests.patch: Remove logging every slow request
     details to monitors LP: #1909162).
 .
   [ Ponnuvel Palaniyappan ]
   * d/p/bug1911900-fix-scrub-blocking-balancer.patch:
     Prevent scrub from stopping balancer (LP: #1911900)

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 16.1.0-0ubuntu2

---------------
ceph (16.1.0-0ubuntu2) hirsute; urgency=medium

  * No change rebuild with fixed ownership.

 -- Dimitri John Ledkov <email address hidden> Tue, 16 Feb 2021 15:12:17 +0000

Changed in ceph (Ubuntu Hirsute):
status: In Progress → Fix Released
Changed in cloud-archive:
status: In Progress → Fix Committed
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 16.1.0-0ubuntu3~cloud0
---------------

 ceph (16.1.0-0ubuntu3~cloud0) focal-wallaby; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive.
 .
 ceph (16.1.0-0ubuntu3) hirsute; urgency=medium
 .
   * d/p/issue49494.patch: Cherry pick fix for issue with preprocessor
     logic which causes backport failures to focal.
   * d/p/bug1917414.patch: Cherry pick fix to isa-l to remove use of text
     relocation calls which cause ceph-osd and ceph-mon daemons to fail
     to start (LP: #1917414).
 .
 ceph (16.1.0-0ubuntu2) hirsute; urgency=medium
 .
   * No change rebuild with fixed ownership.
 .
 ceph (16.1.0-0ubuntu1) hirsute; urgency=medium
 .
   * New interim release in preparation for Ceph Pacific.
   * d/p/*: Refresh, drop any patches included upstream.
   * d/control,ceph-mgr-diskprediction-cloud.*: Drop ceph-mgr-
     diskprediction-cloud package, feature dropped upstream.
   * d/ceph-mgr-modules-core.install: Include new snap_schedule and stats
     modules.
   * d/ceph-osd.install: Include ceph-erasure-code-tool binary.
   * d/control: Add libcryptsetup-dev to BD's.
   * d/control: Add liblua5.3-dev and luarocks to BD's.
   * d/control: Drop use of python3-six.
   * d/control: Add python3-jinja2 to Depends of ceph-mgr-cephadm.
   * d/libcephfs-dev.install: Add new Types.h header.
   * d/librgw{2,-dev}.install: Drop header and so for librgw_admin_user.
   * d/python3-cephfs.install: Drop install of ceph_volume_client.py.
   * New upstream snapshot for Pacific release.
   * d/control: Add libboost-filesystem-dev to BD's, bump boost minimum
     version to 1.74.0.
   * d/rules: Install grafana dashboards.
   * d/p/fix-boost-1.74-build.patch: Resolve build failure with boost
     1.74/c++ 17.
   * d/rules: Drop install of cephadm sudoers configuration.
   * d/cephadm.install: Drop sudoers file, include manpage.
   * d/*.symbols: Update for new release.
   * d/control,rules: Enable use of boost context for riscv64 as its no
     longer an optional dependency.
   * d/p/32bit-fixes.patch: Fix issues with mismatched size_t max
     comparision on armhf.
   * d/p/disable-log-slow-requests.patch: Remove logging every slow request
     details to monitors LP: #1909162).

Changed in cloud-archive:
status: Fix Committed → Fix Released
Chris MacNaughton (chris.macnaughton) wrote : Please test proposed package

Hello Dan, or anyone else affected,

Accepted ceph into train-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:train-proposed
  sudo apt-get update

Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-train-needed to verification-train-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-train-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-train-needed
Dan Hill (hillpd) wrote :

Verified the ceph package in nautilus-proposed (14.2.22-0ubuntu0.19.10.1~cloud2)

Detailed steps:
1. deployed bionic+nautilus with `--force`
2. added nautilus-proposed
3. upgraded ceph to `14.2.22-0ubuntu0.19.10.1~cloud2` and restarted ceph services
4. ran a write benchmark: `sudo rados bench -p bench_pool 30 write --no-cleanup`
5. lowered slow request complaint threshold: `sudo ceph config set osd osd_op_complaint_time 0.1`
6. increased osd debug: `sudo ceph config set osd debug_osd 20`
7. verified slow request debug detail is present in the osd logs
8. verified slow request debug detail is NOT present in the cluster log (ceph.log)
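
The last two checks could be automated along these lines. This is only a sketch under stated assumptions: the `slow request ` marker is an assumed substring of the per-operation detail lines, and the function names and the log paths in the usage note are hypothetical, so adjust them to match the release and deployment being tested.

```shell
#!/bin/sh
# Assumed marker for per-operation slow request detail lines.
PATTERN="${PATTERN:-slow request }"

# Print the number of lines in a log file that contain the marker.
count_slow_request_detail() {
    # grep -c exits non-zero when the count is 0, so ignore its status.
    grep -c -F -- "$PATTERN" "$1" || true
}

# Succeed only when the OSD log contains detail lines and the cluster
# log contains none. Returns 2 if either file cannot be read.
verify_fix() {
    osd_log="$1"
    cluster_log="$2"
    [ -r "$osd_log" ] && [ -r "$cluster_log" ] || return 2
    [ "$(count_slow_request_detail "$osd_log")" -gt 0 ] &&
        [ "$(count_slow_request_detail "$cluster_log")" -eq 0 ]
}
```

Example usage, assuming the default log locations:

  verify_fix /var/log/ceph/ceph-osd.0.log /var/log/ceph/ceph.log && echo "verified"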

tags: added: verification-train-done
removed: verification-train-needed
Chris MacNaughton (chris.macnaughton) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Chris MacNaughton (chris.macnaughton) wrote :

This bug was fixed in the package ceph - 14.2.22-0ubuntu0.19.10.1~cloud2
---------------

 ceph (14.2.22-0ubuntu0.19.10.1~cloud2) bionic; urgency=medium
 .
   * d/p/disable-log-slow-requests.patch: Remove logging every slow
     request details to monitors LP: #1909162).
