5.19 not reporting cgroups v1 blkio.throttle.io_serviced

Bug #2016186 reported by Jared Ledvina (Datadog)
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Incomplete    Undecided   Unassigned
Kinetic         Fix Released  Undecided   Unassigned
Lunar           Fix Released  Undecided   Unassigned
Mantic          Incomplete    Undecided   Unassigned

Bug Description

[Impact]

Commit f382fb0bcef4 ("block: remove legacy IO schedulers") introduced a behavior change in the blkio throttle cgroup subsystem: IO statistics are not reported anymore unless a throttling rule is explicitly defined, because the current code only counts bios that are actually throttled.

This behavior change may break user-space applications that rely on
the old behavior (see the original bug report below).
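
The effect described above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the class name, its fields, and the `count_only_throttled` switch are hypothetical and exist purely to contrast the two behaviors.

```python
from dataclasses import dataclass


@dataclass
class ToyThrottleGroup:
    """Toy stand-in for a blkio throttle group; not kernel code."""

    has_rules: bool = False            # is a throttling rule configured?
    count_only_throttled: bool = True  # True models the changed behavior
    io_serviced: int = 0
    io_service_bytes: int = 0

    def submit_bio(self, size_bytes: int) -> None:
        """Account for one I/O request of size_bytes bytes."""
        if self.count_only_throttled and not self.has_rules:
            # Without a rule the request is never throttled,
            # so in this model it is never counted either.
            return
        self.io_serviced += 1
        self.io_service_bytes += size_bytes
```

With the defaults, submitting requests leaves both counters at 0, which mirrors the symptom in this report. Setting `has_rules=True`, or setting `count_only_throttled=False` to model the 5.15 behavior, makes the counters advance.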

[Test case]

 - mount cgroup v1
 - create a blkio cgroup
 - move a task into the blkio cgroup
 - perform some I/O (e.g., with dd)
 - read the IO stats for the cgroup (blkio.throttle.io_serviced and blkio.throttle.io_service_bytes in cgroupfs)
 - IO stats are all 0, unless a throttle rule is defined

The previous behavior (kernel 5.15) was to show I/O statistics even without throttling rules defined.
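
The last two steps of the test case can be checked programmatically. The following is a minimal Python sketch, assuming the usual cgroup v1 stat layout (per-device lines such as `259:0 Read 1234` followed by a final `Total` line). The function names are hypothetical and not part of the original report.

```python
from pathlib import Path


def parse_blkio_stats(text):
    """Parse cgroup v1 blkio stat text into {device: {operation: count}}.

    Per-device lines look like "259:0 Read 1234"; the trailing
    "Total 5678" line (no device) is stored under the key "Total".
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            device, operation, count = fields
            stats.setdefault(device, {})[operation] = int(count)
        elif len(fields) == 2 and fields[0] == "Total":
            stats["Total"] = int(fields[1])
    return stats


def all_zero(stats):
    """Return True when neither the total nor any device counter is non-zero."""
    if stats.get("Total", 0) != 0:
        return False
    per_device = [ops for key, ops in stats.items() if key != "Total"]
    return all(count == 0 for ops in per_device for count in ops.values())


def read_throttle_stats(cgroup_dir):
    """Read both throttle stat files from a blkio cgroup directory."""
    cgroup = Path(cgroup_dir)
    names = ("blkio.throttle.io_serviced", "blkio.throttle.io_service_bytes")
    return {name: parse_blkio_stats((cgroup / name).read_text()) for name in names}
```

For example, after performing some I/O from a task in a cgroup mounted at a hypothetical path such as `/sys/fs/cgroup/blkio/test`, calling `read_throttle_stats` on that path and passing each result to `all_zero` would be expected to return `True` on an affected kernel and `False` on 5.15.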

[Fix]

Apply / backport this fix:

https://<email address hidden>/t/

[Regression potential]

The fix affects the block IO cgroup subsystem, so any regressions introduced by it would most likely show up in that subsystem.

[Original bug report]

Hi,

I'm still investigating but am a bit stuck. Here's what I've found so far.

Today I've upgraded some nodes in AWS EC2 from the previous v5.15 linux-aws package to the recently published v5.19 package and rebooted. It seems that even when there's disk activity, the files:

/sys/fs/cgroup/blkio/blkio.throttle.io_serviced
/sys/fs/cgroup/blkio/blkio.throttle.io_service_bytes

Are only ever populated with 0's. Previously, on v5.15, these would reflect the actual disk usage. No other system configuration changes were applied, just the kernel upgrade and reboot. I've also verified that simply rebooting a v5.15 node where this does work doesn't break the reporting. These EC2 instances are running with cgroups v1 due to other compatibility issues and I suspect that might be the issue. So far, I cannot find any differences: mtab shows the same v1 mount setup, and the kernel options match between v5.15 and v5.19.
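
The mount comparison mentioned above can be scripted. The following Python sketch assumes the standard `/proc/mounts` field order (source, mount point, filesystem type, options) and relies on cgroup v1 hierarchies using the filesystem type `cgroup`, while v2 uses `cgroup2`. The function name is hypothetical.

```python
def find_blkio_v1_mounts(mounts_text):
    """Return mount points of cgroup v1 hierarchies that include blkio.

    mounts_text is the content of /proc/mounts (or /etc/mtab).
    """
    mount_points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _source, mount_point, fs_type, options = fields[:4]
        # cgroup v1 lists its controllers among the mount options.
        if fs_type == "cgroup" and "blkio" in options.split(","):
            mount_points.append(mount_point)
    return mount_points
```

Running it on the content of `/proc/mounts` from a 5.15 node and a 5.19 node and comparing the two results is one way to confirm that both expose the same blkio hierarchy.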

I'm more than happy to fetch whatever info would help out here. I'd love to get 5.19 working for us, but we really need the data from these files.

Info:
Prior version that works: Linux ip-10-128-168-154 5.15.0-1031-aws #35-Ubuntu SMP Fri Feb 10 02:07:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Upgraded version that's broken: Linux ip-10-128-166-219 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

EC2 instances built off of the published 22.04 LTS AMI in us-east-1.

description: updated
Jared Ledvina (Datadog) (jaredledvina-dd) wrote :

A few clarifications from IRC:
1. We run all of our Ubuntu 22.04 LTS nodes with the kernel args 'systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=true' to force cgroups v1 as, unfortunately, we cannot safely turn on cgroups v2 yet (that's another pile of work I want to do!).
2. If you install 'linux-modules-extra-aws', 'modprobe bfq', and then 'echo bfq > /sys/block/nvme0n1/queue/scheduler' you will see stats in the '/sys/fs/cgroup/blkio/blkio.bfq.io_service*' files.
3. However, we continue to only see 0's in the '/sys/fs/cgroup/blkio/blkio.throttle.io_service*' files.
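
The setup in points 1 and 2 can be verified with a short script. The sketch below, in Python, parses the content of `/proc/cmdline` and of a `queue/scheduler` file, where the active scheduler is shown in square brackets. The function names are hypothetical, the command-line parser ignores quoting, and the cgroup check only matches the literal values quoted in point 1.

```python
def parse_kernel_cmdline(cmdline):
    """Split a kernel command line into {name: value}; flags map to None."""
    args = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        args[key] = value if sep else None
    return args


def uses_reported_cgroup_v1_args(cmdline):
    """Return True if both arguments from point 1 are present as quoted."""
    args = parse_kernel_cmdline(cmdline)
    return (
        args.get("systemd.unified_cgroup_hierarchy") == "0"
        and args.get("systemd.legacy_systemd_cgroup_controller") == "true"
    )


def active_scheduler(scheduler_text):
    """Return the bracketed (active) entry of a queue/scheduler file."""
    for name in scheduler_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None
```

For the device in point 2, the scheduler text would come from `/sys/block/nvme0n1/queue/scheduler`, and `active_scheduler` should return `bfq` after the `echo` command has been applied.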

Potentially an upstream change, but definitely something that breaks with the '5.19.0.1022.23~22.04.6' Jammy package update. For me, this likely means I need to pin everything to the older 5.15 package pending cgroups v2 working or a fix to this. Obviously I'd prefer having this fixed so that we can get to 5.19 and stick w/ cgroups v1. I'd also offer a note that pushing 5.19 to Jammy without this support feels like a breaking change. I'm more worried that _other_ cgroups v1 controllers aren't working in a way I haven't noticed yet. Anyway, thanks so much for the help so far and gimme a holler if I can test/confirm anything else!

Andrea Righi (arighi) wrote :

It might be worth reporting this upstream as well; apparently there was a behavior change in the blkio controller in cgroup v1 that happened at some point between 5.15 and 5.19. I'll do more tests and will let you know if I find anything relevant. Thanks for reporting this!

Jared Ledvina (Datadog) (jaredledvina-dd) wrote :

I've updated this to report the issue for linux-azure and linux-gcp as their jammy-updates repos have recently updated to kernel 5.19 and appear to be affected as well. I could use some guidance on reporting this upstream, as 5.19 doesn't seem to be a supported kernel version (looking at https://www.kernel.org/), so it's unclear what the correct way to go about that is.

Related, https://packages.ubuntu.com/jammy-updates/linux-gcp-lts-22.04 doesn't exist as of right now. If that could be published similar to https://packages.ubuntu.com/jammy-updates/linux-azure-lts-22.04 and https://packages.ubuntu.com/jammy-updates/linux-aws-lts-22.04 that'd be a huge help for me.

Andrea Righi (arighi) wrote :

Potential upstream fix: https://<email address hidden>/

However, this seems to only partially restore the old cgroup v1 behavior, because we still need to set I/O throttling limits in order to get the I/O statistics.

We may need an additional fix like this to completely restore the old behavior: https://lore.kernel.org/lkml/ZEwY5Oo+5inO9UFf@righiandr-XPS-13-7390/

I'll follow the upstream thread; if we come up with a reasonable fix, I'll take care of preparing a proper SRU for it.

Andrea Righi (arighi)
no longer affects: linux-aws (Ubuntu)
no longer affects: linux-azure (Ubuntu)
no longer affects: linux-gcp (Ubuntu)
Andrea Righi (arighi)
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2016186

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Kinetic):
status: New → Incomplete
Changed in linux (Ubuntu Lunar):
status: New → Incomplete
Andrea Righi (arighi)
description: updated
Changed in linux (Ubuntu Kinetic):
status: Incomplete → Fix Committed
Changed in linux (Ubuntu Lunar):
status: Incomplete → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.19.0-44.45 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux verification-needed-kinetic
Jared Ledvina (Datadog) (jaredledvina-dd) wrote :

Hey Andrea,
Thanks for the help getting this all fixed up. I see that the change is committed for Lunar and Kinetic.

Is there a good way for me to follow when this'll land for the Ubuntu Jammy linux-aws, linux-gcp, and linux-azure packages?

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-5.19/5.19.0-1014.14 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-nvidia-5.19 verification-needed-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-hwe-6.2/6.2.0-23.23~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-hwe-6.2
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-6.2/6.2.0-1003.3~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-nvidia-6.2
Luke Nowakowski-Krijger (lukenow) wrote :

Hi Jared, you can follow the release of all of the Jammy cloud kernels at https://kernel.ubuntu.com/sru/dashboards/web/kernel-stable-board.html . This fix is included in the 2023.05.15 cycle, and the kernels should be released to updates in the next week.

- Luke

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.19.0-45.46

---------------
linux (5.19.0-45.46) kinetic; urgency=medium

  * kinetic/linux: 5.19.0-45.46 -proposed tracker (LP: #2023057)

  * Kinetic update: upstream stable patchset 2023-05-23 (LP: #2020599)
    - wifi: cfg80211: Partial revert "wifi: cfg80211: Fix use after free for wext"

linux (5.19.0-44.45) kinetic; urgency=medium

  * kinetic/linux: 5.19.0-44.45 -proposed tracker (LP: #2019827)

  * Linux 5.19 amdgpu: NULL pointer on GCN2 and invalid load on GCN1
    (LP: #2018470)
    - drm/amdgpu: Fix for BO move issue

  * CVE-2023-32233
    - netfilter: nf_tables: deactivate anonymous set from preparation phase

  * CVE-2023-2612
    - SAUCE: shiftfs: prevent lock unbalance in shiftfs_create_object()

  * CVE-2023-31436
    - net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg

  * CVE-2023-1380
    - wifi: brcmfmac: slab-out-of-bounds read in brcmf_get_assoc_ies()

  * conntrack mark is not advertised via netlink (LP: #2016269)
    - netfilter: ctnetlink: revert to dumping mark regardless of event type

  * 5.19 not reporting cgroups v1 blkio.throttle.io_serviced (LP: #2016186)
    - SAUCE: blk-throttle: Fix io statistics for cgroup v1

  * [SRU] Backport request for hpwdt from upstream 6.1 to Jammy (LP: #2008751)
    - watchdog/hpwdt: Enable HP_WATCHDOG for ARM64 systems.
    - watchdog/hpwdt: Include nmi.h only if CONFIG_HPWDT_NMI_DECODING
    - [Config] Add arm64 option to CONFIG_HP_WATCHDOG

  * vmwgfx fails to reserve graphics buffer on aarch64 leading to blank display
    (LP: #2007001)
    - SAUCE: Revert "video/aperture: Disable and unregister sysfb devices via
      aperture helpers"

  * Ubuntu 22.04 raise abnormal NIC MSI-X requests with larger CPU cores (256)
    (LP: #2012335)
    - ice: Allow operation with reduced device MSI-X

  * Dell: Enable speaker mute hotkey LED indicator (LP: #2015972)
    - platform/x86: dell-laptop: Register ctl-led for speaker-mute

  * [SRU]With "Performance per Watt (DAPC)" enabled in the BIOS, Bootup time is
    taking longer than expected (LP: #2008527)
    - cpufreq: ACPI: Defer setting boost MSRs

  * [SRU][Jammy] CONFIG_PCI_MESON is not enabled (LP: #2007745)
    - [Config] arm64: Enable PCI_MESON module

  * Kinetic update: upstream stable patchset 2023-05-08 (LP: #2018948)
    - HID: asus: use spinlock to protect concurrent accesses
    - HID: asus: use spinlock to safely schedule workers
    - powerpc/mm: Rearrange if-else block to avoid clang warning
    - ARM: OMAP2+: Fix memory leak in realtime_counter_init()
    - arm64: dts: qcom: qcs404: use symbol names for PCIe resets
    - arm64: dts: qcom: msm8996-tone: Fix USB taking 6 minutes to wake up
    - arm64: dts: qcom: sm8150-kumano: Panel framebuffer is 2.5k instead of 4k
    - arm64: dts: qcom: sm6125: Reorder HSUSB PHY clocks to match bindings
    - arm64: dts: imx8m: Align SoC unique ID node unit address
    - ARM: zynq: Fix refcount leak in zynq_early_slcr_init
    - arm64: dts: mediatek: mt8183: Fix systimer 13 MHz clock description
    - arm64: dts: qcom: sdm845-db845c: fix audio codec interrupt pin name
    - arm64: dts: qcom: sc7180: correct SPMI bus addres...

Changed in linux (Ubuntu Kinetic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 6.2.0-23.23

---------------
linux (6.2.0-23.23) lunar; urgency=medium

  * lunar/linux: 6.2.0-23.23 -proposed tracker (LP: #2019845)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - debian/dkms-versions -- update from kernel-versions (main/2023.05.15)

  * Fix flicker display problem on some panels which support PSR2 (LP: #2002968)
    - drm/i915/psr: Add continuous full frame bit together with single

  * Kernel 6.1 bumped the disk consumption on default images by 15%
    (LP: #2015867)
    - [Packaging] introduce a separate linux-lib-rust package

  * Update I915 PSR calculation on Linux 6.2 (LP: #2018655)
    - drm/i915: Fix fast wake AUX sync len
    - drm/i915: Explain the magic numbers for AUX SYNC/precharge length

  * Computer with Intel Atom CPU will not boot with Kernel 6.2.0-20
    (LP: #2017444)
    - [Config]: Disable CONFIG_INTEL_ATOMISP

  * udev fails to make prctl() syscall with apparmor=0 (as used by maas by
    default) (LP: #2016908)
    - SAUCE: (no-up) Stacking v38: Fix prctl() syscall with apparmor=0

  * CVE-2023-32233
    - netfilter: nf_tables: deactivate anonymous set from preparation phase

  * CVE-2023-2612
    - SAUCE: shiftfs: prevent lock unbalance in shiftfs_create_object()

  * CVE-2023-31436
    - net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg

  * CVE-2023-1380
    - wifi: brcmfmac: slab-out-of-bounds read in brcmf_get_assoc_ies()

  * 5.19 not reporting cgroups v1 blkio.throttle.io_serviced (LP: #2016186)
    - SAUCE: blk-throttle: Fix io statistics for cgroup v1

  * LSM stacking and AppArmor for 6.2: additional fixes (LP: #2017903)
    - SAUCE: (no-up) apparmor: fix policy_compat perms remap for file dfa
    - SAUCE: (no-up) apparmor: fix profile verification and enable it
    - SAUCE: (no-up) apparmor: fix: add missing failure check in
      compute_xmatch_perms
    - SAUCE: (no-up) apparmor: fix: kzalloc perms tables for shared dfas

  * Lunar update: v6.2.12 upstream stable release (LP: #2017219)
    - Revert "pinctrl: amd: Disable and mask interrupts on resume"
    - drm/amd/display: Pass the right info to drm_dp_remove_payload
    - drm/i915: Workaround ICL CSC_MODE sticky arming
    - ALSA: emu10k1: fix capture interrupt handler unlinking
    - ALSA: hda/sigmatel: add pin overrides for Intel DP45SG motherboard
    - ALSA: i2c/cs8427: fix iec958 mixer control deactivation
    - ALSA: hda: patch_realtek: add quirk for Asus N7601ZM
    - ALSA: hda/realtek: Add quirks for Lenovo Z13/Z16 Gen2
    - ALSA: firewire-tascam: add missing unwind goto in
      snd_tscm_stream_start_duplex()
    - ALSA: emu10k1: don't create old pass-through playback device on Audigy
    - ALSA: hda/sigmatel: fix S/PDIF out on Intel D*45* motherboards
    - ALSA: hda/hdmi: disable KAE for Intel DG2
    - Bluetooth: L2CAP: Fix use-after-free in l2cap_disconnect_{req,rsp}
    - Bluetooth: Fix race condition in hidp_session_thread
    - bluetooth: btbcm: Fix logic error in forming the board name.
    - Bluetooth: Free potentially unfreed SCO connection
    - Bluetooth: hci_conn: Fix possible UAF
    - btrfs: restore the thread_pool=...

Changed in linux (Ubuntu Lunar):
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/6.2.0-1009.9 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar' to 'verification-done-lunar'. If the problem still exists, change the tag 'verification-needed-lunar' to 'verification-failed-lunar'.

tags: added: kernel-spammed-lunar-linux-azure verification-needed-lunar