Enable lowlatency settings in the generic kernel

Bug #2051342 reported by Andrea Righi
This bug affects 1 person
Affects (Status / Importance / Assigned to):
 - linux (Ubuntu): Fix Released / Undecided / Unassigned
 - linux (Ubuntu) Noble: Fix Released / Undecided / Unassigned

Bug Description

[Impact]

Ubuntu provides the "lowlatency" kernel: a kernel optimized for applications that have special "low latency" requirements.

Currently, this kernel does not include any Ubuntu-specific SAUCE patches to address these extra "low latency" requirements; the only difference from the generic kernel is a small subset of .config options.

Almost all of these options are now configurable either at boot time or even at run time, with the only exception being CONFIG_HZ (250 in the generic kernel vs 1000 in the lowlatency kernel).

Maintaining a separate kernel for a single config option seems overkill and incurs a significant cost in engineering hours, build time, regression testing time and resources. Not to mention the risk of the lowlatency kernel falling behind and not being perfectly in sync with the latest generic kernel.

Enabling the low-latency settings in the generic kernel has been evaluated before, but it was never finalized due to the potential risk of performance regressions in CPU-intensive applications (increasing HZ from 250 to 1000 may introduce more kernel jitter in number-crunching workloads). The original proposal instead resulted in a re-classification of the lowlatency kernel as a desktop-oriented kernel, enabling additional low-latency features (LP: #2023007).

As we approach the release of Ubuntu 24.04, we may want to reconsider merging the low-latency settings into the generic kernel.

What follows is a detailed analysis of the specific low-latency features:

 - CONFIG_NO_HZ_FULL=y: enable access to "full tickless mode" (shut down the clock tick, when possible, on all enabled CPUs that are either idle or running a single task; this reduces the kernel jitter imposed on running tasks by the periodic clock tick, and must be enabled at boot time by passing `nohz_full=<cpu_list>`). This can actually help CPU-intensive workloads and could provide far more benefit than the CONFIG_HZ difference (since it can potentially eliminate all tick-induced kernel jitter on specific CPUs). This one should really be enabled anyway, considering that it is configurable at boot time.

 - CONFIG_RCU_NOCB_CPU=y: move RCU callbacks from softirq context to kthread context (reducing the time spent in softirqs with preemption disabled, which improves overall system responsiveness, at the cost of a potential performance penalty, because RCU callbacks are now processed by kernel threads). This should be enabled as well, since it is configurable at boot time (via the `rcu_nocbs=<cpu_list>` parameter).

 - CONFIG_RCU_LAZY=y: batch RCU callbacks and flush them after a timed delay instead of executing them immediately (this can provide 5~10% power savings for idle or lightly-loaded systems, which is extremely useful for laptops / portable devices - https://<email address hidden>/). This has the potential to introduce significant performance regressions, but in the Noble kernel we already have a SAUCE patch that allows enabling/disabling this option at boot time (see LP: #2045492), and by default it will be disabled (CONFIG_RCU_LAZY_DEFAULT_OFF=y).

 - CONFIG_HZ=1000: last but not least, the only option that is tunable *only* at compile time. As already mentioned, there is a potential risk of regressions for CPU-intensive applications, but these can be mitigated (and perhaps even outweighed) with NO_HZ_FULL. On the other hand, HZ=1000 can improve system responsiveness, which means most desktop and server applications will benefit from it (the largest part of server workloads is I/O-bound rather than CPU-bound, so they can benefit from a kernel that reacts faster when switching tasks), not to mention the benefit for typical end-user applications (gaming, live conferencing, multimedia, etc.).
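All of the knobs above except CONFIG_HZ are controlled from the kernel command line (`nohz_full=`, `rcu_nocbs=`, and, with the Noble SAUCE patch, `rcutree.enable_rcu_lazy=`). As a purely illustrative sketch (the function name and the example CPU list `2-7` are invented for this note, not part of the proposal), this is how a user-space tool could inspect a command line, such as the contents of `/proc/cmdline`, for these options:

```python
def parse_lowlatency_opts(cmdline: str) -> dict:
    """Extract the low-latency boot options discussed above from a
    kernel command line string (e.g. the contents of /proc/cmdline)."""
    opts = {"nohz_full": None, "rcu_nocbs": None, "rcu_lazy": False}
    for token in cmdline.split():
        if token.startswith("nohz_full="):
            opts["nohz_full"] = token.split("=", 1)[1]   # e.g. "2-7"
        elif token.startswith("rcu_nocbs="):
            opts["rcu_nocbs"] = token.split("=", 1)[1]   # e.g. "all"
        elif token == "rcutree.enable_rcu_lazy=1":
            opts["rcu_lazy"] = True                      # lazy RCU enabled

    return opts

print(parse_lowlatency_opts(
    "quiet splash nohz_full=2-7 rcu_nocbs=all rcutree.enable_rcu_lazy=1"))
# -> {'nohz_full': '2-7', 'rcu_nocbs': 'all', 'rcu_lazy': True}
```

Note that none of these options takes effect unless explicitly passed at boot, which is what keeps the default behavior of such a merged kernel close to today's generic kernel.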

With all of that in place we can provide a kernel that has the flexibility to be more responsive, more performant and more power efficient (therefore more "generic"), simply by tuning run-time and boot-time options.

Moreover, once these changes are applied we will be able to deprecate the lowlatency kernel, saving engineering time and also reducing power consumption (required to build the kernel and do all the testing).

Optionally, we can also provide optimal "lowlatency" settings as a user-space package that would set the proper options in the kernel boot command line (GRUB, or similar).

[Test case]

There are plenty of benchmarks that can prove the validity of each of the settings mentioned above, showing substantial benefits in terms of system responsiveness.

However, our main goal here is to mitigate as much as possible the risk of regression for CPU-intensive applications, so the test case should only be focused on this particular aspect, to evaluate the impact of this change in the worst case scenario.

Test case (CPU-intensive stress test):

 - stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief

Metrics:

 - measure the bogo ops printed to stdout (not a great metric for real-world applications, but in this case it can show the impact of the additional kernel jitter introduced by the different CONFIG_HZ)

Results (linux-unstable 6.8.0-2.2, avg of 10 runs of 5min each):

 - CONFIG_HZ=250 : 17415.60 bogo ops/s
 - CONFIG_HZ=1000 : 14866.05 bogo ops/s
 - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s

Results confirm the theory about the performance drop for CPU-intensive workloads (~ -14%), but also confirm the benefit of NO_HZ_FULL (~ +6%) compared to the current HZ setting.
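For reference, the percentages quoted here follow directly from the bogo ops figures in the table; a quick arithmetic check (no assumptions beyond the numbers above):

```python
# Average bogo ops/s from the table above (10 runs of 5 min each)
hz250, hz1000, hz1000_nohz = 17415.60, 14866.05, 18505.52

def pct_change(new: float, base: float) -> float:
    """Percent change of `new` relative to `base`."""
    return (new - base) / base * 100.0

print(f"HZ=1000 vs HZ=250:           {pct_change(hz1000, hz250):+.1f}%")
# -> -14.6%
print(f"HZ=1000+nohz_full vs HZ=250: {pct_change(hz1000_nohz, hz250):+.1f}%")
# -> +6.3%
```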

Let's also keep in mind that this is a worst-case and very specific scenario, where mostly HPC / scientific applications would be affected, and even in that case we can always compensate, and actually achieve better performance, by exploiting the nohz_full capability.

[Fix]

Enable the .config options mentioned above in the generic kernel (only on amd64 and arm64 for now).

[Regression potential]

As already covered we may experience performance regressions in CPU-intensive (number crunching) applications (such as HPC for example), but they can be compensated by the NO_HZ_FULL boot-time option.

Andrea Righi (arighi)
description: updated
Revision history for this message
Doug Smythies (dsmythies) wrote :

I found this bug report by accident, while searching for something else.
I pretty much only use mainline kernels and only 1000 Hertz.
I support this proposed default Ubuntu kernel configuration change.

The tick ISR is incredibly efficient (less than 2 uSec on my test system), and I do not understand your test results as I would have expected a lot less difference. Using mainline kernel 6.8-rc1 and limiting my test system max CPU frequency so that I get similar bogo op/s as you I get:

 - CONFIG_HZ=250 : 14853.14 bogo ops/s
 - CONFIG_HZ=1000 : 14714.01 bogo ops/s (0.94% worse)
 - CONFIG_HZ=1000+nohz_full : 15100 bogo ops/s (1.6% better)

There is no power or thermal or active cores throttling on my test system.
Note: with my CPU frequency limited for this test, the tick ISR takes about 4 uSec.
The difference between the 250 and 1000 Hz kernel tests is 750 extra tick interrupts per second, which at ~4 uSec per tick is about 3 milliseconds per second, or about 0.3%.

If I do not limit my max CPU frequency I get:

 - CONFIG_HZ=250 : 38518.64 bogo ops/s
 - CONFIG_HZ=1000 : 37765.55 bogo ops/s (2.0% worse)
 - CONFIG_HZ=1000+nohz_full : 39391.32 bogo ops/s (2.3% better)

There was no power or thermal or active cores throttling for this test. However I did have to raise my processor max temperature limit from 75 to 80 degrees C for this test. Also the power is very close to the limit at 122 watts, where it will throttle at 125 watts.

Doug Smythies (dsmythies) wrote :

Before my post yesterday, I had never used the stress-ng utility, and only did so to repeat the originally posted test case. However, there was run-to-run variability with the stress-ng test that I could not understand for a 100% user, 0% system, type program. I decided to retest using one of my own CPU loading programs. I also did only 1 thread and forced CPU affinity.

I also tried to get a more accurate estimate of the tick ISR execution time. Trace only does time stamping to the microsecond, making it difficult. For the 1000Hz kernel and a 100 second trace, 100000 tick ISRs were captured. min 0, average 0.7, max 2 uSec.

250Hz kernel test time (less is better): 300.14 seconds (100% user 0% system).

For the 1000 Hertz kernel the extra execution time prediction is:
0.7 uSec/tick * 750 extra ticks/sec * 300.14 sec = 0.16 sec
For a predicted execution time of 300.30 seconds.

1000Hz kernel test time: 300.32 seconds.

1000Hz+nohz_full test time: 300.08 seconds.

In an attempt to get a better model the processor CPU frequency was changed from 4.8 GHz to 0.80 GHz.
Tick ISR: min 3, average 4.0, max 11 uSec.
Note: of the 100000 tick ISRs captured the 11 uSec one was a one time outlier.

250Hz kernel test time: 429.25 seconds

For the 1000 Hertz kernel the extra execution time prediction is:
4.0 uSec/tick * 750 extra ticks/sec * 429.25 sec = 1.29 sec
For a predicted execution time of 430.53 seconds.

1000Hz kernel test time: 430.38 seconds.
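The back-of-the-envelope model used in the comment above (extra ticks per second, times average tick ISR cost, times wall-clock time) can be reproduced in a few lines; the figures are the ones quoted in the comment, and the small mismatches against the measured times are exactly the residual being discussed:

```python
EXTRA_TICKS_PER_SEC = 1000 - 250  # 750 extra tick interrupts/s at HZ=1000

def predicted_1000hz_time(base_seconds: float, tick_isr_us: float) -> float:
    """Predict the 1000 Hz run time from the 250 Hz run time, assuming the
    only added cost is the extra tick ISR executions."""
    overhead = tick_isr_us * 1e-6 * EXTRA_TICKS_PER_SEC * base_seconds
    return base_seconds + overhead

# 4.8 GHz case: 0.7 uSec average tick ISR, 300.14 s baseline
print(f"{predicted_1000hz_time(300.14, 0.7):.2f} s")  # measured: 300.32 s
# 0.8 GHz case: 4.0 uSec average tick ISR, 429.25 s baseline
print(f"{predicted_1000hz_time(429.25, 4.0):.2f} s")  # measured: 430.38 s
```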

Colin Ian King (colin-king) wrote (last edit ):

It may be worth trying a wider range of synthetic benchmarks to see how it affects scheduling, I/O, RCU and power consumption.

Andrea Righi (arighi) wrote :

@dsmythies thank you so much for sharing the results of your tests, really useful info!

I'm planning to do more tests setting the performance governor. I've been doing my initial tests only with the default Ubuntu settings, that is, with the "Balanced" power mode enabled (which I think uses intel p-states), so that may have affected my results.

I'm also curious to measure the power consumption of 250 vs 1000, since I expect to see a little extra power consumption with HZ=1000. But then I also want to enable the lazy RCU (boot with `rcu_nocbs=all rcutree.enable_rcu_lazy=1`) and see how much power we can save vs the impact on performance.

Overall, the point that I want to prove is that, yes, we may have small regressions here and there, but with these changes in place we can provide huge flexibility for users, who will be able to tune their system for improved responsiveness, CPU throughput, or power consumption, all using the default stock kernel.

Andrea Righi (arighi) wrote :

@colin-king thanks! Any suggestions in particular? I was thinking of lmbench, netperf and fio.

Colin Ian King (colin-king) wrote :

@Andrea, that's a good start, but it may be worth running some of the Phoronix Tests too as they are a good spread of use cases.

Colin Ian King (colin-king) wrote :

Looks like Michael Larabel has done some analysis for you already :-) https://www.phoronix.com/news/Ubuntu-Generic-LL-Kernel

Andrea Righi (arighi) wrote :

@colin-king noticed, I just left a thank you message in the article. I'll still do the tests, but it's nice to see that someone else is contributing to this!

Another thing I'd like to do is use bpftrace to measure the time spent in the tick handler before and after these changes are applied, because we call the tick handler more often with HZ=1000, so it'd be nice to see the distribution of this overhead.

And another cool thing that I've been told is that HZ=1000 can actually help in terms of power consumption, because apparently the CPUs have more chances to enter idle states quickly. This is also something that I'd like to measure.

Lastique (andysem) wrote :

I couldn't find this in the benchmark description on Phoronix, so I'm assuming the lowlatency kernel was booted with default parameters. Which means CONFIG_NO_HZ_FULL basically had no effect. This is probably fair for most users who won't specify `nohz_full` kernel parameter and will observe the performance difference between generic and lowlatency as shown by Phoronix tests.

But the claim in this ticket is that `nohz_full` can potentially win back some performance losses caused by CONFIG_HZ=1000. It would be useful if testing could confirm or disprove that. Ideally, Phoronix Test Suite would need to be run on generic, lowlatency without any extra kernel parameters and lowlatency with `nohz_full` parameter.

Andrea Righi (arighi) wrote :

@andysem correct, without `nohz_full` specified at boot CONFIG_NO_HZ_FULL has no effect, except for the little extra overhead that it adds to the tick handler (there is still some overhead with this option enabled, even if it's not used). That's why I'd like to measure the time spent in some of the tick callbacks (using bpftrace) with generic vs lowlatency, to better understand this extra overhead.

Then, yes, repeating the tests with the ticks disabled on some CPUs (booting with nohz_full) would be another interesting metric.

Doug Smythies (dsmythies) wrote :

For the Stress-NG 0.16.04: pts/stress-ng-1.11.0 [Test: Socket Activity] I get:

6633.31 Bogo Ops/s on a 1000Hz kernel. 0.9% improvement.
6572.92 Bogo Ops/s on a 250 Hz kernel.

I did this in a hurry, and will re-test tonight or tomorrow.

Lastique (andysem) wrote :

While testing locally, I have found this problem: https://bugs.launchpad.net/ubuntu/+source/linux-signed-lowlatency-hwe-6.5/+bug/2051733

If this is indeed a valid bug and not something weirdly specific to my system, I'd say `nohz_full` is non-functional.

pauldoo (paul-richards) wrote :

The generic Ubuntu kernel has dynamic preempt enabled. This allows the preemption model to be changed at runtime between: none, voluntary (the default), and full.

It would be super interesting to test what impact this has for latency sensitive workloads, and whether this can help you make the lowlatency kernel build redundant.

Related discussion on preemption model for Fedora: https://pagure.io/fedora-workstation/issue/228

Doug Smythies (dsmythies) wrote (last edit ):

I redid the Phoronix Stress-NG 0.16.04: Socket Activity test, the one that showed such a dramatic difference in their test. I increased the number of test runs from 3 to 10 and time per run from 30 to 60 seconds.

I got:
250Hz kernel (generic): 6608.74 Bogo Op/s, Deviation 0.56%
1000Hz kernel (lowlatency): 6591.27 Bogo Op/s, Deviation 0.39%, 0.3% performance degradation
My kernel was mainline 6.8-rc1 using Ubuntu kernel configurations.

I posted about this on the Phoronix thread.

By the way, for readers that do not have the Phoronix stuff installed, I think the equivalent direct stress-ng command is:

stress-ng --sock -1 --no-rand-seed --sock-zerocopy --timeout 2m --metrics-brief

Lastique (andysem) wrote :

@arighi: Did you consider applying lowlatency settings to generic kernel but keeping CONFIG_HZ=250?

I suspect, the majority of desktop usage issues (i.e. responsiveness) would be solved by `preempt=full rcu_nocbs=all` being the default (or whatever the equivalent boot options are). While keeping CONFIG_HZ=250 would mitigate the potential throughput loss.

Andrea Righi (arighi) wrote :

@andysem yes that is also another possibility. The idea is to do a lot of tests and if we find that there's a remote possibility to introduce significant performance regressions in certain cases we can still keep HZ=250, but definitely go with the other options.

Shane Turner (turner) wrote :

I haven't spotted any tests here, on Discourse, or on Phoronix that show a workload with a significant improvement. Am I incorrect? Please keep in mind that I'm not suggesting this initiative not move forward, as there appears to be plenty of evidence building up that there are no significant regressions.

Andrea Righi (arighi) wrote :

@turner basically any type of I/O workload can benefit from the HZ=1000 change. I haven't posted any test results because, in the scope of this proposal, I wanted to focus mainly on the regression potential for CPU-intensive workloads and the benefits in terms of power consumption (the latter as a positive side effect).

I will also post some numbers to better highlight the benefits of the low latency features.

Andrea Righi (arighi)
Changed in linux (Ubuntu Noble):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 6.8.0-20.20

---------------
linux (6.8.0-20.20) noble; urgency=medium

  * noble/linux: 6.8.0-20.20 -proposed tracker (LP: #2058221)

  * Noble update: v6.8.1 upstream stable release (LP: #2058224)
    - x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
    - Documentation/hw-vuln: Add documentation for RFDS
    - x86/rfds: Mitigate Register File Data Sampling (RFDS)
    - KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
    - Linux 6.8.1

  * Autopkgtest failures on amd64 (LP: #2048768)
    - [Packaging] update to clang-18

  * Miscellaneous Ubuntu changes
    - SAUCE: apparmor4.0.0: LSM stacking v39: fix build error with
      CONFIG_SECURITY=n
    - [Config] amd64: MITIGATION_RFDS=y

 -- Paolo Pisati <email address hidden> Mon, 18 Mar 2024 11:08:14 +0100

Changed in linux (Ubuntu Noble):
status: Fix Committed → Fix Released