Answering question 2. I have done a comprehensive performance analysis based on the benchmark application.

Note: The SRU changes how the sys_membarrier syscall is used. The implementation we want to change to in this SRU never blocks, while the previous implementation does. This makes performance analysis entirely workload dependent: on busy servers with many background processes, sys_membarrier will block more often than on quiet servers with no background processes. The following is based on a quiet server with no background processes.

Test parameters
===============

Ubuntu 18.04.4
KVM, 2 vcpus
0.10.1 liburcu
4.15.0-99-generic

Test program "test_urcu[_bp]": http://paste.ubuntu.com/p/5vXVycQjYk/
(the only difference between the two is which header is #include'd: <urcu.h> vs <urcu-bp.h>)
A minimal sketch of the reader/writer loop such a benchmark runs is included after the patched Bionic results below.

No changes to source code
=========================

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 6065490002 nr_writes 237 nr_ops 6065490239
nr_reads 6476219475 nr_writes 186 nr_ops 6476219661
nr_reads 6474789528 nr_writes 183 nr_ops 6474789711
nr_reads 6476326433 nr_writes 188 nr_ops 6476326621
nr_reads 6479298142 nr_writes 179 nr_ops 6479298321
nr_reads 6476429569 nr_writes 186 nr_ops 6476429755
nr_reads 6478019994 nr_writes 191 nr_ops 6478020185
nr_reads 6479117595 nr_writes 183 nr_ops 6479117778
nr_reads 6478302181 nr_writes 185 nr_ops 6478302366
nr_reads 6481003399 nr_writes 191 nr_ops 6481003590

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 644339902 nr_writes 485 nr_ops 644340387
nr_reads 644092800 nr_writes 1101 nr_ops 644093901
nr_reads 644676446 nr_writes 494 nr_ops 644676940
nr_reads 643845915 nr_writes 500 nr_ops 643846415
nr_reads 645156053 nr_writes 502 nr_ops 645156555
nr_reads 644626421 nr_writes 497 nr_ops 644626918
nr_reads 644710679 nr_writes 495 nr_ops 644711174
nr_reads 644445530 nr_writes 503 nr_ops 644446033
nr_reads 645150707 nr_writes 497 nr_ops 645151204
nr_reads 643681268 nr_writes 496 nr_ops 643681764

Commits c0bb9f and 374530 patched in
====================================

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 4097663510 nr_writes 6516 nr_ops 4097670026
nr_reads 4177088332 nr_writes 4183 nr_ops 4177092515
nr_reads 4153780077 nr_writes 1907 nr_ops 4153781984
nr_reads 4150954044 nr_writes 3942 nr_ops 4150957986
nr_reads 4267855073 nr_writes 2102 nr_ops 4267857175
nr_reads 4131310825 nr_writes 7119 nr_ops 4131317944
nr_reads 4183771431 nr_writes 1919 nr_ops 4183773350
nr_reads 4270944170 nr_writes 4958 nr_ops 4270949128
nr_reads 4123277225 nr_writes 4228 nr_ops 4123281453
nr_reads 4266997284 nr_writes 1723 nr_ops 4266999007

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 6530208343 nr_writes 8860 nr_ops 6530217203
nr_reads 6514357222 nr_writes 10568 nr_ops 6514367790
nr_reads 6517420660 nr_writes 9534 nr_ops 6517430194
nr_reads 6510005433 nr_writes 11799 nr_ops 6510017232
nr_reads 6492226563 nr_writes 12517 nr_ops 6492239080
nr_reads 6532405460 nr_writes 6548 nr_ops 6532412008
nr_reads 6514205150 nr_writes 9686 nr_ops 6514214836
nr_reads 6481643486 nr_writes 16167 nr_ops 6481659653
nr_reads 6509268022 nr_writes 10582 nr_ops 6509278604
nr_reads 6523168701 nr_writes 9066 nr_ops 6523177767
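
For readers unfamiliar with the test: here is a minimal sketch of the kind of reader/writer loop such a benchmark runs, to show what nr_reads/nr_writes/nr_ops count. This is not the actual test program from the paste above; the thread handling, the shared structure and the counters are simplified and illustrative, and it assumes the usual test_urcu argument meaning of 6 reader threads, 2 writer threads, 10 seconds.

/* sketch.c - illustrative urcu reader/writer benchmark loop, not test_urcu.c
 * build (assumption): gcc -O2 sketch.c -o sketch -lurcu -lpthread
 */
#include <urcu.h>          /* swap for <urcu-bp.h> and -lurcu-bp for the bp flavour */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct payload { int value; };

static struct payload *shared;
static volatile int stop;
static unsigned long long nr_reads, nr_writes;

static void *reader(void *arg)
{
	unsigned long long local = 0;

	rcu_register_thread();            /* reader threads must register */
	while (!stop) {
		rcu_read_lock();          /* read-side critical section: very cheap */
		struct payload *p = rcu_dereference(shared);
		if (p)
			(void)p->value;
		rcu_read_unlock();
		local++;
	}
	rcu_unregister_thread();
	__atomic_add_fetch(&nr_reads, local, __ATOMIC_RELAXED);
	return NULL;
}

static void *writer(void *arg)
{
	unsigned long long local = 0;

	while (!stop) {
		struct payload *new = malloc(sizeof(*new));
		new->value = 42;
		/* Publish the new version, then wait for a grace period before
		 * freeing the old one. synchronize_rcu() is where sys_membarrier
		 * gets invoked in the membarrier-based flavour. */
		struct payload *old = rcu_xchg_pointer(&shared, new);
		synchronize_rcu();
		free(old);
		local++;
	}
	__atomic_add_fetch(&nr_writes, local, __ATOMIC_RELAXED);
	return NULL;
}

int main(void)
{
	pthread_t r[6], w[2];
	int i;

	for (i = 0; i < 6; i++)
		pthread_create(&r[i], NULL, reader, NULL);
	for (i = 0; i < 2; i++)
		pthread_create(&w[i], NULL, writer, NULL);
	sleep(10);                        /* 6 readers, 2 writers, 10 seconds */
	stop = 1;
	for (i = 0; i < 6; i++)
		pthread_join(r[i], NULL);
	for (i = 0; i < 2; i++)
		pthread_join(w[i], NULL);
	printf("nr_reads %llu nr_writes %llu nr_ops %llu\n",
	       nr_reads, nr_writes, nr_reads + nr_writes);
	return 0;
}

The point of the sketch is that the read side is extremely cheap, while every write pays for a grace period (synchronize_rcu), which is where the cost of the sys_membarrier command shows up.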
Comparing and contrasting with 20.04:
=====================================

Test Parameters:
================

Ubuntu 20.04 LTS
KVM, 2 vcpus
0.11.1 liburcu
5.4.0-29-generic

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 4270089636 nr_writes 1638 nr_ops 4270091274
nr_reads 4281598850 nr_writes 3008 nr_ops 4281601858
nr_reads 4241230576 nr_writes 3612 nr_ops 4241234188
nr_reads 4230643208 nr_writes 5367 nr_ops 4230648575
nr_reads 4333495124 nr_writes 1354 nr_ops 4333496478
nr_reads 4291295097 nr_writes 3545 nr_ops 4291298642
nr_reads 4232582737 nr_writes 1983 nr_ops 4232584720
nr_reads 4268926719 nr_writes 3363 nr_ops 4268930082
nr_reads 4266736459 nr_writes 4881 nr_ops 4266741340
nr_reads 4313525276 nr_writes 4549 nr_ops 4313529825

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 6848011482 nr_writes 3171 nr_ops 6848014653
nr_reads 6842990129 nr_writes 4577 nr_ops 6842994706
nr_reads 6862298832 nr_writes 2875 nr_ops 6862301707
nr_reads 6849848255 nr_writes 4292 nr_ops 6849852547
nr_reads 6846387545 nr_writes 4975 nr_ops 6846392520
nr_reads 6860547626 nr_writes 3376 nr_ops 6860551002
nr_reads 6853028794 nr_writes 2784 nr_ops 6853031578
nr_reads 6846021299 nr_writes 3383 nr_ops 6846024682
nr_reads 6833359957 nr_writes 5917 nr_ops 6833365874
nr_reads 6851224193 nr_writes 2432 nr_ops 6851226625

Comparing and contrasting with 14.04:
=====================================

Test Parameters:
================

Ubuntu 14.04.6 LTS
KVM, 2 vcpus
0.7.12 liburcu
3.13.0-170-generic

ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu 6 2 10
nr_reads 284080749 nr_writes 790657 nr_ops 284871406
nr_reads 283785838 nr_writes 647058 nr_ops 284432896
nr_reads 273424217 nr_writes 1535098 nr_ops 274959315
nr_reads 283550711 nr_writes 1442548 nr_ops 284993259
nr_reads 282557773 nr_writes 946106 nr_ops 283503879
nr_reads 286811777 nr_writes 837176 nr_ops 287648953
nr_reads 273278986 nr_writes 1738549 nr_ops 275017535
nr_reads 287141686 nr_writes 652772 nr_ops 287794458
nr_reads 287697411 nr_writes 982440 nr_ops 288679851
nr_reads 281468419 nr_writes 830736 nr_ops 282299155

ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu_bp 6 2 10
nr_reads 670447719 nr_writes 16731 nr_ops 670464450
nr_reads 670464435 nr_writes 9970 nr_ops 670474405
nr_reads 670235233 nr_writes 4932 nr_ops 670240165
nr_reads 670853867 nr_writes 6845 nr_ops 670860712
nr_reads 670970962 nr_writes 307 nr_ops 670971269
nr_reads 670346111 nr_writes 8161 nr_ops 670354272
nr_reads 669748209 nr_writes 6824 nr_ops 669755033
nr_reads 671242419 nr_writes 249 nr_ops 671242668
nr_reads 670318007 nr_writes 8990 nr_ops 670326997
nr_reads 669872685 nr_writes 269 nr_ops 669872954

Analysis
========

From the two Bionic tests, nr_ops for test_urcu goes from 6065490239 (unpatched) to 4097670026 (patched), a roughly one-third drop in raw throughput. However, comparing against Focal, the patched result is in line with what you would expect there: 4097670026 vs 4270091274.

For test_urcu_bp, the two Bionic tests show a dramatic difference: nr_ops goes from 644340387 (unpatched) to 6530217203 (patched), roughly a 10x improvement. These numbers are again in line with what you would expect on Focal, which reached 6848014653 operations. Comparing to Trusty, we see a substantial performance improvement across the board.

The next question is whether this benchmark is an appropriate demonstration of performance. Since the SRU changes which sys_membarrier command is used, we should also profile the syscall itself, as that better reflects behaviour in real workloads. Because the unpatched version blocks in sys_membarrier, we would expect the syscall to be invoked far less often.
Perf Performance Analysis on "sys_enter_membarrier" Tracepoint
==============================================================

No changes to source code
=========================

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads 5641721906 nr_writes 932 nr_ops 5641722838
607 syscalls:sys_enter_membarrier
nr_reads 6168632959 nr_writes 248 nr_ops 6168633207
595 syscalls:sys_enter_membarrier
nr_reads 6481069225 nr_writes 185 nr_ops 6481069410
567 syscalls:sys_enter_membarrier

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads 644124499 nr_writes 501 nr_ops 644125000
1 syscalls:sys_enter_membarrier
nr_reads 646275413 nr_writes 2287 nr_ops 646277700
1 syscalls:sys_enter_membarrier
nr_reads 644021303 nr_writes 494 nr_ops 644021797
1 syscalls:sys_enter_membarrier

Commits c0bb9f and 374530 patched in
====================================

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads 4322995476 nr_writes 3320 nr_ops 4322998796
835874 syscalls:sys_enter_membarrier
nr_reads 4210380395 nr_writes 2206 nr_ops 4210382601
883042 syscalls:sys_enter_membarrier
nr_reads 4233636203 nr_writes 3280 nr_ops 4233639483
867184 syscalls:sys_enter_membarrier

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads 6539807379 nr_writes 5289 nr_ops 6539812668
10578 syscalls:sys_enter_membarrier
nr_reads 6500401303 nr_writes 13287 nr_ops 6500414590
26574 syscalls:sys_enter_membarrier
nr_reads 6518640060 nr_writes 8780 nr_ops 6518648840
17560 syscalls:sys_enter_membarrier

Analysis
========

Now, this is some interesting data. With unchanged Bionic source code, we see 607 sys_membarrier syscalls over 10 seconds for test_urcu, and 1 for test_urcu_bp. In reality that 1 is effectively 0, due to commit [1] 64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6, which removed the use of sys_membarrier in urcu-bp because of the major performance problems blocking syscalls caused in LTTng.

[1] https://github.com/urcu/userspace-rcu/commit/64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6
(note this was backported to the 0.10.1 stable release, and is in Bionic)

Looking at the patched version, the test_urcu sys_membarrier syscall count skyrockets to 835874, roughly a 1377x increase. We went from about 60 syscalls/sec to about 83,587 syscalls/sec, which more or less demonstrates that the patched liburcu no longer blocks in the syscall: each call exits quickly, so far more of them can be issued.

The patches also re-enable the use of sys_membarrier in the urcu-bp variant, where we see the syscall invoked on the order of 10,000 - 20,000 times over 10 seconds. This is behind the massive 10x increase in the number of operations the test performed: the read side went from executing userspace memory barriers on every operation to relying on the writer's kernel membarrier syscalls for ordering, which is much faster overall.

Conclusion
==========

This SRU changes liburcu to use the MEMBARRIER_CMD_PRIVATE_EXPEDITED command of the sys_membarrier syscall instead of the previous MEMBARRIER_CMD_SHARED command. MEMBARRIER_CMD_SHARED blocks, as it must wait for all threads in the system to agree on the view of memory, while MEMBARRIER_CMD_PRIVATE_EXPEDITED only requires agreement between the threads of the calling process and is guaranteed to never block. With the non-blocking behaviour, sys_membarrier completes much more quickly and can be issued many more times per second than the previous, blocking implementation.
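
For reference, a minimal sketch of how the two commands are invoked directly. The command constants come from the kernel UAPI header <linux/membarrier.h>; the membarrier() wrapper function is my own (there is no glibc wrapper on these releases), and the expedited command needs a 4.14+ kernel plus a one-time per-process registration.

/* membarrier_sketch.c - illustrative, not liburcu code */
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int membarrier(int cmd, unsigned int flags)
{
	/* Raw syscall; no glibc wrapper is assumed to exist. */
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	/* Old behaviour: waits until every thread on the system has passed
	 * through a full memory barrier, so it can block for a long time
	 * on a busy machine. */
	if (membarrier(MEMBARRIER_CMD_SHARED, 0) < 0)
		perror("MEMBARRIER_CMD_SHARED");

	/* New behaviour: the process registers its intent once... */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) < 0)
		perror("MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED");

	/* ...and subsequent calls only interrupt the CPUs currently running
	 * threads of this process, then return without blocking. */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) < 0)
		perror("MEMBARRIER_CMD_PRIVATE_EXPEDITED");

	return 0;
}

As I understand the patches, liburcu performs the registration once during initialisation and then issues the expedited command from its grace-period code, only falling back to the older behaviour when the expedited command is unavailable.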
For most workloads, not getting stuck on a blocking call to sys_membarrier should improve application performance. While the benchmark does show a roughly one-third drop in operations for the normal urcu variant, that result is in line with what you would expect from the current state of the art in Focal. I believe this SRU is a net benefit to the performance of liburcu.