Answering question 2. I have done a comprehensive performance analysis based on the benchmark application.

Note: The SRU changes how the sys_membarrier syscall is used. The implementation we want to change to in this SRU never blocks, while the previous implementation does. This makes performance analysis entirely workload dependent: on busy servers with many background processes, sys_membarrier will block more often than on quiet servers with no background processes. The following is based on a quiet server with no background processes.

Test parameters
===============

Ubuntu 18.04.4
KVM, 2 vcpus
0.10.1 liburcu
4.15.0-99-generic

Test program "test_urcu[_bp]": http://paste.ubuntu.com/p/5vXVycQjYk/
(the only difference between the two is which header is #include'd: <urcu.h> vs <urcu-bp.h>)
A minimal sketch of the reader/writer loop such a benchmark runs is included after the patched Bionic results below.

No changes to source code
=========================

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 6065490002 nr_writes 237 nr_ops 6065490239
nr_reads 6476219475 nr_writes 186 nr_ops 6476219661
nr_reads 6474789528 nr_writes 183 nr_ops 6474789711
nr_reads 6476326433 nr_writes 188 nr_ops 6476326621
nr_reads 6479298142 nr_writes 179 nr_ops 6479298321
nr_reads 6476429569 nr_writes 186 nr_ops 6476429755
nr_reads 6478019994 nr_writes 191 nr_ops 6478020185
nr_reads 6479117595 nr_writes 183 nr_ops 6479117778
nr_reads 6478302181 nr_writes 185 nr_ops 6478302366
nr_reads 6481003399 nr_writes 191 nr_ops 6481003590

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 644339902 nr_writes 485 nr_ops 644340387
nr_reads 644092800 nr_writes 1101 nr_ops 644093901
nr_reads 644676446 nr_writes 494 nr_ops 644676940
nr_reads 643845915 nr_writes 500 nr_ops 643846415
nr_reads 645156053 nr_writes 502 nr_ops 645156555
nr_reads 644626421 nr_writes 497 nr_ops 644626918
nr_reads 644710679 nr_writes 495 nr_ops 644711174
nr_reads 644445530 nr_writes 503 nr_ops 644446033
nr_reads 645150707 nr_writes 497 nr_ops 645151204
nr_reads 643681268 nr_writes 496 nr_ops 643681764

Commits c0bb9f and 374530 patched in
====================================

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 4097663510 nr_writes 6516 nr_ops 4097670026
nr_reads 4177088332 nr_writes 4183 nr_ops 4177092515
nr_reads 4153780077 nr_writes 1907 nr_ops 4153781984
nr_reads 4150954044 nr_writes 3942 nr_ops 4150957986
nr_reads 4267855073 nr_writes 2102 nr_ops 4267857175
nr_reads 4131310825 nr_writes 7119 nr_ops 4131317944
nr_reads 4183771431 nr_writes 1919 nr_ops 4183773350
nr_reads 4270944170 nr_writes 4958 nr_ops 4270949128
nr_reads 4123277225 nr_writes 4228 nr_ops 4123281453
nr_reads 4266997284 nr_writes 1723 nr_ops 4266999007

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 6530208343 nr_writes 8860 nr_ops 6530217203
nr_reads 6514357222 nr_writes 10568 nr_ops 6514367790
nr_reads 6517420660 nr_writes 9534 nr_ops 6517430194
nr_reads 6510005433 nr_writes 11799 nr_ops 6510017232
nr_reads 6492226563 nr_writes 12517 nr_ops 6492239080
nr_reads 6532405460 nr_writes 6548 nr_ops 6532412008
nr_reads 6514205150 nr_writes 9686 nr_ops 6514214836
nr_reads 6481643486 nr_writes 16167 nr_ops 6481659653
nr_reads 6509268022 nr_writes 10582 nr_ops 6509278604
nr_reads 6523168701 nr_writes 9066 nr_ops 6523177767
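
For readers unfamiliar with the test: here is a minimal sketch of the kind of reader/writer loop such a benchmark runs, to show what nr_reads/nr_writes/nr_ops count. This is not the actual test program from the paste above; the thread handling, the shared structure and the counters are simplified and illustrative, and it assumes the usual test_urcu argument meaning of 6 reader threads, 2 writer threads, 10 seconds.

/* sketch.c - illustrative urcu reader/writer benchmark loop, not test_urcu.c
 * build (assumption): gcc -O2 sketch.c -o sketch -lurcu -lpthread
 */
#include <urcu.h>          /* swap for <urcu-bp.h> and -lurcu-bp for the bp flavour */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct payload { int value; };

static struct payload *shared;
static volatile int stop;
static unsigned long long nr_reads, nr_writes;

static void *reader(void *arg)
{
	unsigned long long local = 0;

	rcu_register_thread();            /* reader threads must register */
	while (!stop) {
		rcu_read_lock();          /* read-side critical section: very cheap */
		struct payload *p = rcu_dereference(shared);
		if (p)
			(void)p->value;
		rcu_read_unlock();
		local++;
	}
	rcu_unregister_thread();
	__atomic_add_fetch(&nr_reads, local, __ATOMIC_RELAXED);
	return NULL;
}

static void *writer(void *arg)
{
	unsigned long long local = 0;

	while (!stop) {
		struct payload *new = malloc(sizeof(*new));
		new->value = 42;
		/* Publish the new version, then wait for a grace period before
		 * freeing the old one. synchronize_rcu() is where sys_membarrier
		 * gets invoked in the membarrier-based flavour. */
		struct payload *old = rcu_xchg_pointer(&shared, new);
		synchronize_rcu();
		free(old);
		local++;
	}
	__atomic_add_fetch(&nr_writes, local, __ATOMIC_RELAXED);
	return NULL;
}

int main(void)
{
	pthread_t r[6], w[2];
	int i;

	for (i = 0; i < 6; i++)
		pthread_create(&r[i], NULL, reader, NULL);
	for (i = 0; i < 2; i++)
		pthread_create(&w[i], NULL, writer, NULL);
	sleep(10);                        /* 6 readers, 2 writers, 10 seconds */
	stop = 1;
	for (i = 0; i < 6; i++)
		pthread_join(r[i], NULL);
	for (i = 0; i < 2; i++)
		pthread_join(w[i], NULL);
	printf("nr_reads %llu nr_writes %llu nr_ops %llu\n",
	       nr_reads, nr_writes, nr_reads + nr_writes);
	return 0;
}

The point of the sketch is that the read side is extremely cheap, while every write pays for a grace period (synchronize_rcu), which is where the cost of the sys_membarrier command shows up.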
Comparing and contrasting with 20.04:
=====================================

Test Parameters:
================

Ubuntu 20.04 LTS
KVM, 2 vcpus
0.11.1 liburcu
5.4.0-29-generic

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu 6 2 10
nr_reads 4270089636 nr_writes 1638 nr_ops 4270091274
nr_reads 4281598850 nr_writes 3008 nr_ops 4281601858
nr_reads 4241230576 nr_writes 3612 nr_ops 4241234188
nr_reads 4230643208 nr_writes 5367 nr_ops 4230648575
nr_reads 4333495124 nr_writes 1354 nr_ops 4333496478
nr_reads 4291295097 nr_writes 3545 nr_ops 4291298642
nr_reads 4232582737 nr_writes 1983 nr_ops 4232584720
nr_reads 4268926719 nr_writes 3363 nr_ops 4268930082
nr_reads 4266736459 nr_writes 4881 nr_ops 4266741340
nr_reads 4313525276 nr_writes 4549 nr_ops 4313529825

ubuntu@ubuntu:~/userspace-rcu/tests/benchmark$ ./test_urcu_bp 6 2 10
nr_reads 6848011482 nr_writes 3171 nr_ops 6848014653
nr_reads 6842990129 nr_writes 4577 nr_ops 6842994706
nr_reads 6862298832 nr_writes 2875 nr_ops 6862301707
nr_reads 6849848255 nr_writes 4292 nr_ops 6849852547
nr_reads 6846387545 nr_writes 4975 nr_ops 6846392520
nr_reads 6860547626 nr_writes 3376 nr_ops 6860551002
nr_reads 6853028794 nr_writes 2784 nr_ops 6853031578
nr_reads 6846021299 nr_writes 3383 nr_ops 6846024682
nr_reads 6833359957 nr_writes 5917 nr_ops 6833365874
nr_reads 6851224193 nr_writes 2432 nr_ops 6851226625

Comparing and contrasting with 14.04:
=====================================

Test Parameters:
================

Ubuntu 14.04.6 LTS
KVM, 2 vcpus
0.7.12 liburcu
3.13.0-170-generic

ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu 6 2 10
nr_reads 284080749 nr_writes 790657 nr_ops 284871406
nr_reads 283785838 nr_writes 647058 nr_ops 284432896
nr_reads 273424217 nr_writes 1535098 nr_ops 274959315
nr_reads 283550711 nr_writes 1442548 nr_ops 284993259
nr_reads 282557773 nr_writes 946106 nr_ops 283503879
nr_reads 286811777 nr_writes 837176 nr_ops 287648953
nr_reads 273278986 nr_writes 1738549 nr_ops 275017535
nr_reads 287141686 nr_writes 652772 nr_ops 287794458
nr_reads 287697411 nr_writes 982440 nr_ops 288679851
nr_reads 281468419 nr_writes 830736 nr_ops 282299155

ubuntu@ubuntu:~/userspace-rcu/tests$ ./test_urcu_bp 6 2 10
nr_reads 670447719 nr_writes 16731 nr_ops 670464450
nr_reads 670464435 nr_writes 9970 nr_ops 670474405
nr_reads 670235233 nr_writes 4932 nr_ops 670240165
nr_reads 670853867 nr_writes 6845 nr_ops 670860712
nr_reads 670970962 nr_writes 307 nr_ops 670971269
nr_reads 670346111 nr_writes 8161 nr_ops 670354272
nr_reads 669748209 nr_writes 6824 nr_ops 669755033
nr_reads 671242419 nr_writes 249 nr_ops 671242668
nr_reads 670318007 nr_writes 8990 nr_ops 670326997
nr_reads 669872685 nr_writes 269 nr_ops 669872954

Analysis
========

From the two Bionic tests, nr_ops for test_urcu goes from 6065490239 (unpatched) to 4097670026 (patched), a roughly one-third drop in raw throughput. However, comparing against Focal, the patched result is in line with what you would expect there: 4097670026 vs 4270091274.

For test_urcu_bp, the two Bionic tests show a dramatic difference: nr_ops goes from 644340387 (unpatched) to 6530217203 (patched), roughly a 10x improvement. These numbers are again in line with what you would expect on Focal, which reached 6848014653 operations. Comparing to Trusty, we see a substantial performance improvement across the board.

The next question is whether this benchmark is an appropriate demonstration of performance. Since the SRU changes which sys_membarrier command is used, we should also profile the syscall itself, as that better reflects behaviour in real workloads. Because the unpatched version blocks in sys_membarrier, we would expect the syscall to be invoked far less often.
Perf Performance Analysis on "sys_enter_membarrier" Tracepoint
==============================================================

No changes to source code
=========================

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads 5641721906 nr_writes 932 nr_ops 5641722838
607 syscalls:sys_enter_membarrier
nr_reads 6168632959 nr_writes 248 nr_ops 6168633207
595 syscalls:sys_enter_membarrier
nr_reads 6481069225 nr_writes 185 nr_ops 6481069410
567 syscalls:sys_enter_membarrier

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads 644124499 nr_writes 501 nr_ops 644125000
1 syscalls:sys_enter_membarrier
nr_reads 646275413 nr_writes 2287 nr_ops 646277700
1 syscalls:sys_enter_membarrier
nr_reads 644021303 nr_writes 494 nr_ops 644021797
1 syscalls:sys_enter_membarrier

Commits c0bb9f and 374530 patched in
====================================

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu 6 2 10
nr_reads 4322995476 nr_writes 3320 nr_ops 4322998796
835874 syscalls:sys_enter_membarrier
nr_reads 4210380395 nr_writes 2206 nr_ops 4210382601
883042 syscalls:sys_enter_membarrier
nr_reads 4233636203 nr_writes 3280 nr_ops 4233639483
867184 syscalls:sys_enter_membarrier

# perf stat -e 'syscalls:sys_enter_membarrier' -a ./test_urcu_bp 6 2 10
nr_reads 6539807379 nr_writes 5289 nr_ops 6539812668
10578 syscalls:sys_enter_membarrier
nr_reads 6500401303 nr_writes 13287 nr_ops 6500414590
26574 syscalls:sys_enter_membarrier
nr_reads 6518640060 nr_writes 8780 nr_ops 6518648840
17560 syscalls:sys_enter_membarrier

Analysis
========

Now, this is some interesting data. With unchanged Bionic source code, we see 607 sys_membarrier syscalls over 10 seconds for test_urcu, and 1 for test_urcu_bp. In reality that 1 is effectively 0, due to commit [1] 64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6, which removed the use of sys_membarrier in urcu-bp because of the major performance problems blocking syscalls caused in LTTng.

[1] https://github.com/urcu/userspace-rcu/commit/64478021edcf7a5ac3bca3fa9e8b8108d2fbb9b6
(note this was backported to the 0.10.1 stable release, and is in Bionic)

Looking at the patched version, the test_urcu sys_membarrier syscall count skyrockets to 835874, roughly a 1377x increase. We went from about 60 syscalls/sec to about 83,587 syscalls/sec, which more or less demonstrates that the patched liburcu no longer blocks in the syscall: each call exits quickly, so far more of them can be issued.

The patches also re-enable the use of sys_membarrier in the urcu-bp variant, where we see the syscall invoked on the order of 10,000 - 20,000 times over 10 seconds. This is behind the massive 10x increase in the number of operations the test performed: the read side went from executing userspace memory barriers on every operation to relying on the writer's kernel membarrier syscalls for ordering, which is much faster overall.

Conclusion
==========

This SRU changes liburcu to use the MEMBARRIER_CMD_PRIVATE_EXPEDITED command of the sys_membarrier syscall instead of the previous MEMBARRIER_CMD_SHARED command. MEMBARRIER_CMD_SHARED blocks, as it must wait for all threads in the system to agree on the view of memory, while MEMBARRIER_CMD_PRIVATE_EXPEDITED only requires agreement between the threads of the calling process and is guaranteed to never block. With the non-blocking behaviour, sys_membarrier completes much more quickly and can be issued many more times per second than the previous, blocking implementation.
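
For reference, a minimal sketch of how the two commands are invoked directly. The command constants come from the kernel UAPI header <linux/membarrier.h>; the membarrier() wrapper function is my own (there is no glibc wrapper on these releases), and the expedited command needs a 4.14+ kernel plus a one-time per-process registration.

/* membarrier_sketch.c - illustrative, not liburcu code */
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int membarrier(int cmd, unsigned int flags)
{
	/* Raw syscall; no glibc wrapper is assumed to exist. */
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	/* Old behaviour: waits until every thread on the system has passed
	 * through a full memory barrier, so it can block for a long time
	 * on a busy machine. */
	if (membarrier(MEMBARRIER_CMD_SHARED, 0) < 0)
		perror("MEMBARRIER_CMD_SHARED");

	/* New behaviour: the process registers its intent once... */
	if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) < 0)
		perror("MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED");

	/* ...and subsequent calls only interrupt the CPUs currently running
	 * threads of this process, then return without blocking. */
	if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) < 0)
		perror("MEMBARRIER_CMD_PRIVATE_EXPEDITED");

	return 0;
}

As I understand the patches, liburcu performs the registration once during initialisation and then issues the expedited command from its grace-period code, only falling back to the older behaviour when the expedited command is unavailable.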
For most workloads, not getting stuck on a blocking call to sys_membarrier should improve application performance. While the benchmark does show a roughly one-third drop in operations for the normal urcu variant, that result is in line with what you would expect from the current state of the art in Focal. I believe this SRU is a net benefit to the performance of liburcu.