[Openmp-commits] [openmp] [OpenMP] Fix hyper barrier performance issue (PR #195473)

Thu May 7 01:52:55 PDT 2026

jprotze wrote:

I tried to reproduce the problem on one of our cluster nodes (Intel(R) Xeon(R) Platinum 8468, dual-socket, 96 cores, HT off). My default configuration has `OMP_PLACES=cores OMP_PROC_BIND=close`, because that's optimal for most of my workloads. With thread binding enabled, I could not reproduce the described performance difference between the active and passive wait policies.

When setting `OMP_PROC_BIND=false`, I could see a significant performance drop for all wait policies, which is why we (in HPC) always recommend pinning threads/processes to cores for performance.

The main difference in performance between GCC and Clang that I saw in my experiments is a result of GCC using ~4% fewer instructions (Clang executes 30-65% more branches), partially compensated by a slightly higher IPC for Clang.
GCC/14.3.0:
```
         17.656,05 msec task-clock:u                     #   73,759 CPUs utilized             
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
            40.653      page-faults:u                    #    2,302 K/sec                     
    52.982.785.482      cycles:u                         #    3,001 GHz                       
   146.792.687.082      instructions:u                   #    2,77  insn per cycle            
    11.920.113.395      branches:u                       #  675,129 M/sec                     
       238.728.687      branch-misses:u                  #    2,00% of all branches           
```
Clang/HEAD, no `OMP_WAIT_POLICY`:
```
         18.981,83 msec task-clock:u                     #   58,200 CPUs utilized             
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
            41.097      page-faults:u                    #    2,165 K/sec                     
    57.265.771.521      cycles:u                         #    3,017 GHz                       
   162.841.765.848      instructions:u                   #    2,84  insn per cycle            
    18.606.637.259      branches:u                       #  980,234 M/sec                     
       237.372.710      branch-misses:u                  #    1,28% of all branches           
```
Clang/HEAD, `OMP_WAIT_POLICY=passive`:
```
         15.524,58 msec task-clock:u                     #   54,839 CPUs utilized             
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
            40.926      page-faults:u                    #    2,636 K/sec                     
    46.758.571.626      cycles:u                         #    3,012 GHz                       
   153.386.014.126      instructions:u                   #    3,28  insn per cycle            
    15.624.665.101      branches:u                       #    1,006 G/sec                     
       238.493.322      branch-misses:u                  #    1,53% of all branches           
```
Clang/HEAD, `OMP_WAIT_POLICY=active`:
```
         20.367,46 msec task-clock:u                     #   73,704 CPUs utilized             
                 0      context-switches:u               #    0,000 /sec                      
                 0      cpu-migrations:u                 #    0,000 /sec                      
            41.122      page-faults:u                    #    2,019 K/sec                     
    61.551.363.097      cycles:u                         #    3,022 GHz                       
   167.424.100.175      instructions:u                   #    2,72  insn per cycle            
    19.765.132.310      branches:u                       #  970,427 M/sec                     
       239.176.931      branch-misses:u                  #    1,21% of all branches           
```
The 1.5x higher number of branches should not cause a performance difference (other than contributing to the instruction count), since the number of branch-misses is the same.
The difference in instruction count between active and passive is a result of spin-waiting.
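
The ratios quoted above can be re-derived from the raw counters. The following is a minimal sketch, not part of the original measurements: the counter values are copied from the perf output above, while the dictionary layout and the helper names `relative_to_gcc` and `ipc` are made up for illustration.

```python
# Counter values copied from the perf stat output above (user-space counts).
# The run labels and helper names are illustrative assumptions.
runs = {
    "gcc": {
        "cycles": 52_982_785_482,
        "instructions": 146_792_687_082,
        "branches": 11_920_113_395,
        "branch_misses": 238_728_687,
    },
    "clang_default": {
        "cycles": 57_265_771_521,
        "instructions": 162_841_765_848,
        "branches": 18_606_637_259,
        "branch_misses": 237_372_710,
    },
    "clang_passive": {
        "cycles": 46_758_571_626,
        "instructions": 153_386_014_126,
        "branches": 15_624_665_101,
        "branch_misses": 238_493_322,
    },
    "clang_active": {
        "cycles": 61_551_363_097,
        "instructions": 167_424_100_175,
        "branches": 19_765_132_310,
        "branch_misses": 239_176_931,
    },
}


def relative_to_gcc(name, counter):
    """Ratio of a counter in run `name` to the same counter in the GCC run."""
    return runs[name][counter] / runs["gcc"][counter]


def ipc(name):
    """Instructions per cycle for run `name`."""
    return runs[name]["instructions"] / runs[name]["cycles"]


for name in runs:
    print(
        f"{name:14s}"
        f" instructions x{relative_to_gcc(name, 'instructions'):.3f}"
        f" branches x{relative_to_gcc(name, 'branches'):.3f}"
        f" branch-misses x{relative_to_gcc(name, 'branch_misses'):.3f}"
        f" IPC {ipc(name):.2f}"
    )
```

The branch-miss ratios stay within about 1% of the GCC run in every configuration, which is the basis for treating the extra branches as instruction-count overhead only.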

Wall-clock time is ~0.27 seconds in all configurations.
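
As a rough cross-check, `perf stat` reports "CPUs utilized" as task-clock divided by elapsed time, so the elapsed time of the whole process can be recovered from the two columns above. This is a sketch under that assumption; the function name `elapsed_seconds` and the run labels are hypothetical, and the result covers the entire process (including startup and teardown), so it need not match a timer inside the benchmark.

```python
# (task-clock in msec, CPUs utilized), copied from the perf output above.
perf_rows = {
    "gcc": (17656.05, 73.759),
    "clang_default": (18981.83, 58.200),
    "clang_passive": (15524.58, 54.839),
    "clang_active": (20367.46, 73.704),
}


def elapsed_seconds(task_clock_msec, cpus_utilized):
    """Whole-process elapsed time implied by perf's 'CPUs utilized' column."""
    return task_clock_msec / cpus_utilized / 1000.0


for name, (task_clock, cpus) in perf_rows.items():
    print(f"{name:14s} ~{elapsed_seconds(task_clock, cpus):.3f} s elapsed")
```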

As Terry mentioned, in multi-user/multi-workload environments, burning CPU cycles for no performance gain, while other processes could use these cycles, is not the preferred strategy. In my experiment, explicitly setting the policy to passive actually reduced the CPU load compared to not setting the policy.

I tried to compare with icpx results, but the latest icpx crashes during cmake's try_compile when `-fiopenmp/-qopenmp` is used, so OpenMP gets disabled. When forcing `-DOpenMP_CXX_FLAGS=-fopenmp`, the perf stats are similar to those for clang/HEAD.

https://github.com/llvm/llvm-project/pull/195473