[Openmp-commits] [openmp] [OpenMP] Fix hyper barrier performance issue (PR #195473)
via Openmp-commits
openmp-commits at lists.llvm.org
Thu May 7 01:52:55 PDT 2026
jprotze wrote:
I tried to reproduce the problem on one of our cluster nodes (Intel(R) Xeon(R) Platinum 8468, dual-socket, 96 cores, HT off). My default configuration has `OMP_PLACES=cores OMP_PROC_BIND=close`, because that's optimal for most of my workloads. With thread binding enabled, I could not reproduce the described performance difference between active/passive wait.
When setting `OMP_PROC_BIND=false`, I could see a significant performance drop for all policies, which is why we (in HPC) always recommend pinning threads/processes to cores for performance.
The main performance difference between GCC and Clang that I saw in my experiments results from GCC executing ~4% fewer instructions (Clang executes 30-65% more branches), partially compensated by a slightly higher IPC.
GCC/14.3.0:
```
17.656,05 msec task-clock:u # 73,759 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
40.653 page-faults:u # 2,302 K/sec
52.982.785.482 cycles:u # 3,001 GHz
146.792.687.082 instructions:u # 2,77 insn per cycle
11.920.113.395 branches:u # 675,129 M/sec
238.728.687 branch-misses:u # 2,00% of all branches
```
Clang/HEAD, no `OMP_WAIT_POLICY`:
```
18.981,83 msec task-clock:u # 58,200 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
41.097 page-faults:u # 2,165 K/sec
57.265.771.521 cycles:u # 3,017 GHz
162.841.765.848 instructions:u # 2,84 insn per cycle
18.606.637.259 branches:u # 980,234 M/sec
237.372.710 branch-misses:u # 1,28% of all branches
```
Clang/HEAD, `OMP_WAIT_POLICY=passive`:
```
15.524,58 msec task-clock:u # 54,839 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
40.926 page-faults:u # 2,636 K/sec
46.758.571.626 cycles:u # 3,012 GHz
153.386.014.126 instructions:u # 3,28 insn per cycle
15.624.665.101 branches:u # 1,006 G/sec
238.493.322 branch-misses:u # 1,53% of all branches
```
Clang/HEAD, `OMP_WAIT_POLICY=active`:
```
20.367,46 msec task-clock:u # 73,704 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
41.122 page-faults:u # 2,019 K/sec
61.551.363.097 cycles:u # 3,022 GHz
167.424.100.175 instructions:u # 2,72 insn per cycle
19.765.132.310 branches:u # 970,427 M/sec
239.176.931 branch-misses:u # 1,21% of all branches
```
The roughly 1.5x higher number of branches should not cause a performance difference (other than contributing to the instruction count), since the number of branch misses is about the same.
The difference in instruction count between active and passive is a result of spin-waiting.
Wall-clock time is ~0.27 seconds in all configurations.
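For reference, the instruction, branch and branch-miss ratios relative to GCC can be recomputed from the perf counters above. This is only a small sketch: the counts are copied by hand from the outputs in this message, and the dictionary layout and the helper name `relative_to_gcc` are made up for illustration.

```python
# Counts copied by hand from the perf stat outputs above (user-space events).
# The labels and the dictionary layout are illustrative only.
counts = {
    "GCC/14.3.0": {
        "instructions": 146_792_687_082,
        "branches": 11_920_113_395,
        "branch_misses": 238_728_687,
    },
    "Clang/HEAD (default)": {
        "instructions": 162_841_765_848,
        "branches": 18_606_637_259,
        "branch_misses": 237_372_710,
    },
    "Clang/HEAD (passive)": {
        "instructions": 153_386_014_126,
        "branches": 15_624_665_101,
        "branch_misses": 238_493_322,
    },
    "Clang/HEAD (active)": {
        "instructions": 167_424_100_175,
        "branches": 19_765_132_310,
        "branch_misses": 239_176_931,
    },
}


def relative_to_gcc(metric):
    """Return each configuration's count for `metric` divided by the GCC count."""
    base = counts["GCC/14.3.0"][metric]
    return {name: c[metric] / base for name, c in counts.items()}


for metric in ("instructions", "branches", "branch_misses"):
    print(metric)
    for name, ratio in relative_to_gcc(metric).items():
        print(f"  {name:22s} {ratio:.3f}x")
```

With these numbers, the Clang branch counts come out at roughly 1.31x (passive) to 1.66x (active) of the GCC count, the instruction counts at roughly 1.04x to 1.14x, and the branch-miss counts stay within 1% of GCC.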
As Terry mentioned, in multi-user/multi-workload environments, burning CPU cycles for no performance gain, while other processes could use these cycles, is not the preferred strategy. In my experiment, explicitly setting the policy to passive actually reduced the CPU load compared to not setting the policy.
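The CPU-load comparison can be read off the `task-clock:u` values above. A minimal sketch, again with the values copied by hand from this message and with illustrative names (`task_clock_ms`, `cpu_time_vs_default`):

```python
# task-clock:u in milliseconds, copied by hand from the perf outputs above.
# Variable and function names are illustrative only.
task_clock_ms = {
    "default": 18_981.83,  # Clang/HEAD, no OMP_WAIT_POLICY
    "passive": 15_524.58,  # Clang/HEAD, OMP_WAIT_POLICY=passive
    "active": 20_367.46,   # Clang/HEAD, OMP_WAIT_POLICY=active
}


def cpu_time_vs_default(policy):
    """CPU time of `policy` relative to the run without OMP_WAIT_POLICY."""
    return task_clock_ms[policy] / task_clock_ms["default"]


for policy in ("passive", "active"):
    change = (cpu_time_vs_default(policy) - 1.0) * 100.0
    print(f"{policy}: {change:+.1f}% CPU time vs. default")
```

For these runs, passive uses about 18% less CPU time than the default and active about 7% more, at the same wall-clock time.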
I tried to compare with icpx results, but the latest icpx crashes during CMake's try_compile when `-fiopenmp`/`-qopenmp` is used, so OpenMP gets disabled. When forcing `-DOpenMP_CXX_FLAGS=-fopenmp`, the perf stats are similar to those for clang/HEAD.
https://github.com/llvm/llvm-project/pull/195473