[Openmp-commits] [openmp] [OpenMP] Fix hyper barrier performance issue (PR #195473)

Tue May 5 06:30:02 PDT 2026

kimwalisch wrote:

> Another option that just came to my mind is to investigate whether it is possible to fix this scaling issue in the default bp_hyper_bar barrier type. My feeling is that on many-core systems, by default threads erroneously do not spin at all when they encounter a barrier or lock. Based on your answer, it seems that such a fix would be more likely to be accepted.

> So I will try this first, and report back if I managed to do that.

With the help of an AI agent (Codex GPT-5.5 "Extra High"), I have found and fixed this LLVM OpenMP performance issue in the default hyper barrier. The LLVM OpenMP documentation says: "With the default runtime settings, libomp uses the `bp_hyper_bar` fork/join barrier together with finite blocktime (`KMP_BLOCKTIME=200ms`, `KMP_LIBRARY=throughput`)."

Hence, by default, threads should wait/spin for up to 200 ms before going to sleep. However, in the current implementation of the default hyper barrier, threads generally go to sleep much sooner, sometimes without spinning at all. This causes the severe LLVM OpenMP performance issue I measured on many-core systems (with up to 500x more context switches).

I have therefore discarded my previous code changes (which used the `bp_dist_bar` barrier instead of `bp_hyper_bar`) and replaced them with a fix for the default hyper barrier type.

@TerryLWilmarth What do you think of my new pull request?

# Here is a detailed description of this new hyper barrier performance fix

This patch fixes a libomp performance issue for short many-core workloads that repeatedly enter large OpenMP parallel regions.

With the default settings, libomp reports:

```text
OMP_WAIT_POLICY=PASSIVE
KMP_LIBRARY=throughput
KMP_BLOCKTIME=200ms
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
```

A natural expectation is that `KMP_BLOCKTIME=200ms` lets worker threads actively wait (spin) for roughly 200 ms before they go to sleep. For short OpenMP workloads, that should usually be enough to keep workers ready for the next parallel region.

However, in practice the default finite-blocktime wait path used by the hyper fork/join barrier does not behave like a pure 200 ms active spin. The wait loop may yield to the OS scheduler while still inside the blocktime window. On large teams this can produce a very large number of context switches, so the default behavior ends up much closer to passive waiting than to useful active waiting.

This is especially expensive for short workloads with several large parallel regions. Workers repeatedly leave and re-enter fork/join barriers, but many of them have already yielded or been descheduled, causing the next parallel region to pay a large scheduling cost.
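To make the workload shape concrete, here is a minimal sketch of a program that enters many short parallel regions back to back. It is not the `primecount` benchmark, and the function names and parameters (`run_short_regions`, `time_short_regions_ms`, `regions`, `work_per_region`) are hypothetical, chosen only for illustration. It uses only an OpenMP pragma, so it still compiles (and runs serially) without `-fopenmp`.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs `regions` short parallel regions back to back.
// Each region ends in a join barrier and the next one starts with a fork,
// which is where the worker wait behavior discussed above matters.
// Returns a checksum so the work cannot be optimized away.
std::uint64_t run_short_regions(int regions, int work_per_region) {
  std::uint64_t total = 0;
  for (int r = 0; r < regions; ++r) {
    std::uint64_t sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < work_per_region; ++i)
      sum += static_cast<std::uint64_t>(i);
    total += sum;
  }
  return total;
}

// Hypothetical helper: wall-clock time of the loop above in milliseconds.
double time_short_regions_ms(int regions, int work_per_region) {
  const auto start = std::chrono::steady_clock::now();
  volatile std::uint64_t sink = run_short_regions(regions, work_per_region);
  (void)sink;
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling something like `time_short_regions_ms(10000, 100000)` from `main` on a many-core machine, and running the binary under `perf stat -e context-switches`, is one way to look for the kind of context-switch counts described above. Whether this toy loop reproduces the effect as strongly as the real benchmark is not something I have verified.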

This patch keeps the default barrier algorithm as `hyper,hyper`. It does not switch to the distributed barrier and does not change the documented defaults for `OMP_WAIT_POLICY`, `KMP_LIBRARY`, or `KMP_BLOCKTIME`.

Instead, for large fork/join hyper barriers, this patch uses a non-sleepable wait path for the relevant hyper barrier waits. This prevents those large-team fork/join waits from entering the finite-blocktime yield/suspend machinery that causes the context-switch storm, while preserving the existing behavior for smaller teams and non-fork/join barriers.

The team-size threshold for this large-team path is currently 32 threads.

**Why This Fixes The Issue**

The original code used the ordinary sleepable `kmp_flag_64<>` wait path in the hyper fork/join barrier. Under finite blocktime, that path sets up blocktime accounting and may yield/suspend.

For large short-lived teams, that is counterproductive: the workers are expected to be needed again almost immediately by the next parallel region, but the default wait path can still involve OS scheduling activity.

The new code uses `kmp_flag_64<false, false>` for large fork/join hyper barrier waits. That keeps the wait active and avoids the finite-blocktime sleep/yield path for this specific large-team fork/join case. The generic wait helper was also adjusted so non-sleepable waits skip finite-blocktime sleep-deadline bookkeeping.

So the fix is narrow:

- default barrier remains `hyper,hyper`;
- default `KMP_BLOCKTIME` remains finite;
- small teams keep the existing behavior;
- large fork/join hyper waits avoid the problematic finite-blocktime yield/sleep machinery.
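
The idea behind these points can be sketched in a few lines of standalone C++. This is a simplified illustration, not the libomp code: `use_nonsleepable_wait`, `wait_until_set`, `BarrierKind`, `kLargeTeamThreshold`, and `spin_budget` are hypothetical names, and whether the real threshold comparison is inclusive is an assumption. The only parts taken from the description above are the threshold of 32 and the compile-time sleepable/non-sleepable split that `kmp_flag_64<false, false>` expresses.

```cpp
#include <atomic>
#include <thread>

// Hypothetical constant mirroring the 32-thread threshold mentioned above.
// Assumption: teams with at least this many threads count as "large".
constexpr int kLargeTeamThreshold = 32;

enum class BarrierKind { ForkJoin, Plain };

// Hypothetical predicate: only large fork/join waits take the
// non-sleepable path; everything else keeps the existing behavior.
constexpr bool use_nonsleepable_wait(BarrierKind kind, int team_size) {
  return kind == BarrierKind::ForkJoin && team_size >= kLargeTeamThreshold;
}

// Simplified stand-in for a flag wait. The Sleepable template parameter
// decides at compile time whether the loop may give up the CPU.
// Returns how many times the waiter yielded to the OS scheduler.
template <bool Sleepable>
int wait_until_set(const std::atomic<bool> &flag, int spin_budget) {
  int yields = 0;
  int spins = 0;
  while (!flag.load(std::memory_order_acquire)) {
    if constexpr (Sleepable) {
      // Sleepable path: after a short spin, hand the core back to the OS.
      if (++spins > spin_budget) {
        std::this_thread::yield();
        ++yields;
        spins = 0;
      }
    }
    // Non-sleepable path: keep polling the flag, no scheduler involvement.
  }
  return yields;
}
```

A caller would pick `wait_until_set<false>` when `use_nonsleepable_wait(kind, team_size)` is true and `wait_until_set<true>` otherwise. In the non-sleepable instantiation the yield branch is discarded at compile time, so the returned yield count is always zero, which corresponds to the drop in context switches the patch is aiming for.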

**Results**

Benchmark: `primecount 1e17`  
System: dual-socket AMD Zen2, 96 cores / 192 hardware threads

```text
old default:                        500.1 ms ± 52.4 ms
patched default:                    354.4 ms ±  5.1 ms
patched OMP_WAIT_POLICY=PASSIVE:    538.8 ms ± 23.9 ms
patched OMP_WAIT_POLICY=ACTIVE:     356.5 ms ±  5.1 ms
```

`perf stat -r 3` context switches:

```text
old default:              449,788 context-switches
patched default:              514 context-switches
```

Small-team sanity check, `primecount 1e16 --threads=8`:

```text
old default:              648.1 ms ± 27.3 ms
patched default:          645.7 ms ±  1.8 ms
```

This brings the default behavior for the large-team workload close to `OMP_WAIT_POLICY=ACTIVE`, without changing the global wait policy or changing the default fork/join barrier algorithm.

https://github.com/llvm/llvm-project/pull/195473