[Openmp-commits] [openmp] [OpenMP] Use distributed fork/join barrier for large teams by default (PR #195473)

Sat May 2 11:08:56 PDT 2026

https://github.com/kimwalisch created https://github.com/llvm/llvm-project/pull/195473

Hi,

This is a fix for an LLVM OpenMP performance issue for short computations (≤ 1 second) on many-core systems with ≥ 32 CPU cores. I posted a detailed bug report with benchmark numbers against GCC's OpenMP library here: https://github.com/llvm/llvm-project/issues/195239.

With the help of an AI agent (Codex, GPT-5.5 at "Extra High"), I found and fixed this performance issue in the LLVM OpenMP library code. A detailed description of the fix follows below. I have personally tested the fix on my dual-socket AMD Zen2 machine with 96 cores (192 threads), and it does resolve the performance issue without causing any other problems in my tests. I am not an LLVM OpenMP expert, so I suggest that a knowledgeable human review this pull request before it is merged.

-----------------------------------------------------------------------------------------------------------------------

## Improve default libomp fork/join barrier for short large-team workloads

This patch changes libomp’s default fork/join barrier selection for large teams using finite blocktime. Today the default `hyper,hyper` fork/join barrier can perform poorly when many worker threads have yielded or gone to sleep: releasing the next parallel region can become a cascaded wake-up through the barrier tree.

For default teams of at least 32 threads, this patch uses the distributed fork/join barrier instead. The distributed barrier wakes sleeping workers more directly and avoids the large context-switch spike seen in short many-core workloads.

The change is intentionally conservative. It only applies when:

- the effective default team size is at least 32 threads,
- blocktime is finite,
- the user has not explicitly selected a barrier pattern,
- the user has not selected active/infinite waiting,
- the effective default team size has not been constrained below the threshold by settings such as `OMP_NUM_THREADS` or `OMP_THREAD_LIMIT`,
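- the runtime is not using hardware-subset style narrowing such as `KMP_HW_SUBSET` (the patch also checks `KMP_PLACE_THREADS` and `GOMP_CPU_AFFINITY`; this check applies only when no default team size has been set explicitly).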

Explicit user settings still win.
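
The conditions above can be summarized as a single predicate. The following is a minimal C++ sketch of that decision, not the libomp code itself: the struct `BarrierPolicyInputs`, its fields, and the function `use_dist_forkjoin_barrier` are hypothetical names introduced here for illustration, and the sketch leaves out the patch's additional check that the platform's default barrier pattern is still in place.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified model of the state the patch consults.
// None of these names exist in libomp; they only stand in for runtime state.
struct BarrierPolicyInputs {
  int requested_team_size;    // e.g. from OMP_NUM_THREADS; <= 0 means "not set"
  int default_team_size_ub;   // team size the runtime would otherwise default to
  int thread_limit;           // e.g. from OMP_THREAD_LIMIT; <= 0 means "no limit"
  bool blocktime_is_infinite; // true for active/infinite waiting
  bool barrier_env_set;       // any barrier or barrier-pattern variable was set
  bool hw_subset_style_set;   // e.g. KMP_HW_SUBSET narrows the machine
};

// Threshold taken from the patch description.
constexpr int kDistBarrierMinThreads = 32;

// Returns true when the sketch would pick the distributed fork/join barrier.
bool use_dist_forkjoin_barrier(const BarrierPolicyInputs &in) {
  assert(in.default_team_size_ub > 0);

  // Explicit barrier settings and infinite blocktime keep the old default.
  if (in.barrier_env_set || in.blocktime_is_infinite)
    return false;

  int nth = in.requested_team_size;
  if (nth <= 0) {
    // Without an explicit team size, a narrowed machine keeps the old default.
    if (in.hw_subset_style_set)
      return false;
    nth = in.default_team_size_ub;
  }

  // A thread limit can shrink the effective team below the threshold.
  if (in.thread_limit > 0)
    nth = std::min(nth, in.thread_limit);

  return nth >= kDistBarrierMinThreads;
}
```

For example, an input with `requested_team_size = 32` and all flags false yields `true`, while `requested_team_size = 8`, or no requested size with `thread_limit = 31`, yields `false`, matching the cases in the regression test.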

## Motivation

I observed a significant performance issue with LLVM OpenMP on a dual-socket AMD Zen2 machine with 96 cores / 192 hardware threads. The workload is `primecount 1e17`, which runs several short OpenMP parallel regions. With the old LLVM OpenMP default, the program behaves much closer to passive waiting than active waiting and produces a very large number of context switches.

Old LLVM OpenMP behavior from the original report:

| Command | Mean time |
|---|---:|
| `./primecount-clang 1e17` | `503.9 ms ± 53.3 ms` |
| `OMP_WAIT_POLICY=PASSIVE ./primecount-clang 1e17` | `483.8 ms ± 23.5 ms` |
| `OMP_WAIT_POLICY=ACTIVE ./primecount-clang 1e17` | `337.3 ms ± 4.1 ms` |

Linux `perf stat` for the old default showed:

| Metric | Old default |
|---|---:|
| context switches | `326,828` |
| elapsed time | `0.605154200 s` |

Using `OMP_WAIT_POLICY=ACTIVE` reduced context switches to `579`, suggesting that the default finite-blocktime wake-up path was the main issue.
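
The workload shape described above, many short parallel regions executed back to back, can be approximated with a small C++ loop. This is only an illustrative sketch and not primecount's code: `run_short_regions` and `RegionRunResult` are hypothetical names, and whether this loop reproduces the slowdown on a given machine is an assumption that would need to be measured. Build with `-fopenmp`; without it the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of one run: a checksum to validate the work and the wall time.
struct RegionRunResult {
  std::uint64_t checksum;
  double elapsed_ms;
};

// Executes `regions` short parallel regions one after another.
// Every outer iteration forks and joins one parallel region, so the
// fork/join barrier is exercised once per iteration.
RegionRunResult run_short_regions(int regions, int items_per_region) {
  assert(regions >= 0 && items_per_region >= 0);

  std::vector<std::uint64_t> data(static_cast<std::size_t>(items_per_region), 1);
  std::uint64_t checksum = 0;

  auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < regions; ++r) {
    std::uint64_t partial = 0;
#pragma omp parallel for reduction(+ : partial)
    for (int i = 0; i < items_per_region; ++i)
      partial += data[static_cast<std::size_t>(i)];
    checksum += partial; // serial section between two parallel regions
  }
  auto stop = std::chrono::steady_clock::now();

  double ms = std::chrono::duration<double, std::milli>(stop - start).count();
  return RegionRunResult{checksum, ms};
}
```

Calling `run_short_regions` from a small driver and running it under `perf stat`, once with the default settings and once with `OMP_WAIT_POLICY=ACTIVE`, would be one way to compare context-switch counts on other machines.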

## Results With This Patch

On the same workload, the new default is close to the active-waiting result while still using finite blocktime.

Local benchmark with patched libomp:

| Command | Mean time |
|---|---:|
| `./primecount-clang 1e17` | `344.8 ms ± 6.5 ms` |
| `OMP_WAIT_POLICY=ACTIVE ./primecount-clang 1e17` | `338.8 ms ± 2.7 ms` |
| `OMP_WAIT_POLICY=PASSIVE ./primecount-clang 1e17` | `340.8 ms ± 1.8 ms` |
| old fork/join barrier forced with `KMP_*_BARRIER_PATTERN=hyper,hyper` | `573.8 ms ± 43.8 ms` |

Linux `perf stat -r 3` comparison:

| Configuration | Context switches | CPU migrations | Elapsed time |
|---|---:|---:|---:|
| new default | `693` | `192` | `0.356731095 s` |
| old `hyper,hyper` barrier forced | `600,340` | `644` | `0.600528926 s` |

So for this workload, the new default reduces context switches by a factor of roughly 870 (600,340 vs. 693) and restores performance close to `OMP_WAIT_POLICY=ACTIVE`.

I also ran benchmarks with `primecount 1e16 --threads=N` and found that `dist,dist` is roughly neutral for small teams and clearly faster from about 16 threads upward on this machine (dual-socket AMD Zen2 with 96 cores / 192 hardware threads, OS: Ubuntu 26.04). This patch uses 32 threads as a conservative cutoff.

## Why This Is Guarded

The distributed fork/join barrier is beneficial for large sleeping teams, but it can have higher overhead for small teams. Therefore this patch does not make an unconditional global default change. Small-team configurations such as `OMP_NUM_THREADS=8` or `OMP_THREAD_LIMIT=31` keep the previous `hyper,hyper` behavior.

## Tests

Added a regression test covering the default fork/join barrier selection policy:

- `OMP_NUM_THREADS=32` selects `dist,dist`,
- `OMP_NUM_THREADS=8` keeps `hyper,hyper`,
- `OMP_WAIT_POLICY=ACTIVE` keeps `hyper,hyper`,
- `OMP_THREAD_LIMIT=31` keeps `hyper,hyper`,
- explicit `KMP_FORKJOIN_BARRIER_PATTERN=hyper,hyper` is respected.

Also ran:

- `llvm-lit openmp/runtime/test/barrier`
- `llvm-lit openmp/runtime/test/env/omp_wait_policy.c`


From cdbb27234d6946243b7ef7c65c1c7ff23cc11379 Mon Sep 17 00:00:00 2001
From: kimwalisch <kim.walisch at gmail.com>
Date: Sat, 2 May 2026 18:57:52 +0200
Subject: [PATCH] [libomp] Use dist barrier for large teams

---
 openmp/runtime/src/kmp_settings.cpp           | 55 +++++++++++++++++++
 .../test/env/kmp_forkjoin_barrier_default.c   | 26 +++++++++
 2 files changed, 81 insertions(+)
 create mode 100644 openmp/runtime/test/env/kmp_forkjoin_barrier_default.c

diff --git a/openmp/runtime/src/kmp_settings.cpp b/openmp/runtime/src/kmp_settings.cpp
index 66ef6f8097dce..f823b92e45961 100644
--- a/openmp/runtime/src/kmp_settings.cpp
+++ b/openmp/runtime/src/kmp_settings.cpp
@@ -5787,6 +5787,58 @@ static inline kmp_setting_t *__kmp_stg_find(char const *name) {
 
 } // __kmp_stg_find
 
+static inline bool __kmp_stg_is_set(char const *name) {
+  kmp_setting_t *setting = __kmp_stg_find(name);
+  return setting && setting->set;
+}
+
+static inline bool __kmp_stg_restricts_default_team_size() {
+  return __kmp_stg_is_set("KMP_HW_SUBSET") ||
+         __kmp_stg_is_set("KMP_PLACE_THREADS") ||
+         __kmp_stg_is_set("GOMP_CPU_AFFINITY");
+}
+
+static void __kmp_stg_apply_default_barrier_patterns() {
+  // The distributed fork/join barrier wakes large sleeping teams without a
+  // tree-release cascade. Keep smaller teams on the lower-overhead hyper
+  // barrier.
+  constexpr int forkjoin_dist_barrier_min_threads = 32;
+
+  bool barrier_env =
+      __kmp_stg_is_set("KMP_PLAIN_BARRIER") ||
+      __kmp_stg_is_set("KMP_PLAIN_BARRIER_PATTERN") ||
+      __kmp_stg_is_set("KMP_FORKJOIN_BARRIER") ||
+      __kmp_stg_is_set("KMP_FORKJOIN_BARRIER_PATTERN");
+#if KMP_FAST_REDUCTION_BARRIER
+  barrier_env = barrier_env || __kmp_stg_is_set("KMP_REDUCTION_BARRIER") ||
+                __kmp_stg_is_set("KMP_REDUCTION_BARRIER_PATTERN");
+#endif
+  if (barrier_env || __kmp_dflt_blocktime == KMP_MAX_BLOCKTIME)
+    return;
+
+  // Leave non-default platform selections, such as the KNC hierarchical
+  // barrier, and explicit settings untouched.
+  if (__kmp_barrier_gather_pattern[bs_forkjoin_barrier] !=
+          __kmp_barrier_gather_pat_dflt ||
+      __kmp_barrier_release_pattern[bs_forkjoin_barrier] !=
+          __kmp_barrier_release_pat_dflt)
+    return;
+
+  int nth = __kmp_dflt_team_nth;
+  if (nth <= 0) {
+    if (__kmp_stg_restricts_default_team_size())
+      return;
+    nth = __kmp_dflt_team_nth_ub;
+  }
+  if (__kmp_cg_max_nth > 0)
+    nth = KMP_MIN(nth, __kmp_cg_max_nth);
+  if (nth < forkjoin_dist_barrier_min_threads)
+    return;
+
+  __kmp_barrier_gather_pattern[bs_forkjoin_barrier] = bp_dist_bar;
+  __kmp_barrier_release_pattern[bs_forkjoin_barrier] = bp_dist_bar;
+}
+
 static int __kmp_stg_cmp(void const *_a, void const *_b) {
   const kmp_setting_t *a = RCAST(const kmp_setting_t *, _a);
   const kmp_setting_t *b = RCAST(const kmp_setting_t *, _b);
@@ -6249,6 +6301,9 @@ void __kmp_env_initialize(char const *string) {
   for (i = 0; i < block.count; ++i) {
     __kmp_stg_parse(block.vars[i].name, block.vars[i].value);
   }
+  if (string == NULL) {
+    __kmp_stg_apply_default_barrier_patterns();
+  }
 
   // If user locks have been allocated yet, don't reset the lock vptr table.
   if (!__kmp_init_user_locks) {
diff --git a/openmp/runtime/test/env/kmp_forkjoin_barrier_default.c b/openmp/runtime/test/env/kmp_forkjoin_barrier_default.c
new file mode 100644
index 0000000000000..41492df1f6e03
--- /dev/null
+++ b/openmp/runtime/test/env/kmp_forkjoin_barrier_default.c
@@ -0,0 +1,26 @@
+// RUN: %libomp-compile
+// RUN: env KMP_SETTINGS=1 OMP_NUM_THREADS=32 %libomp-run 2>&1 \
+// RUN:   | FileCheck --check-prefix=LARGE %s
+// RUN: env KMP_SETTINGS=1 OMP_NUM_THREADS=8 %libomp-run 2>&1 \
+// RUN:   | FileCheck --check-prefix=SMALL %s
+// RUN: env KMP_SETTINGS=1 OMP_NUM_THREADS=32 OMP_WAIT_POLICY=ACTIVE \
+// RUN:   %libomp-run 2>&1 | FileCheck --check-prefix=ACTIVE %s
+// RUN: env KMP_SETTINGS=1 OMP_THREAD_LIMIT=31 %libomp-run 2>&1 \
+// RUN:   | FileCheck --check-prefix=LIMIT %s
+// RUN: env KMP_SETTINGS=1 OMP_NUM_THREADS=32 \
+// RUN:   KMP_FORKJOIN_BARRIER_PATTERN=hyper,hyper %libomp-run 2>&1 \
+// RUN:   | FileCheck --check-prefix=EXPLICIT %s
+
+#include <omp.h>
+#include <stdio.h>
+
+int main() {
+  printf("%d\n", omp_get_max_threads());
+  return 0;
+}
+
+// LARGE: KMP_FORKJOIN_BARRIER_PATTERN='dist,dist'
+// SMALL: KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
+// ACTIVE: KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
+// LIMIT: KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
+// EXPLICIT: KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'