[libcxx-commits] [libcxx] Speed up compilation of common uses of std::visit() (PR #164196)

Fri Nov 28 08:32:13 PST 2025

higher-performance wrote:

Done.

First, re: the runtime benchmarks: I had to run them somewhat ad hoc via Google Benchmark since I don't have the official setup handy, but regardless, they actually indicate a speedup for fewer than 8 elements (2 through 7 in the numbers below; the 1-element case is essentially unchanged):

Before:
```
Benchmark                       Time(ns)      CPU(ns)  Iterations
BM_Visit<1, 1>_mean              2.13           2.13    25000000  
BM_Visit<1, 2>_mean              3.22           3.22    25000000  
BM_Visit<1, 3>_mean              3.20           3.20    25000000  
BM_Visit<1, 4>_mean              3.21           3.21    25000000  
BM_Visit<1, 5>_mean              3.21           3.20    25000000  
BM_Visit<1, 6>_mean              3.22           3.22    25000000  
BM_Visit<1, 7>_mean              3.20           3.20    25000000  
BM_Visit<1, 8>_mean              3.21           3.21    25000000
```

After:
```
Benchmark                       Time(ns)      CPU(ns)  Iterations
BM_Visit<1, 1>_mean              2.19           2.19    25000000  
BM_Visit<1, 2>_mean              2.20           2.20    25000000  
BM_Visit<1, 3>_mean              2.18           2.18    25000000  
BM_Visit<1, 4>_mean              2.18           2.18    25000000  
BM_Visit<1, 5>_mean              2.22           2.22    25000000  
BM_Visit<1, 6>_mean              2.19           2.19    25000000  
BM_Visit<1, 7>_mean              2.19           2.19    25000000  
BM_Visit<1, 8>_mean              3.27           3.27    25000000  
```
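For anyone who wants to sanity-check this without Google Benchmark, here is a rough `std::chrono` sketch of an ad-hoc measurement. It is not the benchmark I ran: `visit_sum` and `ns_per_visit` are hypothetical names, the three-alternative variant is just an example, and the loop times `emplace` plus `std::visit` together, so its numbers are not comparable to the tables above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <variant>

// Hypothetical stand-in for one benchmark case: a variant with three
// alternatives, visited `iterations` times. Returns the accumulated sum so the
// result can be checked and the work is not trivially dead.
inline unsigned long long visit_sum(std::size_t iterations) {
  std::variant<char, unsigned char, int> v;
  unsigned long long sum = 0;
  for (std::size_t i = 0; i < iterations; ++i) {
    // Cycle through the alternatives so the active index is not constant.
    switch (i % 3) {
      case 0: v.emplace<0>(static_cast<char>(1)); break;
      case 1: v.emplace<1>(static_cast<unsigned char>(2)); break;
      default: v.emplace<2>(3); break;
    }
    std::visit([&](auto x) { sum += static_cast<unsigned long long>(x); }, v);
  }
  return sum;
}

// Average wall-clock nanoseconds per loop iteration (emplace + visit).
inline double ns_per_visit(std::size_t iterations) {
  const auto start = std::chrono::steady_clock::now();
  volatile unsigned long long sink = visit_sum(iterations);
  const auto stop = std::chrono::steady_clock::now();
  (void)sink;
  return std::chrono::duration<double, std::nano>(stop - start).count() /
         static_cast<double>(iterations);
}
```

An optimizer may still simplify a loop like this considerably, which is one reason a real harness such as Google Benchmark is preferable for the actual numbers.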

As for compile-time benchmarking, I also tested it like this:

```
#include <variant>

int main(int argc, char* argv[]) {
  std::variant<char, unsigned char, int> v;
  v.emplace<0>(3);
  int n = 0;
  unsigned int r = 1;
#define X(V) \
  ++n;       \
  std::visit([&](int x) { r *= x; }, V)
  (void)--n, X(v);
  // Despite its name, NEW_VERSION only enables the 64 extra calls below.
#ifdef NEW_VERSION
  // clang-format off
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  X(v); X(v); X(v); X(v); X(v); X(v); X(v); X(v);
  // clang-format on
#else
  (void)v;
#endif
#undef X

  return r % 1000 == 1 ? -1 : n;
}
```
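The repeated `X(v)` lines are there because every lambda expression has its own closure type, so each expansion should be a separate `std::visit` instantiation rather than a reuse of the first one. A minimal sketch of that property, with hypothetical helper names that are not part of the PR:

```cpp
#include <cassert>
#include <type_traits>
#include <variant>

// Illustrative helper (not part of the PR): two textually identical lambdas
// still have distinct closure types.
inline bool closure_types_differ() {
  unsigned int r = 1;
  auto a = [&](int x) { r *= x; };
  auto b = [&](int x) { r *= x; };
  return !std::is_same_v<decltype(a), decltype(b)>;
}

// Mirrors two expansions of X(v) without the counter: the variant holds
// char(3), which converts to the lambda's int parameter, so r becomes 3 * 3.
inline unsigned int visit_twice() {
  std::variant<char, unsigned char, int> v;
  v.emplace<0>(3);
  unsigned int r = 1;
  std::visit([&](int x) { r *= x; }, v);
  std::visit([&](int x) { r *= x; }, v);
  return r;
}
```

Here `closure_types_differ()` returns `true` and `visit_twice()` returns `9`.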

Under `-O3` I got:

- Baseline (a single `std::visit` call): 5216 bytes
- 64 extra calls (new implementation): 5216 bytes, +0.1 ms
- 64 extra calls (old implementation): 54104 bytes, +0.43 ms

My setup/system is a bit different from last time, so it's not quite 8x here (the added time is about 4x lower, and the old implementation comes out about 10x larger), but it's still a huge win.

**tl;dr: it's a strict win on every axis I measured.** @philnik777

https://github.com/llvm/llvm-project/pull/164196