[libc-commits] [clang] [compiler-rt] [flang] [libc] [libcxx] [lldb] [llvm] [RFC][Clang] Add __int256/__uint256 builtin types (PR #182733)
Xavier Roche via libc-commits
libc-commits at lists.llvm.org
Thu Feb 26 11:49:41 PST 2026
xroche wrote:
You're right -- I verified this and the vectorizer handles the loop case well.
With `-O2 -march=native` (GFNI + AVX2), the `uint64_t[4]` loop and the manually unrolled version both produce identical vectorized code: `vpxor ymm` + `vgf2p8affineqb` + `vpshufb` + `vpsadbw` + horizontal reduction. The `__uint256_t` version actually produces *worse* code: 4x scalar `popcntq` + `addl`, because the value lives in GPRs, not vector registers.
With AVX-512 VPOPCNTDQ, same story: the loop gets `vpxor ymm` + `vpopcntq ymm` + `vpmovqb` + `vpsadbw` (8 instructions), while `__uint256_t` stays scalar (11 instructions).
The 18% speedup I measured was a red herring -- scalar `popcntq` happened to be faster than the GFNI-based vector popcount path on the specific test CPU, not a real advantage of the type.
I'll remove the Hamming distance claim from the PR description. The stronger motivation for `__int256` is arithmetic ergonomics and performance vs `_BitInt(256)` (3x for add/sub/bitwise, 1.5x for division), not SIMD popcount.
https://github.com/llvm/llvm-project/pull/182733