[libc-commits] [clang] [compiler-rt] [flang] [libc] [libcxx] [lldb] [llvm] [RFC][Clang] Add __int256/__uint256 builtin types (PR #182733)
Xavier Roche via libc-commits
libc-commits at lists.llvm.org
Thu Feb 26 11:49:41 PST 2026
xroche wrote:
You're right -- I verified this and the vectorizer handles the loop case well.
With `-O2 -march=native` (GFNI + AVX2), the `uint64_t[4]` loop and the manually unrolled version both produce identical vectorized code: `vpxor ymm` + `vgf2p8affineqb` + `vpshufb` + `vpsadbw` + horizontal reduction. The `__uint256_t` version actually produces *worse* code: 4x scalar `popcntq` + `addl`, because the value lives in GPRs, not vector registers.
With AVX-512 VPOPCNTDQ, same story: the loop gets `vpxor ymm` + `vpopcntq ymm` + `vpmovqb` + `vpsadbw` (8 instructions), while `__uint256_t` stays scalar (11 instructions).
The 18% speedup I measured was a red herring -- scalar `popcntq` happened to be faster than the GFNI-based vector popcount path on the specific test CPU, not a real advantage of the type.
I'll remove the Hamming distance claim from the PR description. The stronger motivation for `__int256` is arithmetic ergonomics and performance vs `_BitInt(256)` (3x for add/sub/bitwise, 1.5x for division), not SIMD popcount.
https://github.com/llvm/llvm-project/pull/182733