[clang] [Clang] Allow VDBPSADBW intrinsics in constexpr (PR #188887)

Sun Mar 29 14:22:06 PDT 2026

pierluigilenoci wrote:

I've pushed a complete rewrite of the VDBPSADBW constexpr algorithm. The previous implementation was fundamentally wrong: it used only 2 of the 4 two-bit imm8 fields and structured the SAD computation incorrectly.

**Root cause**: The old code extracted just `BlockOffsetA = (imm & 0x3) * 4` and `BlockOffsetB = ((imm >> 2) & 0x3) * 4`, then compared fixed blocks from src1 against sliding windows in src2. The real instruction does the opposite: it shuffles src2 using all four 2-bit fields, then computes sliding SADs between src1 and the shuffled result.

**Correct algorithm** (verified against GCC's reference in `gcc/testsuite/gcc.target/i386/avx512bw-vdbpsadbw-2.c`):

**Phase 1 (shuffle src2)**: Within each 128-bit lane, for each group j (0..3), the 2-bit field `(imm >> (2*j)) & 3` selects which 4-byte block of src2 is copied to byte offset `4*j` of a temporary array `tmp`.
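As a minimal sketch of this shuffle for a single 128-bit lane (the helper name `shuffle_lane` and the use of `std::array` are illustrative assumptions, not the code in the PR):

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper, not the PR's code: shuffle one 128-bit lane of src2.
// For each group j, the 2-bit field (imm >> (2*j)) & 3 picks the 4-byte
// source block that is copied to byte offset 4*j of the result.
inline std::array<std::uint8_t, 16>
shuffle_lane(const std::array<std::uint8_t, 16> &src2, unsigned imm) {
  std::array<std::uint8_t, 16> tmp{};
  for (unsigned j = 0; j < 4; ++j) {
    unsigned sel = (imm >> (2 * j)) & 3; // index of the selected 4-byte block
    for (unsigned b = 0; b < 4; ++b)
      tmp[4 * j + b] = src2[4 * sel + b];
  }
  return tmp;
}
```

With `imm = 4` the four fields are 0, 1, 0, 0, so the bytes `[1..16]` become `[1,2,3,4, 5,6,7,8, 1,2,3,4, 1,2,3,4]`; `imm = 0xE4` (fields 0, 1, 2, 3) leaves the lane unchanged.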

**Phase 2 (sliding SAD)**: For each group of 4 output u16 values starting at index i (i stepping by 4), with src1 and the temporary array indexed by byte:
```
dst[i]   = Σ|src1[2i+j]   - tmp[2i+j]  |  for j=0..3
dst[i+1] = Σ|src1[2i+j]   - tmp[2i+j+1]|  for j=0..3
dst[i+2] = Σ|src1[2i+j+4] - tmp[2i+j+2]|  for j=0..3
dst[i+3] = Σ|src1[2i+j+4] - tmp[2i+j+3]|  for j=0..3
```

**Verification**: `_mm_dbsad_epu8([0..15], [1..16], 4)` now produces `[4, 8, 4, 0, 28, 28, 44, 44]`, matching the hardware output from your earlier test.
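For anyone who wants to reproduce that check by hand, here is a small standalone model of one 128-bit lane that follows the two phases described above. It is only a sketch: the names `Bytes`, `Words`, `dbsad_lane`, and `sad4` are made up for illustration and do not correspond to the code in `ExprConstant.cpp` or `InterpBuiltin.cpp`.

```cpp
#include <array>
#include <cstdint>

using Bytes = std::array<std::uint8_t, 16>;  // one 128-bit lane of u8
using Words = std::array<std::uint16_t, 8>;  // one 128-bit lane of u16

// Hypothetical reference model for a single lane (illustrative names only).
inline Words dbsad_lane(const Bytes &src1, const Bytes &src2, unsigned imm) {
  // Phase 1: build tmp by selecting one 4-byte block of src2 per group j.
  Bytes tmp{};
  for (unsigned j = 0; j < 4; ++j) {
    unsigned sel = (imm >> (2 * j)) & 3;
    for (unsigned b = 0; b < 4; ++b)
      tmp[4 * j + b] = src2[4 * sel + b];
  }

  // Sum of absolute differences of 4 bytes: src1[a..a+3] vs tmp[b..b+3].
  auto sad4 = [&](unsigned a, unsigned b) {
    unsigned sum = 0;
    for (unsigned j = 0; j < 4; ++j) {
      int d = int(src1[a + j]) - int(tmp[b + j]);
      sum += unsigned(d < 0 ? -d : d);
    }
    return static_cast<std::uint16_t>(sum);
  };

  // Phase 2: four sliding SADs per 64-bit half of the lane.
  Words dst{};
  for (unsigned i = 0; i < 8; i += 4) {
    unsigned base = 2 * i; // byte offset of this 64-bit half
    dst[i]     = sad4(base,     base);
    dst[i + 1] = sad4(base,     base + 1);
    dst[i + 2] = sad4(base + 4, base + 2);
    dst[i + 3] = sad4(base + 4, base + 3);
  }
  return dst;
}
```

Calling `dbsad_lane` with `src1 = [0..15]`, `src2 = [1..16]`, and `imm = 4` gives `[4, 8, 4, 0, 28, 28, 44, 44]`, the same vector quoted above.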

Both `ExprConstant.cpp` and `InterpBuiltin.cpp` are rewritten with the same corrected algorithm, and all `TEST_CONSTEXPR` expected values have been recomputed.

https://github.com/llvm/llvm-project/pull/188887