[llvm] [LegalizeTypes] Use vector CTPOP for wide integer popcount on VPOPCNTDQ targets (PR #182675)
via llvm-commits
llvm-commits at lists.llvm.org
Sat Feb 21 06:06:12 PST 2026
llvmbot wrote:
@llvm/pr-subscribers-backend-x86
Author: Xavier Roche (xroche)
<details>
<summary>Changes</summary>
## Summary
When expanding `CTPOP` for wide integer types (i256, i512, etc.) during type legalization, the current code recursively splits the value into halves and computes a scalar popcount (`popcntq` on x86) for each 64-bit word. On targets with AVX512-VPOPCNTDQ (Ice Lake and later), this misses the opportunity to use the efficient native vector popcount instruction.
This patch adds an optimization path in `ExpandIntRes_CTPOP` that:
1. Bitcasts the wide integer to a legal vector type (e.g., `i256` → `v4i64`)
2. Uses vector `CTPOP` when the target reports it as **Legal**
3. Reduces via a **shuffle+add pyramid** that `matchBinOpReduction` recognizes, enabling the X86 `combineArithReduction` to fold the reduction into PSADBW
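The correctness invariant behind steps 1–3 can be modeled outside the compiler: the popcount of a wide integer equals the sum of per-lane popcounts after reinterpreting it as 64-bit words. A minimal Python sketch (function name and test value are hypothetical, not from the patch):

```python
def ctpop_via_lanes(value: int, bit_width: int) -> int:
    """Model of the transform: split into 64-bit lanes (the bitcast),
    popcount each lane (vpopcntq), then horizontally sum (the reduction)."""
    assert bit_width % 64 == 0
    mask64 = (1 << 64) - 1
    lanes = [(value >> (64 * i)) & mask64 for i in range(bit_width // 64)]
    return sum(bin(lane).count("1") for lane in lanes)

# Matches a direct popcount of the whole 256-bit value.
x = (0xDEADBEEF << 200) | 0xFFFF
assert ctpop_via_lanes(x, 256) == bin(x).count("1")
```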
### Supersedes #182547
The previous PR had two problems:
- **i128 regression** (caught by @anematode): the vector path was taken for i128, where the scalar expansion (2x `popcntq` + `addl`) is already optimal. **Fixed**: added a `BitWidth >= 256` guard.
- **Suboptimal reduction**: used `EXTRACT_SUBVECTOR` + `ADD`, which `matchBinOpReduction` cannot recognize. **Fixed**: switched to a `VECTOR_SHUFFLE` + `ADD` pyramid, enabling the existing PSADBW combine (@RKSimon's suggestion).
### Codegen: `ctpop(i256)` on AVX512VL+VPOPCNTDQ (Ice Lake+)
**Before** (4x scalar popcnt):
```asm
popcntq %rcx, %rax
popcntq %rdx, %rcx
addl %eax, %ecx
popcntq %rsi, %rdx
popcntq %rdi, %rax
addl %edx, %eax
addl %ecx, %eax ; 7 instructions
```
**After** — when result is extracted as `i64` (PSADBW combine fires):
```asm
vpopcntq %ymm0, %ymm0 ; vectorized popcount
vpmovqb %ymm0, %xmm0 ; truncate v4i64→v4i8 (max 64 fits in byte)
vpsadbw %xmm1, %xmm0, %xmm0 ; horizontal byte sum
vmovq %xmm0, %rax ; 4 instructions
```
**After** — when result is truncated to `i32` (shuffle+add reduction):
```asm
vpopcntq %ymm0, %ymm0
vextracti128 $1, %ymm0, %xmm1
vpaddq %xmm1, %xmm0, %xmm0
vpshufd $238, %xmm0, %xmm1
vpaddq %xmm1, %xmm0, %xmm0
vmovd %xmm0, %eax ; 6 instructions (vs 7 before)
```
The PSADBW path fires when `combineArithReduction` sees an `EXTRACT_VECTOR_ELT(i64, ...)` feeding the reduction. When the result is truncated to `i32`, the `EXTRACT_VECTOR_ELT` type changes (to `v8i32` via bitcast), preventing `matchBinOpReduction` from matching — but the shuffle+add reduction is still an improvement over the scalar path.
### Scope & safety
- **i128 is unaffected**: `BitWidth >= 256` guard ensures scalar path (`2x popcntq + addl`)
- **Non-VPOPCNTDQ targets unaffected**: `isOperationLegal(CTPOP, v4i64)` is false → scalar path
- **AArch64 / other targets unaffected**: same → scalar path
- **Generalizes** to any width: i256 with v4i64, i512 with v8i64, etc.
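The width guard above can be sketched as a small predicate mirroring the condition in `ExpandIntRes_CTPOP` (the function name below is hypothetical; the `>= 256` and power-of-two checks are from the patch):

```python
def pick_vector_type(bit_width: int):
    """Return ("i64", num_elts) when the vector path may apply, else None."""
    is_pow2 = bit_width > 0 and (bit_width & (bit_width - 1)) == 0
    if bit_width >= 256 and is_pow2:
        return ("i64", bit_width // 64)  # e.g. v4i64 for i256, v8i64 for i512
    return None  # fall back to the scalar half-splitting expansion

assert pick_vector_type(128) is None       # i128 keeps the scalar path
assert pick_vector_type(256) == ("i64", 4)
assert pick_vector_type(512) == ("i64", 8)
```

On targets without legal vector `CTPOP` the real code additionally checks `isOperationLegal`, which this sketch omits.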
### Why shuffle+add instead of extract_subvector+add?
`matchBinOpReduction` (SelectionDAG.cpp) only matches `ISD::VECTOR_SHUFFLE` nodes, not `ISD::EXTRACT_SUBVECTOR`. By using a shuffle+add pyramid at the full vector width, the existing X86 `combineArithReduction` can recognize the pattern and emit the optimal `vmovd(vpsadbw(vpmovqb(vpopcntq())))` sequence.
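The shuffle masks built by the reduction loop in the patch can be reproduced with a short Python model (function name hypothetical; the mask-construction logic follows the loop in the diff below):

```python
def pyramid_masks(num_elts: int):
    """Shuffle masks for the halving add-reduction pyramid.

    Each step moves the upper half of the live lanes over the lower half
    (-1 marks undef lanes), so PopVec + shuffle(PopVec) halves the width."""
    masks, width = [], num_elts
    while width > 1:
        half = width // 2
        mask = [-1] * num_elts
        for i in range(half):
            mask[i] = i + half
        masks.append(mask)
        width = half
    return masks

# v4i64 (i256) needs two reduction steps:
assert pyramid_masks(4) == [[2, 3, -1, -1], [1, -1, -1, -1]]
```

These are exactly the `VECTOR_SHUFFLE` masks that `matchBinOpReduction` pattern-matches, which an `EXTRACT_SUBVECTOR`-based reduction would not produce.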
## Test plan
- [x] Updated `llvm/test/CodeGen/X86/bitcnt-big-integer.ll` with regenerated CHECK lines for all feature levels
- [x] `ninja check-llvm-codegen-x86` — `bitcnt-big-integer.ll` passes, no new failures
- [x] Verified i128 codegen is unchanged (no regression)
- [x] Verified codegen manually with `llc` for VPOPCNTDQ, AVX2+POPCNT, and POPCNT-only
- [x] Verified PSADBW fires for i64 return path on `-mcpu=x86-64-v4 -mattr=+avx512vpopcntdq`
**Disclaimer:** This contribution is LLM-assisted (Claude). I have reviewed all generated code, verified the approach against the X86 backend sources (`combineArithReduction`, `matchBinOpReduction`), and tested manually. That said, this may not fully meet the level of contribution expected for this project — I'm very open to feedback and happy to iterate!
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Assisted-by: Claude (Anthropic)
---
Patch is 67.67 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/182675.diff
2 Files Affected:
- (modified) llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp (+44)
- (modified) llvm/test/CodeGen/X86/bitcnt-big-integer.ll (+904-191)
``````````diff
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index 0d5cba405d6e3..2210b3aa6a64f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -4205,6 +4205,50 @@ void DAGTypeLegalizer::ExpandIntRes_CTPOP(SDNode *N, SDValue &Lo, SDValue &Hi) {
// If the function is not available, fall back on the expansion.
}
+ // Optimization: if the integer fits in a legal vector type and the target
+ // has efficient vector CTPOP, use bitcast -> vector ctpop -> horizontal sum.
+ // This avoids extracting to scalar for each word (e.g. on x86, this enables
+ // VPOPCNTDQ instead of 4x scalar popcntq).
+ //
+ // We require >= 256 bits because for 128-bit integers the scalar expansion
+ // (2x popcntq + add) is already efficient, while the vector path introduces
+ // costly GPR-to-XMM domain crossings when the value is in registers.
+ unsigned BitWidth = VT.getSizeInBits();
+ if (BitWidth >= 256 && isPowerOf2_32(BitWidth)) {
+ MVT EltVT = MVT::i64;
+ unsigned NumElts = BitWidth / 64;
+ MVT VecVT = MVT::getVectorVT(EltVT, NumElts);
+ if (VecVT != MVT::INVALID_SIMPLE_VALUE_TYPE && TLI.isTypeLegal(VecVT) &&
+ TLI.isOperationLegal(ISD::CTPOP, VecVT)) {
+ // Bitcast integer to vector (free at register level).
+ SDValue Vec = DAG.getBitcast(VecVT, Op);
+ // Per-element popcount (target lowers to PSHUFB+PSADBW or VPOPCNTDQ).
+ SDValue PopVec = DAG.getNode(ISD::CTPOP, DL, VecVT, Vec);
+ // Sum all elements via shuffle+add pyramid reduction. Using
+ // VECTOR_SHUFFLE (rather than EXTRACT_SUBVECTOR) enables
+ // matchBinOpReduction to recognize the pattern and fold to PSADBW.
+ unsigned ReduxWidth = NumElts;
+ while (ReduxWidth > 1) {
+ unsigned HalfWidth = ReduxWidth / 2;
+ SmallVector<int, 16> ShufMask(NumElts, -1);
+ for (unsigned i = 0; i < HalfWidth; ++i)
+ ShufMask[i] = i + HalfWidth;
+ SDValue Shuf = DAG.getVectorShuffle(VecVT, DL, PopVec,
+ DAG.getUNDEF(VecVT), ShufMask);
+ PopVec = DAG.getNode(ISD::ADD, DL, VecVT, PopVec, Shuf);
+ ReduxWidth = HalfWidth;
+ }
+ // Extract scalar i64 result.
+ SDValue Result = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i64,
+ PopVec, DAG.getVectorIdxConstant(0, DL));
+ // Split into Lo/Hi for type legalization.
+ EVT NVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
+ Lo = DAG.getNode(ISD::ZERO_EXTEND, DL, NVT, Result);
+ Hi = DAG.getConstant(0, DL, NVT);
+ return;
+ }
+ }
+
// ctpop(HiLo) -> ctpop(Hi)+ctpop(Lo)
GetExpandedInteger(Op, Lo, Hi);
EVT NVT = Lo.getValueType();
diff --git a/llvm/test/CodeGen/X86/bitcnt-big-integer.ll b/llvm/test/CodeGen/X86/bitcnt-big-integer.ll
index 06ccbf4daa1e8..d74e1a880fa47 100644
--- a/llvm/test/CodeGen/X86/bitcnt-big-integer.ll
+++ b/llvm/test/CodeGen/X86/bitcnt-big-integer.ll
@@ -3,6 +3,7 @@
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=CHECK,AVX2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knl | FileCheck %s --check-prefixes=AVX512,AVX512F
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=AVX512,AVX512VL
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knm | FileCheck %s --check-prefixes=AVX512,AVX512VPOPCNTDQ
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64-v4 -mattr=+avx512vpopcntdq | FileCheck %s --check-prefixes=AVX512,AVX512POPCNT
;
@@ -94,6 +95,16 @@ define i32 @vector_ctpop_i128(<4 x i32> %v0) nounwind {
; AVX512VL-NEXT: # kill: def $eax killed $eax killed $rax
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: vector_ctpop_i128:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: vpextrq $1, %xmm0, %rax
+; AVX512VPOPCNTDQ-NEXT: vmovq %xmm0, %rcx
+; AVX512VPOPCNTDQ-NEXT: popcntq %rax, %rdx
+; AVX512VPOPCNTDQ-NEXT: popcntq %rcx, %rax
+; AVX512VPOPCNTDQ-NEXT: addl %edx, %eax
+; AVX512VPOPCNTDQ-NEXT: # kill: def $eax killed $eax killed $rax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
; AVX512POPCNT-LABEL: vector_ctpop_i128:
; AVX512POPCNT: # %bb.0:
; AVX512POPCNT-NEXT: vmovq %xmm0, %rax
@@ -152,19 +163,39 @@ define i32 @test_ctpop_i256(i256 %a0) nounwind {
; AVX512VL-NEXT: # kill: def $eax killed $eax killed $rax
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: test_ctpop_i256:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: vmovq %rcx, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovq %rdx, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
+; AVX512VPOPCNTDQ-NEXT: vmovq %rsi, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vmovq %rdi, %xmm2
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm2[0],xmm1[0]
+; AVX512VPOPCNTDQ-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
+; AVX512VPOPCNTDQ-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %eax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
; AVX512POPCNT-LABEL: test_ctpop_i256:
; AVX512POPCNT: # %bb.0:
-; AVX512POPCNT-NEXT: popcntq %rcx, %rax
-; AVX512POPCNT-NEXT: xorl %ecx, %ecx
-; AVX512POPCNT-NEXT: popcntq %rdx, %rcx
-; AVX512POPCNT-NEXT: addl %eax, %ecx
-; AVX512POPCNT-NEXT: xorl %edx, %edx
-; AVX512POPCNT-NEXT: popcntq %rsi, %rdx
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq %rdi, %rax
-; AVX512POPCNT-NEXT: addl %edx, %eax
-; AVX512POPCNT-NEXT: addl %ecx, %eax
-; AVX512POPCNT-NEXT: # kill: def $eax killed $eax killed $rax
+; AVX512POPCNT-NEXT: vmovq %rcx, %xmm0
+; AVX512POPCNT-NEXT: vmovq %rdx, %xmm1
+; AVX512POPCNT-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
+; AVX512POPCNT-NEXT: vmovq %rsi, %xmm1
+; AVX512POPCNT-NEXT: vmovq %rdi, %xmm2
+; AVX512POPCNT-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm2[0],xmm1[0]
+; AVX512POPCNT-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
+; AVX512POPCNT-NEXT: vpopcntq %ymm0, %ymm0
+; AVX512POPCNT-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vmovd %xmm0, %eax
+; AVX512POPCNT-NEXT: vzeroupper
; AVX512POPCNT-NEXT: retq
%cnt = call i256 @llvm.ctpop.i256(i256 %a0)
%res = trunc i256 %cnt to i32
@@ -222,17 +253,26 @@ define i32 @load_ctpop_i256(ptr %p0) nounwind {
; AVX512VL-NEXT: # kill: def $eax killed $eax killed $rax
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: load_ctpop_i256:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: vmovdqu (%rdi), %ymm0
+; AVX512VPOPCNTDQ-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %eax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
; AVX512POPCNT-LABEL: load_ctpop_i256:
; AVX512POPCNT: # %bb.0:
-; AVX512POPCNT-NEXT: popcntq 24(%rdi), %rax
-; AVX512POPCNT-NEXT: popcntq 16(%rdi), %rcx
-; AVX512POPCNT-NEXT: addl %eax, %ecx
-; AVX512POPCNT-NEXT: popcntq 8(%rdi), %rdx
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq (%rdi), %rax
-; AVX512POPCNT-NEXT: addl %edx, %eax
-; AVX512POPCNT-NEXT: addl %ecx, %eax
-; AVX512POPCNT-NEXT: # kill: def $eax killed $eax killed $rax
+; AVX512POPCNT-NEXT: vpopcntq (%rdi), %ymm0
+; AVX512POPCNT-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vmovd %xmm0, %eax
+; AVX512POPCNT-NEXT: vzeroupper
; AVX512POPCNT-NEXT: retq
%a0 = load i256, ptr %p0
%cnt = call i256 @llvm.ctpop.i256(i256 %a0)
@@ -316,23 +356,25 @@ define i32 @vector_ctpop_i256(<8 x i32> %v0) nounwind {
; AVX512VL-NEXT: vzeroupper
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: vector_ctpop_i256:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0
+; AVX512VPOPCNTDQ-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %eax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
; AVX512POPCNT-LABEL: vector_ctpop_i256:
; AVX512POPCNT: # %bb.0:
-; AVX512POPCNT-NEXT: vpextrq $1, %xmm0, %rax
-; AVX512POPCNT-NEXT: vmovq %xmm0, %rcx
-; AVX512POPCNT-NEXT: vextracti128 $1, %ymm0, %xmm0
-; AVX512POPCNT-NEXT: vmovq %xmm0, %rdx
-; AVX512POPCNT-NEXT: vpextrq $1, %xmm0, %rsi
-; AVX512POPCNT-NEXT: popcntq %rsi, %rsi
-; AVX512POPCNT-NEXT: popcntq %rdx, %rdx
-; AVX512POPCNT-NEXT: addl %esi, %edx
-; AVX512POPCNT-NEXT: xorl %esi, %esi
-; AVX512POPCNT-NEXT: popcntq %rax, %rsi
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq %rcx, %rax
-; AVX512POPCNT-NEXT: addl %esi, %eax
-; AVX512POPCNT-NEXT: addl %edx, %eax
-; AVX512POPCNT-NEXT: # kill: def $eax killed $eax killed $rax
+; AVX512POPCNT-NEXT: vpopcntq %ymm0, %ymm0
+; AVX512POPCNT-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vmovd %xmm0, %eax
; AVX512POPCNT-NEXT: vzeroupper
; AVX512POPCNT-NEXT: retq
%a0 = bitcast <8 x i32> %v0 to i256
@@ -412,29 +454,53 @@ define i32 @test_ctpop_i512(i512 %a0) nounwind {
; AVX512VL-NEXT: # kill: def $eax killed $eax killed $rax
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: test_ctpop_i512:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: vmovq %rcx, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovq %rdx, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
+; AVX512VPOPCNTDQ-NEXT: vmovq %rsi, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vmovq %rdi, %xmm2
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm2[0],xmm1[0]
+; AVX512VPOPCNTDQ-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
+; AVX512VPOPCNTDQ-NEXT: vmovq %r9, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vmovq %r8, %xmm2
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm2[0],xmm1[0]
+; AVX512VPOPCNTDQ-NEXT: vinserti128 $1, {{[0-9]+}}(%rsp), %ymm1, %ymm1
+; AVX512VPOPCNTDQ-NEXT: vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %zmm1, %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %eax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
; AVX512POPCNT-LABEL: test_ctpop_i512:
; AVX512POPCNT: # %bb.0:
-; AVX512POPCNT-NEXT: popcntq {{[0-9]+}}(%rsp), %rax
-; AVX512POPCNT-NEXT: popcntq {{[0-9]+}}(%rsp), %r10
-; AVX512POPCNT-NEXT: addl %eax, %r10d
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq %r9, %rax
-; AVX512POPCNT-NEXT: popcntq %r8, %r8
-; AVX512POPCNT-NEXT: addl %eax, %r8d
-; AVX512POPCNT-NEXT: addl %r10d, %r8d
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq %rcx, %rax
-; AVX512POPCNT-NEXT: xorl %ecx, %ecx
-; AVX512POPCNT-NEXT: popcntq %rdx, %rcx
-; AVX512POPCNT-NEXT: addl %eax, %ecx
-; AVX512POPCNT-NEXT: xorl %edx, %edx
-; AVX512POPCNT-NEXT: popcntq %rsi, %rdx
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq %rdi, %rax
-; AVX512POPCNT-NEXT: addl %edx, %eax
-; AVX512POPCNT-NEXT: addl %ecx, %eax
-; AVX512POPCNT-NEXT: addl %r8d, %eax
-; AVX512POPCNT-NEXT: # kill: def $eax killed $eax killed $rax
+; AVX512POPCNT-NEXT: vmovq %rcx, %xmm0
+; AVX512POPCNT-NEXT: vmovq %rdx, %xmm1
+; AVX512POPCNT-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
+; AVX512POPCNT-NEXT: vmovq %rsi, %xmm1
+; AVX512POPCNT-NEXT: vmovq %rdi, %xmm2
+; AVX512POPCNT-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm2[0],xmm1[0]
+; AVX512POPCNT-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
+; AVX512POPCNT-NEXT: vmovq %r9, %xmm1
+; AVX512POPCNT-NEXT: vmovq %r8, %xmm2
+; AVX512POPCNT-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm2[0],xmm1[0]
+; AVX512POPCNT-NEXT: vinserti128 $1, {{[0-9]+}}(%rsp), %ymm1, %ymm1
+; AVX512POPCNT-NEXT: vinserti64x4 $1, %ymm1, %zmm0, %zmm0
+; AVX512POPCNT-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512POPCNT-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512POPCNT-NEXT: vpaddq %zmm1, %zmm0, %zmm0
+; AVX512POPCNT-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vmovd %xmm0, %eax
+; AVX512POPCNT-NEXT: vzeroupper
; AVX512POPCNT-NEXT: retq
%cnt = call i512 @llvm.ctpop.i512(i512 %a0)
%res = trunc i512 %cnt to i32
@@ -533,28 +599,29 @@ define i32 @load_ctpop_i512(ptr %p0) nounwind {
; AVX512VL-NEXT: # kill: def $eax killed $eax killed $rax
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: load_ctpop_i512:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: vpopcntq (%rdi), %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %zmm1, %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %eax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
; AVX512POPCNT-LABEL: load_ctpop_i512:
; AVX512POPCNT: # %bb.0:
-; AVX512POPCNT-NEXT: popcntq 56(%rdi), %rax
-; AVX512POPCNT-NEXT: popcntq 48(%rdi), %rcx
-; AVX512POPCNT-NEXT: addl %eax, %ecx
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq 40(%rdi), %rax
-; AVX512POPCNT-NEXT: popcntq 32(%rdi), %rdx
-; AVX512POPCNT-NEXT: addl %eax, %edx
-; AVX512POPCNT-NEXT: addl %ecx, %edx
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq 24(%rdi), %rax
-; AVX512POPCNT-NEXT: xorl %ecx, %ecx
-; AVX512POPCNT-NEXT: popcntq 16(%rdi), %rcx
-; AVX512POPCNT-NEXT: popcntq 8(%rdi), %rsi
-; AVX512POPCNT-NEXT: addl %eax, %ecx
-; AVX512POPCNT-NEXT: xorl %eax, %eax
-; AVX512POPCNT-NEXT: popcntq (%rdi), %rax
-; AVX512POPCNT-NEXT: addl %esi, %eax
-; AVX512POPCNT-NEXT: addl %ecx, %eax
-; AVX512POPCNT-NEXT: addl %edx, %eax
-; AVX512POPCNT-NEXT: # kill: def $eax killed $eax killed $rax
+; AVX512POPCNT-NEXT: vpopcntq (%rdi), %zmm0
+; AVX512POPCNT-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512POPCNT-NEXT: vpaddq %zmm1, %zmm0, %zmm0
+; AVX512POPCNT-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vmovd %xmm0, %eax
+; AVX512POPCNT-NEXT: vzeroupper
; AVX512POPCNT-NEXT: retq
%a0 = load i512, ptr %p0
%cnt = call i512 @llvm.ctpop.i512(i512 %a0)
@@ -685,35 +752,28 @@ define i32 @vector_ctpop_i512(<16 x i32> %v0) nounwind {
; AVX512VL-NEXT: vzeroupper
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: vector_ctpop_i512:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %zmm1, %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %eax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
; AVX512POPCNT-LABEL: vector_ctpop_i512:
; AVX512POPCNT: # %bb.0:
+; AVX512POPCNT-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512POPCNT-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512POPCNT-NEXT: vpaddq %zmm1, %zmm0, %zmm0
; AVX512POPCNT-NEXT: vextracti128 $1, %ymm0, %xmm1
-; AVX512POPCNT-NEXT: vmovq %xmm1, %rax
-; AVX512POPCNT-NEXT: vpextrq $1, %xmm1, %rcx
-; AVX512POPCNT-NEXT: vpextrq $1, %xmm0, %rdx
-; AVX512POPCNT-NEXT: vmovq %xmm0, %rsi
-; AVX512POPCNT-NEXT: vextracti32x4 $2, %zmm0, %xmm1
-; AVX512POPCNT-NEXT: vmovq %xmm1, %rdi
-; AVX512POPCNT-NEXT: vpextrq $1, %xmm1, %r8
-; AVX512POPCNT-NEXT: vextracti32x4 $3, %zmm0, %xmm0
-; AVX512POPCNT-NEXT: vmovq %xmm0, %r9
-; AVX512POPCNT-NEXT: vpextrq $1, %xmm0, %r10
-; AVX512POPCNT-NEXT: popcntq %r10, %r10
-; AVX512POPCNT-NEXT: popcntq %r9, %r9
-; AVX512POPCNT-NEXT: addl %r10d, %r9d
-; AVX512POPCNT-NEXT: popcntq %r8, %r8
-; AVX512POPCNT-NEXT: popcntq %rdi, %rdi
-; AVX512POPCNT-NEXT: addl %r8d, %edi
-; AVX512POPCNT-NEXT: addl %r9d, %edi
-; AVX512POPCNT-NEXT: popcntq %rdx, %rdx
-; AVX512POPCNT-NEXT: popcntq %rsi, %rsi
-; AVX512POPCNT-NEXT: addl %edx, %esi
-; AVX512POPCNT-NEXT: popcntq %rcx, %rcx
-; AVX512POPCNT-NEXT: popcntq %rax, %rax
-; AVX512POPCNT-NEXT: addl %ecx, %eax
-; AVX512POPCNT-NEXT: addl %esi, %eax
-; AVX512POPCNT-NEXT: addl %edi, %eax
-; AVX512POPCNT-NEXT: # kill: def $eax killed $eax killed $rax
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512POPCNT-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512POPCNT-NEXT: vmovd %xmm0, %eax
; AVX512POPCNT-NEXT: vzeroupper
; AVX512POPCNT-NEXT: retq
%a0 = bitcast <16 x i32> %v0 to i512
@@ -917,56 +977,71 @@ define i32 @test_ctpop_i1024(i1024 %a0) nounwind {
; AVX512VL-NEXT: popq %r14
; AVX512VL-NEXT: retq
;
+; AVX512VPOPCNTDQ-LABEL: test_ctpop_i1024:
+; AVX512VPOPCNTDQ: # %bb.0:
+; AVX512VPOPCNTDQ-NEXT: vmovq %rcx, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovq %rdx, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
+; AVX512VPOPCNTDQ-NEXT: vmovq %rsi, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vmovq %rdi, %xmm2
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm2[0],xmm1[0]
+; AVX512VPOPCNTDQ-NEXT: vmovq %r9, %xmm2
+; AVX512VPOPCNTDQ-NEXT: vmovq %r8, %xmm3
+; AVX512VPOPCNTDQ-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm3[0],xmm2[0]
+; AVX512VPOPCNTDQ-NEXT: vinserti128 $1, {{[0-9]+}}(%rsp), %ymm2, %ymm2
+; AVX512VPOPCNTDQ-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
+; AVX512VPOPCNTDQ-NEXT: vinserti64x4 $1, %ymm2, %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vpopcntq %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %zmm1, %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %ecx
+; AVX512VPOPCNTDQ-NEXT: vpopcntq {{[0-9]+}}(%rsp), %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti64x4 $1, %zmm0, %ymm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %zmm1, %zmm0, %zmm0
+; AVX512VPOPCNTDQ-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
+; AVX512VPOPCNTDQ-NEXT: vpaddq %xmm1, %xmm0, %xmm0
+; AVX512VPOPCNTDQ-NEXT: vmovd %xmm0, %eax
+; AVX512VPOPCNTDQ-NEXT: addl %ecx, %eax
+; AVX512VPOPCNTDQ-NEXT: retq
+;
...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/182675