[llvm] [AArch64] Improve scalar and Neon popcount with SVE CNT. (PR #143870)

Ricardo Jesus via llvm-commits llvm-commits at lists.llvm.org
Thu Jun 26 08:36:08 PDT 2025


================
@@ -577,11 +670,25 @@ define <8 x i16> @popcount8x16(<8 x i16> %0) {
 ; CHECKO0-NEXT:    uaddlp v0.8h, v0.16b
 ; CHECKO0-NEXT:    ret
 ;
-; CHECK-LABEL: popcount8x16:
-; CHECK:       // %bb.0: // %Entry
-; CHECK-NEXT:    cnt v0.16b, v0.16b
-; CHECK-NEXT:    uaddlp v0.8h, v0.16b
-; CHECK-NEXT:    ret
+; NEON-LABEL: popcount8x16:
+; NEON:       // %bb.0: // %Entry
+; NEON-NEXT:    cnt v0.16b, v0.16b
+; NEON-NEXT:    uaddlp v0.8h, v0.16b
+; NEON-NEXT:    ret
+;
+; DOT-LABEL: popcount8x16:
+; DOT:       // %bb.0: // %Entry
+; DOT-NEXT:    cnt v0.16b, v0.16b
+; DOT-NEXT:    uaddlp v0.8h, v0.16b
+; DOT-NEXT:    ret
+;
+; SVE-LABEL: popcount8x16:
+; SVE:       // %bb.0: // %Entry
+; SVE-NEXT:    ptrue p0.h, vl8
+; SVE-NEXT:    // kill: def $q0 killed $q0 def $z0
+; SVE-NEXT:    cnt z0.h, p0/m, z0.h
+; SVE-NEXT:    // kill: def $q0 killed $q0 killed $z0
+; SVE-NEXT:    ret
----------------
rj-jesus wrote:

I believe that in most real-world scenarios the cost of the PTRUE should be negligible, either because it's materialised well in advance or because it gets pipelined with other instructions along the critical path.

In somewhat unrealistic loops such as
```gas
neon:
  cnt    v0.16b, v0.16b
  uaddlp v0.8h, v0.16b
  subs   x0, x0, 1
  b.ne   neon
```
and
```gas
sve:
  ptrue p0.h, vl8
  cnt   z0.h, p0/m, z0.h
  subs  x0, x0, 1
  b.ne  sve
```
the SVE version is 2x faster than the Neon version (on Neoverse V2) due to the shorter critical path.
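
As a rough illustration of why the two sequences are interchangeable, here is a scalar C++ model of a single 16-bit lane. It is only a sketch and not part of the patch: the function names (`popcount8`, `popcount16_neon_model`, `popcount16_sve_model`, `models_agree`) are hypothetical, and the code models the lane arithmetic only, not the instructions' timing.

```cpp
#include <cassert>
#include <cstdint>

// Population count of one byte: models a single lane of Neon `cnt v0.16b`.
static inline uint16_t popcount8(uint8_t b) {
  uint16_t n = 0;
  for (; b != 0; b &= static_cast<uint8_t>(b - 1)) // clear the lowest set bit
    ++n;
  return n;
}

// Neon-style lowering for one .8h lane: per-byte counts (cnt), followed by a
// widening pairwise add of the two adjacent byte counts (uaddlp).
static inline uint16_t popcount16_neon_model(uint16_t x) {
  uint16_t lo = popcount8(static_cast<uint8_t>(x & 0xff));
  uint16_t hi = popcount8(static_cast<uint8_t>(x >> 8));
  assert(lo <= 8 && hi <= 8); // each byte count is at most 8, so the sum fits
  return static_cast<uint16_t>(lo + hi);
}

// SVE-style lowering for one .h lane: a single count over all 16 bits.
static inline uint16_t popcount16_sve_model(uint16_t x) {
  uint16_t n = 0;
  for (; x != 0; x &= static_cast<uint16_t>(x - 1))
    ++n;
  return n;
}

// Exhaustively compare the two models over every 16-bit input.
static inline bool models_agree() {
  for (uint32_t v = 0; v <= 0xffff; ++v)
    if (popcount16_neon_model(static_cast<uint16_t>(v)) !=
        popcount16_sve_model(static_cast<uint16_t>(v)))
      return false;
  return true;
}
```

The Neon model needs two dependent steps per lane (count, then pairwise add) where the SVE model needs one, which mirrors the shorter dependency chain in the `sve` loop.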

In loops such as
```cpp
  for (size_t i = 0; i < N; ++i)
    x[i] = __builtin_popcountg(x[i]);
```
I see no difference between the two versions, since the popcount isn't on the critical path. Presumably the SVE version would still be preferable in real-world cases, as it uses the V pipes fewer times and "shortens" the latency of the popcount.
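
For anyone wanting to try this locally, a self-contained version of that loop might look like the sketch below. Several things here are assumptions rather than what I actually benchmarked: the wrapper name `popcount_in_place` is hypothetical, the element type is assumed to be `uint16_t` (to match the `popcount8x16` test), and `__builtin_popcount` is swapped in for `__builtin_popcountg` so that it also builds with older GCC and Clang releases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper around the loop from the comment. Replaces each
// element with its population count. Whether the compiler vectorises this
// with Neon or SVE depends on the target flags and the compiler version.
void popcount_in_place(uint16_t *x, size_t n) {
  assert(x != nullptr || n == 0); // an empty range may be passed as null
  for (size_t i = 0; i < n; ++i)
    // x[i] is promoted to unsigned int; the result is at most 16.
    x[i] = static_cast<uint16_t>(__builtin_popcount(x[i]));
}
```

Compiling this for AArch64 with and without SVE enabled and comparing the generated loop bodies should show which lowering is selected.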

Do you have a specific case in mind that you're worried about? For what it's worth, GCC implemented a similar lowering a few months ago ([Neon](https://github.com/gcc-mirror/gcc/commit/e4b8db26de35239bd621aad9c0361f25d957122b) and [scalar](https://github.com/gcc-mirror/gcc/commit/9ffcf1f193b477f417a4c1960cd32696a23b99b4)).

https://github.com/llvm/llvm-project/pull/143870

