[PATCH] [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.

Bruno Cardoso Lopes bruno.cardoso at gmail.com
Thu May 28 06:00:45 PDT 2015

This is awesome Chandler, thank you! Thanks Fiora! :D

I agree that keeping it in the vector unit is likely better when we already have vector ops around. We should do that!
FTR, here are some new Haswell measurements with your patch, for the cases where the implementation changed:

v8i32-avx2 -> sselookup now beats scalar ctpop \o/

scalar ctpop (v8i32): 3.93436
sselookup (v8i32): 3.36358

v4i32-avx -> yay, scalar is still slightly better across runs, but sselookup improved significantly compared to my previous patch!

scalar ctpop (v4i32): 0.916582
sselookup (v4i32): from ~1.10 to 0.963113
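
For anyone reading along without the patch open, here is a minimal scalar sketch of the nibble-lookup idea named in the subject line, which I take to be what "sselookup" measures above. It is not code from the patch: the names (kNibblePopcnt, lutLookup, bytePopcount) are made up for illustration, and the loops only model what a single PSHUFB does to all 16 bytes at once.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is a scalar model, not the patch's code.
using Bytes16 = std::array<std::uint8_t, 16>;  // stands in for one XMM register

// Bit counts for every 4-bit value. In the vector version this 16-entry
// table lives in a register and PSHUFB indexes into it.
static constexpr Bytes16 kNibblePopcnt = {0, 1, 1, 2, 1, 2, 2, 3,
                                          1, 2, 2, 3, 2, 3, 3, 4};

// Scalar stand-in for PSHUFB with indices in 0..15: out[i] = table[idx[i]].
// The real instruction also zeroes a byte whose index has bit 7 set, which
// never happens for the masked nibbles used below.
Bytes16 lutLookup(const Bytes16 &idx) {
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = kNibblePopcnt[idx[i] & 0x0F];
  return out;
}

// Per-byte population count: split each byte into its low and high nibble,
// look both up, and add the two partial counts.
Bytes16 bytePopcount(const Bytes16 &v) {
  Bytes16 lo{}, hi{};
  for (std::size_t i = 0; i < v.size(); ++i) {
    lo[i] = v[i] & 0x0F;
    hi[i] = static_cast<std::uint8_t>(v[i] >> 4);
  }
  const Bytes16 countLo = lutLookup(lo);
  const Bytes16 countHi = lutLookup(hi);
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = static_cast<std::uint8_t>(countLo[i] + countHi[i]);
  return out;
}
```

The result is one count per byte; wider element types such as v4i32 still need a horizontal add of those byte counts afterwards.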

That said, LGTM. Some minor comments in the patch below.

Comment at: lib/Target/X86/X86ISelLowering.cpp:850
@@ +849,3 @@
+      // instructions on v4i32/v4i64 elements than to custom lower ctpop.
+      if (!Subtarget->hasPOPCNT()) {
+        setOperationAction(ISD::CTPOP,            MVT::v4i32, Custom);
Need to remove this check since we're not going to fall back anymore for >= SSSE3.

Comment at: lib/Target/X86/X86ISelLowering.cpp:1125
@@ +1124,3 @@
+    // always profitable if scalar popcnt is not available.
+    if (!Subtarget->hasPOPCNT())
+      setOperationAction(ISD::CTPOP,           MVT::v4i64, Custom);
Same here

Comment at: lib/Target/X86/X86ISelLowering.cpp:1160
@@ -1146,1 +1159,3 @@
+      setOperationAction(ISD::CTPOP,           MVT::v4i64, Custom);
       // The custom lowering for UINT_TO_FP for v8i32 becomes interesting
With the change in the last comment above, this line can go away.

Comment at: lib/Target/X86/X86ISelLowering.cpp:17392
@@ +17391,3 @@
+  // Mask and shift to extract 32-bit components, use two PSADBW to pop count
+  // each one and OR the result.
+  if (EltVT == MVT::i32) {
This comment can be removed
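
For context, here is one reading of the quoted comment as a scalar sketch, assuming the input already holds per-byte pop counts. This is my illustration rather than the patch's code: the names (Lanes64, sumBytesPerLane, reduceToI32) are hypothetical, the PSADBW step is modeled as a byte sum per 64-bit lane (which is what PSADBW against zero produces), and the exact order of operations in the patch may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical names; one 128-bit register modeled as two 64-bit lanes.
using Lanes64 = std::array<std::uint64_t, 2>;

// Scalar stand-in for PSADBW against an all-zero operand: each 64-bit lane
// becomes the sum of its eight bytes.
Lanes64 sumBytesPerLane(const Lanes64 &v) {
  Lanes64 out{};
  for (std::size_t lane = 0; lane < v.size(); ++lane) {
    std::uint64_t sum = 0;
    for (int byte = 0; byte < 8; ++byte)
      sum += (v[lane] >> (8 * byte)) & 0xFF;
    out[lane] = sum;
  }
  return out;
}

// byteCounts holds a pop count in every byte. The result holds one pop count
// per 32-bit element, in the same element positions.
Lanes64 reduceToI32(const Lanes64 &byteCounts) {
  Lanes64 lowHalf{}, highHalf{};
  for (std::size_t lane = 0; lane < byteCounts.size(); ++lane) {
    lowHalf[lane] = byteCounts[lane] & 0xFFFFFFFFull;  // mask: even elements
    highHalf[lane] = byteCounts[lane] >> 32;           // shift: odd elements
  }
  const Lanes64 lowSum = sumBytesPerLane(lowHalf);    // first PSADBW
  const Lanes64 highSum = sumBytesPerLane(highHalf);  // second PSADBW
  Lanes64 out{};
  for (std::size_t lane = 0; lane < out.size(); ++lane)
    out[lane] = lowSum[lane] | (highSum[lane] << 32);  // move back and OR
  return out;
}
```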


