[llvm] [X86][Codegen] Shuffle certain shifts on i8 vectors to create opportunity for vectorized shift instructions (PR #117980)

Sun Dec 15 08:38:35 PST 2024

================
@@ -30044,6 +30150,136 @@ static SDValue LowerShift(SDValue Op, const X86Subtarget &Subtarget,
     }
   }
 
+  // SHL/SRL/SRA on vXi8 can be widened to vYi16 or vYi32 if the constant
+  // amounts can be shuffled such that every pair or quad of adjacent elements
+  // has the same value. This introduces an extra shuffle before and after the
+  // shift, and it is profitable if the operand is aready a shuffle so that both
+  // can be merged or the extra shuffle is fast.
+  // (shift (shuffle X P1) S1) ->
+  // (shuffle (shift (shuffle X (shuffle P2 P1)) S2) P2^-1) where S2 can be
+  // widened, and P2^-1 is the inverse shuffle of P2.
+  // This is not profitable on XOP or AVX512 becasue it has 8/16-bit vector
+  // variable shift instructions.
+  // Picking out GFNI because normally it implies AVX512, and there is no
+  // latency data for CPU with GFNI and SSE or AVX only, but there are tests for
+  // such combination anyways.
+  if (ConstantAmt &&
----------------
RKSimon wrote:

The lowering scheme immediately above this is very similar to what you're doing (and a lot easier to grok) - I'd recommend you look at extending that code instead of introducing this separate implementation.

https://github.com/llvm/llvm-project/pull/117980