[llvm] [AMDGPU] Vectorize i8 Shuffles (PR #105850)

Wed Oct 16 12:26:27 PDT 2024

================
@@ -363,11 +363,67 @@ bb:
   ret <4 x i16> %ins.3
 }
 
+define <4 x i8> @uadd_sat_v4i8(<4 x i8> %arg0, <4 x i8> %arg1, ptr addrspace(1) %dst) {
+; GCN-LABEL: @uadd_sat_v4i8(
+; GCN-NEXT:  bb:
+; GCN-NEXT:    [[TMP0:%.*]] = call <4 x i8> @llvm.uadd.sat.v4i8(<4 x i8> [[ARG0:%.*]], <4 x i8> [[ARG1:%.*]])
----------------
jrbyrnes wrote:

This is due to the calling convention; it is not a regression in lowering @llvm.uadd.sat.v4i8 vs. 4 x @llvm.uadd.sat.i8.

The calling convention scalarizes i8 vectors, similar to how we pass them across basic blocks. The SLP cost model should account for the extracts needed for the vectorized version.
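
A rough IR sketch of the point, not taken from the patch: the function name `@uadd_sat_v4i8_extracts` and the per-lane stores are hypothetical, chosen only to make the lane-by-lane use visible. If the vectorized result has to be consumed one i8 at a time, as it would be once the vector is scalarized at a boundary, each lane needs an `extractelement` that the cost model would have to count:

```llvm
declare <4 x i8> @llvm.uadd.sat.v4i8(<4 x i8>, <4 x i8>)

; Hypothetical example: one vector saturating add, then each lane is used
; as a scalar i8. The four extractelement instructions are the extra work
; that the vectorized form carries when its result is scalarized.
define void @uadd_sat_v4i8_extracts(<4 x i8> %arg0, <4 x i8> %arg1, ptr addrspace(1) %dst) {
bb:
  %sat = call <4 x i8> @llvm.uadd.sat.v4i8(<4 x i8> %arg0, <4 x i8> %arg1)
  ; One extract per lane.
  %e0 = extractelement <4 x i8> %sat, i64 0
  %e1 = extractelement <4 x i8> %sat, i64 1
  %e2 = extractelement <4 x i8> %sat, i64 2
  %e3 = extractelement <4 x i8> %sat, i64 3
  ; Store each lane separately so the scalar uses stay explicit.
  %p1 = getelementptr i8, ptr addrspace(1) %dst, i64 1
  %p2 = getelementptr i8, ptr addrspace(1) %dst, i64 2
  %p3 = getelementptr i8, ptr addrspace(1) %dst, i64 3
  store i8 %e0, ptr addrspace(1) %dst
  store i8 %e1, ptr addrspace(1) %p1
  store i8 %e2, ptr addrspace(1) %p2
  store i8 %e3, ptr addrspace(1) %p3
  ret void
}
```

The scalar alternative would be four `@llvm.uadd.sat.i8` calls with no extracts on the result side, which is the comparison the cost model needs to weigh.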

https://github.com/llvm/llvm-project/pull/105850