[llvm] [AMDGPU] Vectorize more 16 bit shuffles (PR #90648)

Tue May 7 12:39:37 PDT 2024

================
@@ -211,6 +267,18 @@ define <2 x i32> @ssub_sat_v2i32(<2 x i32> %arg0, <2 x i32> %arg1) {
 ; GCN-NEXT:    [[INS_1:%.*]] = insertelement <2 x i32> [[INS_0]], i32 [[ADD_1]], i64 1
 ; GCN-NEXT:    ret <2 x i32> [[INS_1]]
 ;
+; GFX9-LABEL: @ssub_sat_v2i32(
+; GFX9-NEXT:  bb:
+; GFX9-NEXT:    [[ARG0_0:%.*]] = extractelement <2 x i32> [[ARG0:%.*]], i64 0
+; GFX9-NEXT:    [[ARG0_1:%.*]] = extractelement <2 x i32> [[ARG0]], i64 1
+; GFX9-NEXT:    [[ARG1_0:%.*]] = extractelement <2 x i32> [[ARG1:%.*]], i64 0
+; GFX9-NEXT:    [[ARG1_1:%.*]] = extractelement <2 x i32> [[ARG1]], i64 1
+; GFX9-NEXT:    [[ADD_0:%.*]] = call i32 @llvm.ssub.sat.i32(i32 [[ARG0_0]], i32 [[ARG1_0]])
----------------
jrbyrnes wrote:

> Why did i32 cases change?

It was because I changed the way checks were generated. Fixed.

> Gfx940 has some packed 32-bit ops but I'm not sure this cost model was ever updated to account for that

Looks like the cost model already has accurate widths and costs for PackedFP32: https://github.com/llvm/llvm-project/blob/7115ed0fff027b65fa76fdfae215ed1382ed1473/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp#L609

https://github.com/llvm/llvm-project/pull/90648