[llvm] [VectorCombine] isExtractExtractCheap - specify the extract/insert shuffle mask to improve shuffle costs (PR #114780)

Mon Nov 11 03:59:14 PST 2024

================
@@ -688,9 +688,9 @@ define i32 @load_multiple_extracts_with_constant_idx(ptr %x) {
 define i32 @load_multiple_extracts_with_constant_idx_profitable(ptr %x) {
 ; CHECK-LABEL: @load_multiple_extracts_with_constant_idx_profitable(
 ; CHECK-NEXT:    [[LV:%.*]] = load <8 x i32>, ptr [[X:%.*]], align 16
-; CHECK-NEXT:    [[E_0:%.*]] = extractelement <8 x i32> [[LV]], i32 0
-; CHECK-NEXT:    [[E_1:%.*]] = extractelement <8 x i32> [[LV]], i32 6
-; CHECK-NEXT:    [[RES:%.*]] = add i32 [[E_0]], [[E_1]]
+; CHECK-NEXT:    [[SHIFT:%.*]] = shufflevector <8 x i32> [[LV]], <8 x i32> poison, <8 x i32> <i32 6, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
----------------
davemgreen wrote:

Hello. I think it should be "if the vector load after legalization has a single extract, then it will be scalarized". Multiple extracts still don't scalarize in general. I'm not sure whether this is the best way forward, though. There are limits to what the cost model can handle well, and this patch (minus the AArch64 cost model change) looks like a step in the right direction.
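To make the "after legalization" distinction concrete, here is a toy C++ model. The helpers are hypothetical and are not LLVM APIs. It assumes the vector type is split into equal legal registers of a fixed width and applies the suggested rule to each legalized part. It is a sketch under those assumptions, not a description of how the cost model or the DAG combiner actually decide.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model only: these helpers are hypothetical and are not LLVM APIs.
// Assumes a vector of NumElts lanes of EltBits each is split into equal legal
// registers of RegBits, with RegBits a multiple of EltBits.
struct ToyVectorType {
  unsigned NumElts;
  unsigned EltBits;
};

// Number of legal registers the type is split into (ceiling division).
unsigned numLegalParts(ToyVectorType Ty, unsigned RegBits) {
  unsigned TotalBits = Ty.NumElts * Ty.EltBits;
  return (TotalBits + RegBits - 1) / RegBits;
}

// Applies the rule suggested in the review: a legalized part read by exactly
// one extract is assumed to become a scalar load, while a part read by several
// extracts stays a vector load. Returns how many extracts are scalarized.
unsigned countScalarizedExtracts(ToyVectorType Ty, unsigned RegBits,
                                 const std::vector<unsigned> &Lanes) {
  unsigned EltsPerPart = RegBits / Ty.EltBits;
  assert(EltsPerPart > 0 && "element wider than a register");
  std::map<unsigned, unsigned> ExtractsPerPart;
  for (unsigned Lane : Lanes)
    ++ExtractsPerPart[Lane / EltsPerPart];
  unsigned Scalarized = 0;
  for (const auto &Entry : ExtractsPerPart)
    if (Entry.second == 1)
      ++Scalarized;
  return Scalarized;
}
```

For the test above (lanes 0 and 6 of `<8 x i32>`), `countScalarizedExtracts({8, 32}, 128, {0, 6})` gives 2: with 128-bit registers each `<4 x i32>` half has a single extract. With 256-bit registers it gives 0, because one register holds both extracts.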

Am I correct that shrinking the type used to cost the vector binop, to express that only a single lane is needed rather than the whole (possibly multi-)vector width, does not work well in practice?
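As a rough illustration of what shrinking the costed type changes, the sketch below uses a hypothetical toy cost (not an LLVM API) that charges a vector binop one unit per legal register its type is split into. It assumes 128-bit vector registers. It only shows the arithmetic of the idea, not whether it works well in practice.

```cpp
#include <cassert>

// Toy cost model (hypothetical, not an LLVM API): a vector binop is charged
// one unit per legal register of RegBits that the type is split into.
unsigned toyBinOpCost(unsigned NumElts, unsigned EltBits, unsigned RegBits) {
  assert(RegBits > 0 && "register width must be non-zero");
  unsigned TotalBits = NumElts * EltBits;
  return (TotalBits + RegBits - 1) / RegBits; // ceil(TotalBits / RegBits)
}
```

With 128-bit registers, `toyBinOpCost(8, 32, 128)` is 2 for the full `<8 x i32>` add, while `toyBinOpCost(1, 32, 128)` is 1 when the type is shrunk to the single demanded lane, so the shrunk type halves the modeled cost of the add.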

If so, maybe we should go with the original. I can also try out another test, to see whether single-element shuffles should be costed more like lane moves.

https://github.com/llvm/llvm-project/pull/114780