[llvm] [VectorCombine] isExtractExtractCheap - specify the extract/insert shuffle mask to improve shuffle costs (PR #114780)

Mon Nov 4 09:39:19 PST 2024

================
@@ -688,9 +688,9 @@ define i32 @load_multiple_extracts_with_constant_idx(ptr %x) {
 define i32 @load_multiple_extracts_with_constant_idx_profitable(ptr %x) {
 ; CHECK-LABEL: @load_multiple_extracts_with_constant_idx_profitable(
 ; CHECK-NEXT:    [[LV:%.*]] = load <8 x i32>, ptr [[X:%.*]], align 16
-; CHECK-NEXT:    [[E_0:%.*]] = extractelement <8 x i32> [[LV]], i32 0
-; CHECK-NEXT:    [[E_1:%.*]] = extractelement <8 x i32> [[LV]], i32 6
-; CHECK-NEXT:    [[RES:%.*]] = add i32 [[E_0]], [[E_1]]
+; CHECK-NEXT:    [[SHIFT:%.*]] = shufflevector <8 x i32> [[LV]], <8 x i32> poison, <8 x i32> <i32 6, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
----------------
davemgreen wrote:

Hello. Am I right in saying that the load is not included in the cost? It will be difficult to beat scalarization of the load if this doesn't lead to other optimizations, but I suspect that is not really what the costs are measuring.

Maybe @fhahn remembers more about this specific case. The change you have (pass the mask to the shuffle cost) seems like a sensible optimization. There is a comment that says:
```
  // Aggressively form a vector op if the cost is equal because the transform
  // may enable further optimization.
  // Codegen can reverse this transform (scalarize) if it was not profitable.
```
Maybe the backend should be more aggressive at scalarizing. It looks like the costs here should be extract-lane-0 + extract-lane-2 (2) + i32 add (1) vs. extract-lane-0 + shuffle (1 now?) + v8i32 add (2). If it could realize that the last v8i32 add was actually a v4i32 add, the cost might be more accurate (if I have those costs correct).
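
To make the arithmetic in that comparison concrete, here is a minimal standalone sketch. It is not the VectorCombine code: the `Costs` struct and the helper names are hypothetical, the numbers are the guesses from the comment rather than values queried from TargetTransformInfo, and treating the lane-0 extract as free is an assumption.

```cpp
#include <cassert>

// Illustrative per-instruction costs. The values plugged in are the guesses
// from the review comment, not numbers queried from TargetTransformInfo.
struct Costs {
  int ExtractLane0;     // extract of lane 0 (assumed free in this sketch)
  int ExtractOtherLane; // extract of a non-zero lane
  int Shuffle;          // shuffle moving the second lane into lane 0
  int ScalarAdd;        // i32 add
  int VectorAdd;        // vector add at whatever width the cost model sees
};

// Scalar form: two extracts feeding a scalar add.
int scalarCost(const Costs &C) {
  return C.ExtractLane0 + C.ExtractOtherLane + C.ScalarAdd;
}

// Vector form: shuffle, vector add, then a single extract of lane 0.
int vectorCost(const Costs &C) {
  return C.ExtractLane0 + C.Shuffle + C.VectorAdd;
}

// Ties go to the vector form, matching the quoted "cost is equal" comment.
bool preferVector(const Costs &C) { return vectorCost(C) <= scalarCost(C); }
```

Under these assumptions, `Costs{0, 2, 1, 1, 2}` (v8i32 add costed at 2) gives 3 for both forms, so the vector form is chosen only through the tie-break. Costing the add as a v4i32 add (`VectorAdd = 1`) gives 2 for the vector form against 3 for the scalar form.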

https://github.com/llvm/llvm-project/pull/114780