[PATCH] D132892: [LV] Explicitly lower uniform load as single load

Philip Reames via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 31 13:20:18 PDT 2022


reames abandoned this revision.
reames added a comment.

In D132892#3761149 <https://reviews.llvm.org/D132892#3761149>, @fhahn wrote:

>> This replaces the "scalarize then clone-as-uniform" lowering for loads of uniform values with a dedicated mode of memory widening for this case. This avoids needing to rely on instcombine/GVN to clean up the redundant loads.
>
> IIUC the main issue is that `VPReplicateRecipe::isUniform` means 'uniform-per-part/uniform-per-VF', not 'uniform-across-all-parts'.  Unfortunately the terminology is not really used very consistently across different parts of LV. We should be able to catch some cases through improving lowering for uniform VPReplicateRecipes directly:  D133019 <https://reviews.llvm.org/D133019>.

I went ahead and accepted that one.  I don't really care which approach we use here.  This one felt slightly cleaner to me when considering store handling, but a) we can revisit if desired when adding stores, and b) incremental progress is good.
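For context, a hand-written IR sketch (not taken from the patch; VF=4 and UF=2 assumed, names illustrative) of the "scalarize then clone-as-uniform" form being replaced, where the redundant loads of the invariant address are left for instcombine/GVN to clean up:

    ; %src is loop invariant; the uniform replicate recipe emits one
    ; scalar load per unrolled part, all loading the same address.
    %l.part0 = load i32, ptr %src, align 4
    %l.part1 = load i32, ptr %src, align 4   ; redundant; relies on GVN/instcombine to CSE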

> If we want to use `isUniformMemOp` to determine the `uniform-across-all-parts` property, I think it would make sense to extend `VPReplicateRecipe` to distinguish between our two notions of uniformity. AFAIK the `VPWidenXXX` naming scheme is meant to indicate that the recipe will be widened (i.e. a wide vector instruction will be generated), but for uniform loads we only generate a single scalar instruction.

The result of the operation is still the widened vector type.  It just happens to come from a splat of a single load.  That would be my argument for this approach over yours, but as I said, I really don't care.
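To illustrate, a minimal hand-written sketch (VF=4 assumed, names illustrative, not taken from the patch) of the direct lowering: the recipe still produces the widened <4 x i32> value, it is just built by splatting one scalar load:

    %l = load i32, ptr %src, align 4
    %l.ins = insertelement <4 x i32> poison, i32 %l, i64 0
    %l.splat = shufflevector <4 x i32> %l.ins, <4 x i32> poison, <4 x i32> zeroinitializer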



================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4712
 
+      // Only try to scalarize the uniform memop itself if legal and we're not
+      // using the direct lowering strategy (which is strictly better).
----------------
david-arm wrote:
> I'm not sure what the "if legal" bit refers to here? If we're treating this as uniform then it's because Legal->isUniformMemOp has returned true. I think you could just say:
> 
>   // Only try to scalarize the uniform memop itself if we're not using the direct lowering ...
> 
We can't scalarize a scalable uniform store if the value being stored is not loop invariant, because we don't know which lane to store, and the general scalarization support does not handle predicated scalable vectors.
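As a rough illustration (hand-written, not from the patch; <vscale x 4 x i32> and illustrative names assumed), scalarizing such a store means extracting the final lane of the stored vector, whose index depends on vscale; under predication the last *active* lane is not known at compile time at all:

    %vscale = call i64 @llvm.vscale.i64()
    %vl = mul i64 %vscale, 4
    %last = sub i64 %vl, 1                                ; index of the last lane
    %elt = extractelement <vscale x 4 x i32> %val, i64 %last
    store i32 %elt, ptr %dst, align 4                     ; %dst is the uniform address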


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9746
 
+    // FLAGIT
+    assert(!isUniformMemOp() &&
----------------
david-arm wrote:
> Not sure what this comment means?
Stray change.  This is the token I use for easy local search; it should have been removed before posting.


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9786
+                                              Alignment);
+    // Add metadata to the load, but setVectorValue to the reverse shuffle.
+    State.addMetadata(NewLI, LI);
----------------
david-arm wrote:
> Not sure I understand this comment? Why does this involve a reverse shuffle?
I copied the comment and forgot to update it.  Will fix.


================
Comment at: llvm/test/Transforms/LoopVectorize/X86/cost-model-assert.ll:26
 ; CHECK:       vector.ph:
-; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 8
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 4
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
----------------
david-arm wrote:
> This looks a bit unexpected. Do you know why the minimum iteration check has changed here from 8 to 4? It sounds like we've made a different choice of VF and/or interleave count (IC). I can only assume that the cost of some instructions has changed?
Exactly.  If I remember correctly, this is because uniform instructions are never considered possible to scalarize, and I didn't add the same check for the CM_Uniform case.  I decided it didn't really matter, since judging from the name this test isn't about codegen but is a no-crash test.


================
Comment at: llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll:476
 ; CHECK-NEXT:   loop.0:
-; CHECK-NEXT:     CLONE ir<[[L]]> = load ir<%src>
+; CHECK-NEXT:     REPLICATE ir<[[L]]> = load ir<%src>
 ; CHECK-NEXT:     EMIT vp<[[SPLICE:%.+]]> = first-order splice ir<%.pn> ir<[[L]]>
----------------
david-arm wrote:
> Do you know why this has changed? As far as I understand it, 'CLONE' literally means clone the scalar instruction once, whereas REPLICATE means insert copies into each lane of the vector. So in a sense we're now doing more loads than before I think?
This one is definitely the profitable-to-scalarize issue explained just above.  I dug into this one in some depth.
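For reference, a hand-written sketch (VF=4 assumed, types and names illustrative, predication omitted for brevity) of the difference described above: CLONE emits a single scalar load for the part, while REPLICATE emits one scalar load per lane:

    ; CLONE (uniform VPReplicateRecipe): a single scalar load for the part.
    %l = load i32, ptr %src, align 4

    ; REPLICATE: one scalar load per lane.
    %l.0 = load i32, ptr %src, align 4
    %l.1 = load i32, ptr %src, align 4
    %l.2 = load i32, ptr %src, align 4
    %l.3 = load i32, ptr %src, align 4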


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132892/new/

https://reviews.llvm.org/D132892


