[PATCH] D112725: [LoopVectorize] Extract the last lane from a uniform store

Thu Nov 4 10:23:05 PDT 2021

kmclaughlin added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5480
+      // A uniform memory op is itself uniform.
+      if (Legal->isUniformMemOp(I))
         addToWorklistIfAllowed(&I);
----------------
fhahn wrote:
> Unfortunately this is incorrect now. The worklist currently relies on instructions only demanding the first lane, and this is used to propagate this property later on, using: if all users of an operand demand the first lane only, the operand itself also only needs to compute the first lane. I added a clarifying comment in b4992dbb21ff9159285ae0aec73f3d760344b0e5
> 
> Adding stores violates that. You should probably be able to work around that by adding them to `Uniforms` without adding them to the worklist.  It might be worth calling out that entries in Uniforms may demand the first or last lane.
Thank you for adding a comment to addToWorklistIfAllowed, @fhahn. I've changed this so that store instructions are only added to `Uniforms` as suggested and added some comments which I hope makes this clear.

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/illegal-parallel-loop-uniform-write.ll:85
+; CHECK-NEXT:    [[TMP22:%.*]] = add nsw i32 [[TMP21]], 1
+; CHECK-NEXT:    store i32 [[TMP22]], i32* [[ARRAYIDX7_US]], align 4, !llvm.mem.parallel_loop_access !0
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
----------------
david-arm wrote:
> Hi @kmclaughlin, I don't think this looks right sadly. I wonder if now we're marking the store as uniform that you've exposed an existing bug in `collectLoopUniforms` or somewhere else like `handleReplication`? It looks like we're also now treating the the load as uniform, which is wrong in this case because we still want to do the vector load and store out the last lane. I'd expected something like:
> 
>   [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP22]], align 4
>   [[TMP23:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], <i32 1, i32 1, i32 1, i32 1>
>   [[TMP27:%.*]] = extractelement <4 x i32> [[TMP23]], i32 3
>   store i32 [[TMP27]], i32* [[ARRAYIDX7_US]], align 4
> 
Thanks @david-arm, the load instruction was incorrectly being marked as uniform here. I think now that I've changed collectLoopUniforms so that stores are not added to the worklist, the output of this test looks as expected?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112725/new/

https://reviews.llvm.org/D112725