[PATCH] D111846: [LV] Drop integer poison-generating flags from instructions that need predication

Tue Nov 9 02:33:19 PST 2021

dcaballe added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:1223
+            State.VF.isVector() &&
+            Legal->blockNeedsPredication(Instr->getParent()) &&
+            cast<Operator>(Instr)->hasPoisonGeneratingFlags())
----------------
fhahn wrote:
> dcaballe wrote:
> > dcaballe wrote:
> > > fhahn wrote:
> > > > Is there a way to avoid querying information from legal here? 
> > > > 
> > > > As we are interested in any operation contributing to an address computation of a masked memory operation, could we only seed the worklist with addresses of masked operations instead (both VPWidenMemoryInstructionRecipe & VPInterleaveRecipe have `hasMask`/`getMask` helpers)? Then we would not need to check legal here, right? 
> > > > 
> > > > Checking `Instrs` location here might not be sufficient in any case I think. There could be cases where `Instr` is in a block that does not need predication, but feeds a memory instruction in a block that needs predication. In those case, I think we should also drop the flags from `Instr`. In any case, it might be good to add such a test case.
> > > > As we are interested in any operation contributing to an address computation of a masked memory operation, could we only seed the worklist with addresses of masked operations instead (both VPWidenMemoryInstructionRecipe & VPInterleaveRecipe have hasMask/getMask helpers)? Then we would not need to check legal here, right?
> > > 
> > > I had tried using `getMask` already. Unfortunately, there are cases where the loads are predicated but the mask has been dropped in favor of executing them speculatively unmasked. See https://reviews.llvm.org/D66688 and test `load-deref-pred.ll`. We should drop the flags from these cases as well and the only way I found is using `Legal->blockNeedsPredication`. I know it's not ideal but `Legal` is still widely used during codegen.
> > > 
> > > 
> > > > Checking Instrs location here might not be sufficient in any case I think. There could be cases where Instr is in a block that does not need predication, but feeds a memory instruction in a block that needs predication. In those case, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.
> > > 
> > > I think that makes sense. It would a case similar to the one that exposed the bug but moving the `subi` instruction outside the condition. I'll give it a try!
> > Using `getMask` can't be used for the interleave recipe either. There are interleave groups that are formed out of non-predicated loads/stores. However, the resulting interleave group may still require a mask if the to-be-generated widen load reads more elements than those needed for the group. We should not drop poison flags in those cases. 
> > I had tried using getMask already. Unfortunately, there are cases where the loads are predicated but the mask has been dropped in favor of executing them speculatively unmasked. See https://reviews.llvm.org/D66688 and test load-deref-pred.ll. We should drop the flags from these cases as well and the only way I found is using Legal->blockNeedsPredication. I know it's not ideal but Legal is still widely used during codegen.
> 
> Hmm, that's unfortunate! There are still plenty of uses of `Legal`, but each new use we add makes things more difficult to transition. Let me take a look if there's another alternative, but if there's none it's not the end of the world for now.
Sure! Note that `blockNeedsPredication` is used in quite a few places so all of them should be addressed in the same way.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:1229
+      // Add new definitions to the worklist.
+      for (VPValue *operand : cast<VPRecipeBase>(CurDef)->operands())
+        if (VPDef *OpDef = operand->getDef())
----------------
fhahn wrote:
> nlopes wrote:
> > dcaballe wrote:
> > > dcaballe wrote:
> > > > fhahn wrote:
> > > > > Do we in some cases also need to drop flags from recipes other then VPReplicateRecipe?
> > > > > 
> > > > > If we need to drop flags from recipes in unpredicted blocks (see comment above). then we could have cases where a vector operation (e.g. VPWidenRecipe) feeds the address computation (the first lane will get extracted). An example could be
> > > > > 
> > > > > ```
> > > > > define void @drop_vector(float* noalias nocapture readonly %input,
> > > > >                                  float* %output, i64* noalias %ints) local_unnamed_addr #0 {
> > > > > entry:
> > > > >   br label %loop.header
> > > > > 
> > > > > loop.header:
> > > > >   %iv = phi i64 [ 0, %entry ], [ %iv.inc, %if.end ]
> > > > >   %i23 = icmp eq i64 %iv, 0
> > > > >   %gep = getelementptr inbounds i64, i64* %ints, i64 %iv
> > > > >   %lv = load i64, i64* %gep
> > > > >   %i27 = sub nuw nsw i64 %iv, 1
> > > > >   %r = add i64 %lv, %i27
> > > > >   store i64 %r, i64* %gep
> > > > >   br i1 %i23, label %if.end, label %if.then
> > > > > 
> > > > > if.then:
> > > > >   %i29 = getelementptr inbounds float, float* %input, i64 %i27
> > > > >   %i30 = load float, float* %i29, align 4, !invariant.load !0
> > > > >   br label %if.end
> > > > > 
> > > > > if.end:
> > > > >   %i34 = phi float [ 0.000000e+00, %loop.header ], [ %i30, %if.then ]
> > > > >   %i35 = getelementptr inbounds float, float* %output, i64 %iv
> > > > >   store float %i34, float* %i35, align 4
> > > > >   %iv.inc = add nuw nsw i64 %iv, 1
> > > > >   %exitcond = icmp eq i64 %iv.inc, 4
> > > > >   br i1 %exitcond, label %loop.exit, label %loop.header
> > > > > 
> > > > > loop.exit:
> > > > >   ret void
> > > > > }
> > > > > attributes #0 = { noinline nounwind uwtable "target-features"="+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl" }
> > > > > 
> > > > > !0 = !{}
> > > > > 
> > > > > ```
> > > > I assumed that address computation would be uniform for those cases and left scalar, wouldn't it?. I'll give it a try but there has been some back and forth wrt dropping the flags from vector instructions already. Based on previous feedback, I'm not sure if we should drop the flags from them when there are lanes that won't generate poison. I would need some clarification before proceeding with those changes (@nlopes). 
> > > Looking closely at your example, the poison value generated by 'sub' in the first iteration of the loop is actually used in the scalar version of the code (it is stored). For that reason, my impression is that having the `nuw` flag in the input `sub` would be incorrect, isn't that the case?
> > > 
> > I mentioned this issue before: what happens if a value was already poison in the first place?
> > 
> > I suggest you commit this patch first, as it addresses half of the problem, and then we can discuss what's the best way to fix the second part. Just dropping flags from even all instructions within the function isn't sufficient as you may get a poison value as input (change the example above to have `[%arg, entry ]` rather than `[ 0, entry ]` for %iv. Though you only need to patch the code for the cases where you cannot prove the code would execute in the scalar version. The right fix isn't trivial.
> I tried reading the discussion again, but I'm not sure why it would matter whether the `sub` gets widened to a vector or not for the test case. In the test case, the issue should still be the same as in the motivating test case, the flags on the instruction can poison (some) vector lanes, independent of whether the inputs were poison originally.
> 
> In the example, the `sub nuw nsw` gets widened, but similar to the other test cases, the first lane is used in the address computation of the masked memory operation. Effectively I am not sure I see the  difference whether we compute only the first lane (because we scalarized  the `sub`) or if we compute all lanes and use the first lane only in a UB-on-poison op and have other non-UB-on-poison uses of the full vector (storing a vector with (some) poison lanes should not be UB).
> 
> If we only restrict this to VPReplicateRecipe, it seems like we can easily run into slight variations of the motivating test case where we still have the same issue as the patch is trying to fix.
> 
> Now if we support non-block dropping flags from non-BB-local instruction/recipes, there are cases where we may not need to drop flags, e.g. because we can prove that poison generated by the instruction cause UB unconditionally. In those cases, I think for now our priority should be correctness, as it’s unlikely for poison-generating flags to make a notable difference for the vector loops during codegen.
Semantics on poison-generating flags on vector instructions are not clear to me, even after the discussion. I added support for those cases initially, then removed it... It should be easy to add them again but I want to make sure we agree on that before introducing any changes. If those cases need further discussion I'll be happy to address them in a separate commit. I'm not too concerned about those cases since I don't think LLVM can replace the value of a specific lane with a poison value at this point. 

> Now if we support non-block dropping flags from non-BB-local instruction/recipes, there are cases where we may not need to drop flags, e.g. because we can prove that poison generated by the instruction cause UB unconditionally. In those cases, I think for now our priority should be correctness, as it’s unlikely for poison-generating flags to make a notable difference for the vector loops during codegen.

Not sure I fully understand this part but the key point, as discussed previously, is the usage of the poison value. We may have an instruction that may generates a poison value unconditionally. It should be ok as long as the potential poison value is not used. We should drop the flags if the potential poison value happens to be used after vectorization. That's exactly what the latest implementation is doing.

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/drop-poison-generating-flags.ll:82
   %iv = phi i64 [ 0, %entry ], [ %iv.inc, %if.end ]
-  %i27 = sub nuw nsw i64 %iv, 1
-  %i29 = getelementptr inbounds float, float* %input, i64 %i27
+  %i27 = sub i64 %iv, 1
+  %i29 = getelementptr float, float* %input, i64 %i27
----------------
fhahn wrote:
> Should `nuw nsw` be retained in the source here and the same for `inbounds`. It looks like they got dropped.
This is the test you asked me to add, isn't it?

>Checking Instrs location here might not be sufficient in any case I think. There could be cases where Instr is in a block that does not need predication, but feeds a memory instruction in a block that needs predication. In those case, I think we should also drop the flags from Instr. In any case, it might be good to add such a test case.`

`sub` and `getelementptr` are not predicated but feed a load that is predicated.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111846/new/

https://reviews.llvm.org/D111846