[PATCH] D154314: [LV] Remove the remainder loop if we know the mask is always true

Robert Dazi via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jul 12 02:42:31 PDT 2023


v01dXYZ added inline comments.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll:34
 ; CHECK:       middle.block:
-; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
-; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
----------------
v01dXYZ wrote:
> david-arm wrote:
> > Allen wrote:
> > > david-arm wrote:
> > > > Allen wrote:
> > > > > david-arm wrote:
> > > > > > I'm guessing that InstCombine does not determine this is guaranteed to always be true? However, I thought that someone did work in the DAGCombiner that will replace this with
> > > > > > 
> > > > > >   br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
> > > > > > 
> > > > > > when vscale is known to be a power of 2? Are you hoping to benefit from eliminating the scalar tail in IR because it helps us to make better decisions later in the pipeline? I can imagine it's beneficial for LTO where the scalar tail could prevent inlining.
> > > > > > 
> > > > > > If I remember correctly, one of the problems with folding away the icmp in InstCombine is that it doesn't have access to the TTI interface, so we cannot query the target.
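
A minimal, self-contained sketch of the pattern being discussed (block and value names are illustrative, and the `vscale_range(1,16)` attribute and the VF of `vscale x 4` are assumptions rather than lines taken from the test):

  define void @middle_block_sketch() vscale_range(1,16) {
  entry:
    %vscale = call i64 @llvm.vscale.i64()
    %vf = mul i64 %vscale, 4            ; VF = vscale x 4
    %n.mod.vf = urem i64 1024, %vf      ; iterations left over for the scalar tail
    %n.vec = sub i64 1024, %n.mod.vf    ; iterations covered by the vector loop
    br label %middle.block

  middle.block:
    ; If 4 * vscale is a power of two, it divides 1024, so %n.mod.vf is 0,
    ; %n.vec is 1024, and this compare folds to true, i.e. the branch becomes
    ;   br i1 true, label %exit, label %scalar.ph
    %cmp.n = icmp eq i64 1024, %n.vec
    br i1 %cmp.n, label %exit, label %scalar.ph

  scalar.ph:                            ; the scalar tail the patch wants to remove
    br label %exit

  exit:
    ret void
  }

  declare i64 @llvm.vscale.i64()
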
> > > > > I may not have caught your idea. Are you saying that the current optimization needs to be handled in InstCombine?
> > > > Well, I'm just trying to understand what this patch is trying to achieve, that's all. I'm not against it, because it does clean up the IR generated by the vectoriser. However, I'm not sure I expect you to see many real performance gains from doing this, because we should already delete the scalar tail during codegen.
> > > > 
> > > > It might also be worth investigating whether or not InstCombine already optimises the urem calculation, similar to what this patch (https://reviews.llvm.org/D129609) did in codegen.
> > > Yes, I think it's going to be beneficial beyond code size in the following scenario.
> > >   # When the loop is a nested loop and the trip count of the inner loop is a power of 2, eliminating the tail block reduces branch jumps.
> > > 
> > > BTW, the AArch64 backend already supports **isVScaleKnownToBeAPowerOfTwo** as of patch https://reviews.llvm.org/D146199.
> > Sure, you're absolutely right about the benefit of deleting the tail! I'm just trying to say that we should already be deleting the tail as part of codegen. If you look at the final assembly output for the loop, I'd hope there is no tail, because the DAGCombiner + MachineDCE should have killed it off.
> > 
> > So this patch is then about trying to delete the tail *even earlier* in IR. It's a nice cleanup, but I was thinking it would be nice if InstCombine could do this for us because that would be even more powerful.
> > 
> Do you know where to start in order to understand how the following two passes could possibly simplify the comparison result?
> 
> * `InstCombine`
> * `DAGCombiner`
> 
> They would need to propagate the fact that vscale is a power of two within a given interval in order to simplify `urem` operations, similar to what is done by interval partition or scalar evolution. Do you think those two passes are powerful enough to do this kind of analysis?
> 
*edit: I meant range analysis, not interval partition.
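
To make the last question concrete, a minimal sketch of the facts such an analysis would have to combine in order to fold the `urem` (the function name is made up, and the `vscale_range(1,16)` attribute is an assumption):

  ; * vscale_range(1,16)  =>  vscale is in [1, 16]
  ; * TTI reports vscale is a power of two (D146199 on AArch64)
  ; => 4 * vscale is a power of two in [4, 64], so it divides 1024
  ; => %rem is 0, and the middle-block compare is therefore always true
  define i64 @urem_by_vf() vscale_range(1,16) {
    %vscale = call i64 @llvm.vscale.i64()
    %vf = mul i64 %vscale, 4
    %rem = urem i64 1024, %vf
    ret i64 %rem
  }

  declare i64 @llvm.vscale.i64()

As noted earlier in the thread, the `vscale_range` bounds are available as a function attribute, but the power-of-two guarantee comes from TTI, which `InstCombine` cannot query.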


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154314/new/

https://reviews.llvm.org/D154314


