[PATCH] D121899: [LoopVectorize] Optimise away the icmp when tail-folding for some low trip counts

Fri May 20 07:10:40 PDT 2022

david-arm added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.cpp:750-760
+    uint64_t TCVal = 0;
+    if (auto *TC2 = dyn_cast<ConstantInt>(TC))
+      TCVal = TC2->getZExtValue();
+
+    Value *Cond;
+    // When we know there will only be one vector iteration there is no need to
+    // create the comparison, since we already know the answer.
----------------
fhahn wrote:
> david-arm wrote:
> > fhahn wrote:
> > > sdesmalen wrote:
> > > > david-arm wrote:
> > > > > sdesmalen wrote:
> > > > > > nit: How about
> > > > > > 
> > > > > >   Value *ConstCmp = nullptr;
> > > > > >   if (auto *C = dyn_cast<ConstantInt>(TC))
> > > > > >     if (C->getZExtValue() <= State.UF * State.VF.getKnownMinValue())
> > > > > >       ConstCmp = Builder.getInt1(true);
> > > > > >   Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);
> > > > > It would have to be something like this I think:
> > > > > 
> > > > > ```Value *ConstCmp = nullptr;
> > > > > if (auto *C = dyn_cast<ConstantInt>(TC)) {
> > > > >   uint64_t TCVal = C->getZExtValue();
> > > > >   if (TCVal && TCVal  <= State.UF * State.VF.getKnownMinValue())
> > > > >     ConstCmp = Builder.getInt1(true);
> > > > > }
> > > > > Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);```
> > > > > 
> > > > > because I'm pretty sure I found a test where the trip count was actually defined as a zero constant. I'm happy to change it to the code above, but not sure it looks much better to be honest. :)
> > > > If the trip-count is zero, wouldn't `true` be the correct value for the condition?
> > > Hmm,  I am not sure if doing this late in codegen is the right place. If we simplify the condition and effectively remove the loop, it would be good to remove the branch-on-count recipe directly in VPlan. I *think* we should have almost everything needed to do that already in place. Let me check.
> > Hi @fhahn, have you had a chance to look into this at all?
> Unfortunately not yet, I've still got a backlog of other things to work through :(. Hopefully I'll be able to check this week, otherwise I think we should proceed with this patch next week.
So the problem with treating zero-trip counts the same as non-zero trip counts is that the loop vectoriser considers loops like this as having a zero trip count:

  entry:
    br label %loop.body

  loop.body:
    %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.body ]
    %a = extractvalue { i64, i64 } %sv, 0
    %b = extractvalue { i64, i64 } %sv, 1
    %addr = getelementptr i64, i64* %dst, i32 %iv
    %add = add i64 %a, %b
    store i64 %add, i64* %addr
    %iv.next = add nsw i32 %iv, 1
    %cond = icmp ne i32 %iv.next, 0
    br i1 %cond, label %loop.body, label %exit

  exit:
    ret void

even though this loop is actually going to execute UINT32_MAX times before finally the IV overflows back to 0! This came from a real test:

  Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll

If I fold away the comparison and always branch out of the loop then I've changed the behaviour of the original scalar loop. So I think we either have to:

1. Fix the vectoriser to report the trip count as UINT32_MAX for loops like this, or
2. Only apply my optimisation for low trip counts > 0

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121899/new/

https://reviews.llvm.org/D121899