[PATCH] D121899: [LoopVectorize] Optimise away the icmp when tail-folding for some low trip counts
David Sherwood via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri May 20 07:10:40 PDT 2022
david-arm added inline comments.
================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.cpp:750-760
+ uint64_t TCVal = 0;
+ if (auto *TC2 = dyn_cast<ConstantInt>(TC))
+ TCVal = TC2->getZExtValue();
+
+ Value *Cond;
+ // When we know there will only be one vector iteration there is no need to
+ // create the comparison, since we already know the answer.
----------------
fhahn wrote:
> david-arm wrote:
> > fhahn wrote:
> > > sdesmalen wrote:
> > > > david-arm wrote:
> > > > > sdesmalen wrote:
> > > > > > nit: How about
> > > > > >
> > > > > > Value *ConstCmp = nullptr;
> > > > > > if (auto *C = dyn_cast<ConstantInt>(TC))
> > > > > > if (C->getZExtValue() <= State.UF * State.VF.getKnownMinValue())
> > > > > > ConstCmp = Builder.getInt1(true);
> > > > > > Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);
> > > > > It would have to be something like this I think:
> > > > >
> > > > > ```Value *ConstCmp = nullptr;
> > > > > if (auto *C = dyn_cast<ConstantInt>(TC)) {
> > > > > uint64_t TCVal = C->getZExtValue();
> > > > > if (TCVal && TCVal <= State.UF * State.VF.getKnownMinValue())
> > > > > ConstCmp = Builder.getInt1(true);
> > > > > }
> > > > > Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);```
> > > > >
> > > > > because I'm pretty sure I found a test where the trip count was actually defined as a zero constant. I'm happy to change it to the code above, but not sure it looks much better to be honest. :)
> > > > If the trip-count is zero, wouldn't `true` be the correct value for the condition?
> > > Hmm, I am not sure if doing this late in codegen is the right place. If we simplify the condition and effectively remove the loop, it would be good to remove the branch-on-count recipe directly in VPlan. I *think* we should have almost everything needed to do that already in place. Let me check.
> > Hi @fhahn, have you had a chance to look into this at all?
> Unfortunately not yet, I've still got a backlog of other things to work through :(. Hopefully I'll be able to check this week, otherwise I think we should proceed with this patch next week.
So the problem with treating zero-trip counts the same as non-zero trip counts is that the loop vectoriser considers loops like this as having a zero trip count:
entry:
br label %loop.body
loop.body:
%iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.body ]
%a = extractvalue { i64, i64 } %sv, 0
%b = extractvalue { i64, i64 } %sv, 1
%addr = getelementptr i64, i64* %dst, i32 %iv
%add = add i64 %a, %b
store i64 %add, i64* %addr
%iv.next = add nsw i32 %iv, 1
%cond = icmp ne i32 %iv.next, 0
br i1 %cond, label %loop.body, label %exit
exit:
ret void
even though this loop is actually going to execute UINT32_MAX times before finally the IV overflows back to 0! This came from a real test:
Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
If I fold away the comparison and always branch out of the loop then I've changed the behaviour of the original scalar loop. So I think we either have to:
1. Fix the vectoriser to report the trip count as UINT32_MAX for loops like this, or
2. Only apply my optimisation for low trip counts > 0
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121899/new/
https://reviews.llvm.org/D121899
More information about the llvm-commits
mailing list