[PATCH] D40645: [SCEV][NFC] Check NoWrap flags before lexicographical comparison of SCEVs

Sat Aug 18 11:46:11 PDT 2018

On Fri, Aug 17, 2018 at 6:23 AM Roman Tereshin <rtereshin at apple.com> wrote:
>
> Hi Maxim,
>
> + Eli Friedman
>
> Thank you for a quick response!
>
> and the fact that we may end up assigining NoWrap flags after the SCEV was created is extremely annoying by many means
>
>
> I agree. I would also like to elaborate on this but a bit later.
>
> We've put a lot of efforts to avoid creation of huge SCEVs
>
>
> Feels like this is true, and also for wrong reasons. I found that asymptotically the worst offenders are getUsedLoops / addToLoopUseLists (O(n) where n is the number of nodes in a SCEV, called upon creation of each new SCEV node, a patch is coming soon to fix it) and checkValidity (O(n), called upon each getSCEV call if the result is cached, this one is trickier), having at least one of these around makes creating a large SCEV a O(n^2) deal (transformations are probably even more expensive, true, but they are limited by a small constant depth, while this n is not limited by anything).

Some of this O(n^2) behavior can be fixed trading off memory for compile time:

 - We can get rid of the scan in addToLoopUseLists by keeping a map
(separately as a hashtable or a linked list of loops directly present
in SCEV expressions) from SCEV expressions to the list of used loops.
Then instead of doing a scan we can just look at the loops used by the
immediate operations of a SCEV expression, S, to infer the loops used
by S.

 - We can get rid of checkValidity by keeping a proper use list
between SCEV expressions.  Then when a SCEVUnknown is deleted, we can
walk all of its transitive uses and tag them as invalid.  What we do
today (scan the entire expression for every SCEV) is kind of silly --
we're making a very common operation slow to work around a fairly rare
situation (SCEV preserved across value deletion).  Once we have this
we can reduce the memory used by LoopUsers.  Today it tracks all
transitive users of loops but with use lists it will only have to
track AddRecs and we can construct the transitive set of users in
ScalarEvolution::forgetLoop.

I never got around to doing these because I didn't have enough time to
run the experiments we'd need to run to ensure that these won't
regress memory usage.  Moreover, it wasn't clear to me //what// to
benchmark for compiler memory usage.  But I'll be happy to review
patches if someone wants to take these on.

-- Sanjoy

>
> These are the reason why a patch like https://reviews.llvm.org/D32239 ("[SCEV] Make SCEV or modeling more aggressive.") blows up compile time even when it does not cause a stack overflow. For these, however, I have good test cases to benchmark on (see https://bugs.llvm.org/show_bug.cgi?id=32731 and more is coming).
>
>
> For CompareSCEVComplexity / GroupByComplexity I do not have good benchmarks, unfortunately. I would assume you have some as something motivated your patch!
>
> it still makes sense to do this to save some compile time for adds/muls. Please feel free to either prohibit this for AddRecs or just revert this patch with adding a TODO that it still is a valid thing for add/muls
>
>
> Thank you!
>
> I was thinking adding a simple ranking system (use a few free bits in each SCEV node to store / cache / memoize an integer complexity rank) and use it as a fast path as soon as the particular ordering is not important anymore (after the comparisons by a type and the loop dominance relation for AddRecs, etc.), resorting back to lexicographical comparison only if the ranks are equal.
>
> I would need some benchmarks to evaluate the impact however. Would greatly appreciate if you could find something in your IR archives.
>
> Thank you,
> Roman
>
>
> On Aug 16, 2018, at 11:35 PM, Maxim Kazantsev <max.kazantsev at azul.com> wrote:
>
> UPD: I've looked into the code and we seem to do it only for AddRecs. I don't mind if you revert the patch with checking in your test.
>
> From: Maxim Kazantsev
> Sent: Friday, August 17, 2018 1:32 PM
> To: 'rtereshin at apple.com' <rtereshin at apple.com>
> Cc: sanjoy at playingwithpointers.com; listmail at philipreames.com; javed.absar at arm.com; llvm-commits <llvm-commits at lists.llvm.org>; junbuml at codeaurora.org; wmi at google.com; marcello.maggioni at gmail.com; Justin Bogner <jbogner at apple.com>
> Subject: RE: [PATCH] D40645: [SCEV][NFC] Check NoWrap flags before lexicographical comparison of SCEVs
>
> Hi Roman,
>
> Thanks for finding this issue. I think you are right, and the fact that we may end up assigining NoWrap flags after the SCEV was created is extremely annoying by many means. I think that the underlying problems against which it was made could've been fixed differently. We've put a lot of efforts to avoid creation of huge SCEVs.
>
> As far as I'm aware, we only set NoWraps flag lately for AddRecs and never for Adds/Muls. If it is so, it still makes sense to do this to save some compile time for adds/muls. Please feel free to either prohibit this for AddRecs or just revert this patch with adding a TODO that it still is a valid thing for add/muls and checking in your example.
>
> Thanks,
> Max
>
>
> From: rtereshin at apple.com [mailto:rtereshin at apple.com]
> Sent: Friday, August 17, 2018 1:22 PM
> To: Maxim Kazantsev <max.kazantsev at azul.com>
> Cc: sanjoy at playingwithpointers.com; listmail at philipreames.com; javed.absar at arm.com; llvm-commits <llvm-commits at lists.llvm.org>;junbuml at codeaurora.org; wmi at google.com; marcello.maggioni at gmail.com; Justin Bogner <jbogner at apple.com>
> Subject: Re: [PATCH] D40645: [SCEV][NFC] Check NoWrap flags before lexicographical comparison of SCEVs
>
>
> Hi all,
>
> I just hit a real-world scenario where the problem I've described in the previous comment actually occurs and causes 2 identical SCEV expressions to have different order of operands and therefore compare not equal, which in its turn prevents LoadStoreVectorizer from vectorizing a pair of consecutive loads.
> With the patch reverted the issue goes away.
>
> Please run
>
> $ opt -load-store-vectorizer reproducer.ll -S -o -
>
> to see the difference in vectorization.
>
> If the following patch is also applied:
>
> diff --git a/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
> index e2ee87b2086..74be4d4c87b 100644
> --- a/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
> +++ b/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
> @@ -445,6 +445,12 @@ bool Vectorizer::lookThroughComplexAddresses(Value *PtrA, Value *PtrB,
>    const SCEV *OffsetSCEVB = SE.getSCEV(OpB);
>    const SCEV *C = SE.getConstant(IdxDiff.trunc(BitWidth));
>    const SCEV *X = SE.getAddExpr(OffsetSCEVA, C);
> +  if (OpA->getName() == "preheader.address0.idx" &&
> +      OpB->getName() == "common.address.idx") {
> +    OffsetSCEVA->dump();
> +    X->dump();
> +    OffsetSCEVB->dump();
> +  }
>    return X == OffsetSCEVB;
>  }
>
> you will see the following output as well:
>
> (((56 + (zext i2 (trunc i32 %tmp2 to i2) to i32) + (4 * (zext i1 (trunc i32 (%tmp2 /u 4) to i1) to i32))<nuw><nsw>) * %tmp) + (4 * (zext i1 (trunc i32 (%tmp2 /u 2) to i1) to i32))<nuw><nsw>)
> (1 + ((56 + (zext i2 (trunc i32 %tmp2 to i2) to i32) + (4 * (zext i1 (trunc i32 (%tmp2 /u 4) to i1) to i32))<nuw><nsw>) * %tmp) + (4 * (zext i1 (trunc i32 (%tmp2 /u 2) to i1) to i32))<nuw><nsw>)
> (1 + (4 * (zext i1 (trunc i32 (%tmp2 /u 2) to i1) to i32))<nuw><nsw> + ((56 + (zext i2 (trunc i32 %tmp2 to i2) to i32) + (4 * (zext i1 (trunc i32 (%tmp2 /u 4) to i1) to i32))<nuw><nsw>) * %tmp))
>
> As you can see, second and third SCEVAddExpr's are the same but sorted differently.
>
> On a larger example (the source of the reproducer.ll attached, which is a bugpoint) I have seen even weirder behavior: adding a constant to an existing SCEV changes the order of the existing terms,
> for instance, getAddExpr(1, ((A * B) + (C * D))) returns (1 + (C * D) + (A * B)).
>
>
>
>
> Thanks,
> Roman
>
>
>
> target triple = "x86_64--"
>
> @global_value0 = external constant i32
> @global_value1 = external constant i32
> @other_value = external global float
> @a = external global float
> @b = external global float
> @c = external global float
> @d = external global float
> @plus1 = external global i32
> @cnd = external global i8
>
> ; Function Attrs: nounwind
> define void @main() local_unnamed_addr #0 {
> entry:
>   %tmp = load i32, i32* @global_value0, !range !0
>   %tmp2 = load i32, i32* @global_value1
>   %and.i.i = and i32 %tmp2, 2
>   %add.nuw.nsw.i.i = add nuw nsw i32 %and.i.i, 0
>   %mul.i.i = shl nuw nsw i32 %add.nuw.nsw.i.i, 1
>   %and6.i.i = and i32 %tmp2, 3
>   %and9.i.i = and i32 %tmp2, 4
>   %add.nuw.nsw10.i.i = add nuw nsw i32 %and6.i.i, %and9.i.i
>   %conv3.i42.i = add nuw nsw i32 %mul.i.i, 1
>   %reass.add346.7 = add nuw nsw i32 %add.nuw.nsw10.i.i, 56
>   %reass.mul347.7 = mul nuw nsw i32 %tmp, %reass.add346.7
>   %add7.i.7 = add nuw nsw i32 %reass.mul347.7, 0
>   %preheader.address0.idx = add nuw nsw i32 %add7.i.7, %mul.i.i
>   %preheader.address0.idx.zext = zext i32 %preheader.address0.idx to i64
>   %preheader.load0.address = getelementptr inbounds float, float* @other_value, i64 %preheader.address0.idx.zext
>   %preheader.load0. = load float, float* %preheader.load0.address, align 4, !tbaa !1
>   %common.address.idx = add nuw nsw i32 %add7.i.7, %conv3.i42.i
>   %preheader.header.common.address.idx.zext = zext i32 %common.address.idx to i64
>   %preheader.load1.address = getelementptr inbounds float, float* @other_value, i64 %preheader.header.common.address.idx.zext
>   %preheader.load1. = load float, float* %preheader.load1.address, align 4, !tbaa !1
>   br label %for.body23
>
> for.body23:                                       ; preds = %for.body23, %entry
>   %loop.header.load0.address = getelementptr inbounds float, float* @other_value, i64 %preheader.header.common.address.idx.zext
>   %loop.header.load0. = load float, float* %loop.header.load0.address, align 4, !tbaa !1
>   %reass.mul343.7 = mul nuw nsw i32 %reass.add346.7, 72
>   %add7.i286.7.7 = add nuw nsw i32 %reass.mul343.7, 56
>   %add9.i288.7.7 = add nuw nsw i32 %add7.i286.7.7, %mul.i.i
>   %loop.header.address1.idx = add nuw nsw i32 %add9.i288.7.7, 1
>   %loop.header.address1.idx.zext = zext i32 %loop.header.address1.idx to i64
>   %loop.header.load1.address = getelementptr inbounds float, float* @other_value, i64 %loop.header.address1.idx.zext
>   %loop.header.load1. = load float, float* %loop.header.load1.address, align 4, !tbaa !1
>   store float %preheader.load0., float* @a, align 4, !tbaa !1
>   store float %preheader.load1., float* @b, align 4, !tbaa !1
>   store float %loop.header.load0., float* @c, align 4, !tbaa !1
>   store float %loop.header.load1., float* @d, align 4, !tbaa !1
>   %loaded.cnd = load i8, i8* @cnd
>   %condition = trunc i8 %loaded.cnd to i1
>   br i1 %condition, label %for.body23, label %exit
>
> exit:
>   ret void
> }
>
> attributes #0 = { nounwind }
>
> !0 = !{i32 0, i32 65536}
> !1 = !{!2, !2, i64 0}
> !2 = !{!"float", !3, i64 0}
> !3 = !{!"omnipotent char", !4, i64 0}
> !4 = !{!"Simple C++ TBAA"}
>
>
> > On Aug 10, 2018, at 5:13 PM, Roman Tereshin via Phabricator <reviews at reviews.llvm.org> wrote:
> >
> > rtereshin added inline comments.
> > Herald added a subscriber: javed.absar.
> >
> >
> > ================
> > Comment at: llvm/trunk/lib/Analysis/ScalarEvolution.cpp:699
> > +    if (LA->getNoWrapFlags() != RA->getNoWrapFlags())
> > +      return (int)LA->getNoWrapFlags() - (int)RA->getNoWrapFlags();
> > +
> > ----------------
> > Hi everyone,
> >
> > Unfortunately, wrapping flags are not a part of SCEV's identity (they do not participate in computing a hash value or in equality comparisons) and in fact they could be assigned after the fact w/o rebuilding a SCEV.
> >
> > At least, last time I checked it was that way. Grep for const_cast's to see quite a few of example, AFAIR all for AddRec's.
> >
> > So, if 2 expressions get built in 2 slightly different ways: one with flags set in the beginning, the other with the flags attached later on, we may end up with 2 expressions which are exactly the same but have their operands swapped in one of the commutative N-ary expressions, and at least one of them will have "sorted by complexity" invariant broken.
> >
> > 2 identical SCEV's won't compare equal by pointer comparison as they are supposed to.
> >
> > I didn't verify this with an experiment yet, please correct me if I'm wrong.
> >
> > Thanks,
> > Roman
> >
> >
> > Repository:
> >  rL LLVM
> >
> > https://reviews.llvm.org/D40645
> >
> >
> >
>
>