[llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit unrolling.

Thu May 8 10:18:07 PDT 2014

That is what I got out of "clang -S -emit-llvm -O3 -mllvm -x86-partial-unrolling-threshold=0 -mllvm -x86-partial-max-branches=0 Bubblesort.c"

If you want me to prep it some other way I can. The source is part of the test-suite if you would prefer to look at it yourself: http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Stanford/Bubblesort.c?revision=6132&view=markup

Louis

On May 8, 2014, at 2:03 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> Louis,
> 
> Thanks! I don't think this is the un-unrolled version, however. Running the unroller shows:
> 
> Loop Unroll: F[Bubble] Loop %
>  Loop Size = 40
>  Too large to fully unroll with count: 8 because size: 320>150
>  could not unroll partially
> Loop Unroll: F[Bubble] Loop %
>  Loop Size = 49
>  Too large to fully unroll with count: 8 because size: 392>150
>  could not unroll partially
> 
> Could you please double-check?
> 
> -Hal
> 
> ----- Original Message -----
>> From: "Louis Gerbarg" <lgg at apple.com>
>> To: "Hal J. Finkel" <hfinkel at anl.gov>
>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Benjamin Kramer" <benny.kra at gmail.com>
>> Sent: Wednesday, May 7, 2014 6:08:51 PM
>> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit
>> unrolling.
>> 
>> 
>> Per IRC, here is the ll file of the un-unrolled Bubblesort.
>> Experimentally on Haswell I see us return to approximately our
>> original performance at a micro-op threshold of 44 (and around a
>> 15-20% regression between that and 43). This testing is all on my
>> desktop, which is not necessarily quiescent, and the test is somewhat
>> short, so I am not sure whether there might be a bit of a tail for
>> the last couple of percentage points.
>> 
>> 
>> Numbers were obtained as follows:
>> 
>> ../bad/bin/opt -O3 -x86-partial-unrolling-threshold=43
>> -x86-partial-max-branches=40000 -S < Bubblesort.ll >> BS43.ll
>> 
>> ../bad/bin/opt -O3 -x86-partial-unrolling-threshold=44
>> -x86-partial-max-branches=40000 -S < Bubblesort.ll >> BS44.ll
>> 
>> 
>> Louis
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On May 6, 2014, at 10:03 AM, Louis Gerbarg < lgg at apple.com > wrote:
>> 
>> 
>> 
>> 
>> 
>> On May 6, 2014, at 9:41 AM, Hal Finkel < hfinkel at anl.gov > wrote:
>> 
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Louis Gerbarg" < lgg at apple.com >
>> To: "Hal Finkel" < hfinkel at anl.gov >
>> Cc: "llvm-commits" < llvm-commits at cs.uiuc.edu >, "Benjamin Kramer" <
>> benny.kra at gmail.com >
>> Sent: Tuesday, May 6, 2014 11:27:44 AM
>> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial
>> unrolling, use the PartialThreshold to limit
>> unrolling.
>> 
>> 
>> 
>> 
>> On May 6, 2014, at 12:26 AM, Hal Finkel < hfinkel at anl.gov > wrote:
>> 
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Louis Gerbarg" < lgg at apple.com >
>> To: "Benjamin Kramer" < benny.kra at gmail.com >
>> Cc: "llvm-commits" < llvm-commits at cs.uiuc.edu >
>> Sent: Monday, May 5, 2014 9:21:32 PM
>> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial
>> unrolling, use the PartialThreshold to limit
>> unrolling.
>> 
>> I am seeing what appear to be significant regressions on some of the
>> nightly tests from this patch. For example, the following benchmarks all
>> show 50-100% slowdowns when I apply the patch:
>> 
>> SingleSource/Benchmarks/Stanford/Bubblesort
>> SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog
>> MultiSource/Benchmarks/FreeBench/neural/neural
>> 
>> To be clear, on what system is that?
>> 
>> 
>> I’ve confirmed the regressions personally on Haswell. Looking at our
>> buildbots, it also appears to be occurring on Sandy Bridge and Penryn
>> (though with some variation: the Bubblesort regression is ~100% on
>> Haswell and 120% on Penryn).
>> 
>> 
>> 
>> 
>> 
>> Generally speaking, if you use -mllvm
>> -x86-partial-unrolling-threshold=N as you increase the value of N
>> does that improve things for you? The current unrolling thresholds
>> were taken from the optimization manual (I'm guessing the value is
>> something like 28 for your system; see
>> X86TTI::getUnrollingPreferences in
>> lib/Target/X86/X86TargetTransformInfo.cpp), but were not actually
>> tuned experimentally. Perhaps this could use some improvement.
>> 
>> 
>> 
>> 
>> No. In fact, the regressed numbers are all stable and appear to occur
>> at any value of -x86-partial-unrolling-threshold. I also see the
>> regressions even without r207940 if I pass
>> -x86-partial-unrolling-threshold. Some quick and dirty numbers from
>> my Haswell system:
>> 
>> That's interesting. Thanks for helping with this! My hypothesis is
>> that either:
>> 1. None of those are large enough (try 60000 and see if it does
>> anything).
>> 2. You're hitting the branch cutoff: try setting
>> -x86-partial-max-branches=40000 (or something else really large) and
>> then try again.
>> 
>> 
>> #2 appears to be the main issue, though it appears an unrolling
>> threshold of 28 is also low enough to be causing some regressions
>> once the cut off is taken care of, at least when r207940 is in the
>> mix.
>> 
>> 
>> Louis
>> 
>> 
>> 
>> WITHOUT r207940 and with just larger
>> -x86-partial-unrolling-thresholds:
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=0 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.032s
>> user 0m0.028s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.014s
>> user 0m0.011s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=60000 Bubblesort.c && time ./a.out
>> > /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=600000 Bubblesort.c && time ./a.out
>> > /dev/null
>> 
>> 
>> real 0m0.031s
>> user 0m0.027s
>> sys 0m0.001s
>> 
>> 
>> 
>> 
>> WITHOUT r207940 and with -x86-partial-max-branches:
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=28 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.015s
>> user 0m0.011s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=0 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.014s
>> user 0m0.011s
>> sys 0m0.001s
>> 
>> 
>> 
>> 
>> WITH r207940 and with -x86-partial-max-branches:
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=0 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=28 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.017s
>> user 0m0.014s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=28 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.016s
>> user 0m0.013s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=40 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.017s
>> user 0m0.013s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=60 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.015s
>> user 0m0.011s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=60 -mllvm
>> -x86-partial-max-branches=40000 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.014s
>> user 0m0.011s
>> sys 0m0.001s
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> BASELINE:
>> 
>> bash-3.2$ ../good/bin/clang -O3 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.021s
>> user 0m0.011s
>> sys 0m0.001s
>> 
>> 
>> WITHOUT r207940:
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=0 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.032s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=18 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=28 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=40 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=60 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> 
>> 
>> 
>> 
>> bash-3.2$ ../good/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=600 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> 
>> 
>> WITH r207940:
>> bash-3.2$ ../bad/bin/clang -O3 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=0 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=18 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=28 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=40 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=60 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.031s
>> user 0m0.028s
>> sys 0m0.001s
>> bash-3.2$ ../bad/bin/clang -O3 -mllvm
>> -x86-partial-unrolling-threshold=600 Bubblesort.c && time ./a.out >
>> /dev/null
>> 
>> 
>> real 0m0.030s
>> user 0m0.027s
>> sys 0m0.001s
>> bash-3.2$
>> 
>> 
>> Louis
>> 
>> 
>> -Hal
>> 
>> 
>> 
>> 
>> Louis
>> 
>> On May 5, 2014, at 3:09 AM, Benjamin Kramer < benny.kra at gmail.com >
>> wrote:
>> 
>> 
>> 
>> 
>> On 05.05.2014, at 01:18, Nadav Rotem < nrotem at apple.com > wrote:
>> 
>> 
>> 
>> Hi Ben,
>> 
>> Thanks for working on this. Overall it sounds like a good change
>> and unrolling 8 times sounds way too high, even for small loops.
>> Did you get a chance to measure the performance difference of
>> this patch?
>> 
>> I didn't find any significant runtime change in the test suite or
>> when trying some of the synthetic benchmarks that were showing
>> extreme unrolling. Code size is a bit better though.
>> 
>> I initially observed this behavior when looking into the vectorizer
>> ( http://llvm.org/bugs/show_bug.cgi?id=14985 ). For the trivial
>> loop in the test case we used to unroll 2x in the loop vectorizer
>> (that's a good thing) and then up to another 8x in the loop
>> unroller, when we're targeting core2 or higher. I asked Hal and he
>> agreed that we were unrolling too much.
>> 
>> I guess it makes sense to actually use the threshold derived from
>> the processor manuals to drive unrolling instead of assuming that
>> more unrolling is better :)
>> 
>> - Ben
>> 
>> 
>> 
>> 
>> Thanks,
>> Nadav
>> 
>> 
>> On May 4, 2014, at 12:12 PM, Benjamin Kramer
>> < benny.kra at googlemail.com > wrote:
>> 
>> 
>> 
>> Author: d0k
>> Date: Sun May 4 14:12:38 2014
>> New Revision: 207940
>> 
>> URL: http://llvm.org/viewvc/llvm-project?rev=207940&view=rev
>> Log:
>> LoopUnroll: If we're doing partial unrolling, use the
>> PartialThreshold to limit unrolling.
>> 
>> Otherwise we use the same threshold as for complete unrolling,
>> which is
>> way too high. This made us unroll any loop smaller than 150
>> instructions
>> by 8 times, but only if someone specified -march=core2 or better,
>> which happens to be the default on darwin.
>> 
>> Modified:
>> llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
>> llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
>> 
>> Modified: llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
>> URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp?rev=207940&r1=207939&r2=207940&view=diff
>> ==============================================================================
>> --- llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
>> (original)
>> +++ llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp Sun May
>> 4 14:12:38 2014
>> @@ -238,9 +238,12 @@ bool LoopUnroll::runOnLoop(Loop *L, LPPa
>> return false;
>> }
>> uint64_t Size = (uint64_t)LoopSize*Count;
>> - if (TripCount != 1 && Size > Threshold) {
>> - DEBUG(dbgs() << " Too large to fully unroll with count: "
>> << Count
>> - << " because size: " << Size << ">" << Threshold <<
>> "\n");
>> + if (TripCount != 1 &&
>> + (Size > Threshold || (Count != TripCount && Size >
>> PartialThreshold))) {
>> + if (Size > Threshold)
>> + DEBUG(dbgs() << " Too large to fully unroll with count:
>> " << Count
>> + << " because size: " << Size << ">" <<
>> Threshold << "\n");
>> +
>> bool AllowPartial = UserAllowPartial ? CurrentAllowPartial :
>> UP.Partial;
>> if (!AllowPartial && !(Runtime && TripCount == 0)) {
>> DEBUG(dbgs() << " will not try to unroll partially because
>> "
>> 
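The change to the size check in the hunk above can be restated as two small predicates. This is a minimal sketch, not the pass itself: the names unrolledSize, tooLargeBefore and tooLargeAfter are made up for illustration, the PartialThreshold of 40 is an arbitrary example value rather than one taken from any target, and a TripCount of 0 is assumed to stand for a trip count that is not known at compile time.

```cpp
#include <cstdint>

// Size of the unrolled loop body, computed the same way as in the hunk:
//   uint64_t Size = (uint64_t)LoopSize * Count;
constexpr std::uint64_t unrolledSize(unsigned LoopSize, unsigned Count) {
  return static_cast<std::uint64_t>(LoopSize) * Count;
}

// Condition before r207940: only the full-unroll Threshold is consulted.
constexpr bool tooLargeBefore(unsigned LoopSize, unsigned Count,
                              unsigned TripCount, unsigned Threshold) {
  return TripCount != 1 && unrolledSize(LoopSize, Count) > Threshold;
}

// Condition after r207940: when Count != TripCount (a partial unroll),
// the size is additionally checked against PartialThreshold.
constexpr bool tooLargeAfter(unsigned LoopSize, unsigned Count,
                             unsigned TripCount, unsigned Threshold,
                             unsigned PartialThreshold) {
  return TripCount != 1 &&
         (unrolledSize(LoopSize, Count) > Threshold ||
          (Count != TripCount &&
           unrolledSize(LoopSize, Count) > PartialThreshold));
}

// Example values (hypothetical except for 150 and the count of 8, which
// appear in the commit log): a 16-instruction loop unrolled 8 times.
static_assert(unrolledSize(16, 8) == 128, "16 * 8");
static_assert(!tooLargeBefore(16, 8, 0, 150), "old check accepts count 8");
static_assert(tooLargeAfter(16, 8, 0, 150, 40), "new check rejects count 8");

// Sizes from the Bubblesort debug output quoted earlier in the thread.
static_assert(unrolledSize(40, 8) == 320, "40 * 8");
static_assert(unrolledSize(49, 8) == 392, "49 * 8");
```

With these example numbers, the 16-instruction loop at count 8 (size 128) passes the old check against 150 but is caught by the new one, so control enters the branch that decides whether and how to unroll partially instead of accepting the count of 8 outright.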
>> Modified: llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
>> URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll?rev=207940&r1=207939&r2=207940&view=diff
>> ==============================================================================
>> --- llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
>> (original)
>> +++ llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll Sun May
>> 4 14:12:38 2014
>> @@ -76,5 +76,52 @@ for.end:
>> ret void
>> }
>> 
>> +define zeroext i16 @test1(i16* nocapture readonly %arr, i32 %n)
>> #0 {
>> +entry:
>> + %cmp25 = icmp eq i32 %n, 0
>> + br i1 %cmp25, label %for.end, label %for.body
>> +
>> +for.body: ; preds =
>> %entry, %for.body
>> + %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0,
>> %entry ]
>> + %reduction.026 = phi i16 [ %add14, %for.body ], [ 0, %entry ]
>> + %arrayidx = getelementptr inbounds i16* %arr, i64 %indvars.iv
>> + %0 = load i16* %arrayidx, align 2
>> + %add = add i16 %0, %reduction.026
>> + %sext = mul i64 %indvars.iv, 12884901888
>> + %idxprom3 = ashr exact i64 %sext, 32
>> + %arrayidx4 = getelementptr inbounds i16* %arr, i64 %idxprom3
>> + %1 = load i16* %arrayidx4, align 2
>> + %add7 = add i16 %add, %1
>> + %sext28 = mul i64 %indvars.iv, 21474836480
>> + %idxprom10 = ashr exact i64 %sext28, 32
>> + %arrayidx11 = getelementptr inbounds i16* %arr, i64 %idxprom10
>> + %2 = load i16* %arrayidx11, align 2
>> + %add14 = add i16 %add7, %2
>> + %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>> + %lftr.wideiv = trunc i64 %indvars.iv.next to i32
>> + %exitcond = icmp eq i32 %lftr.wideiv, %n
>> + br i1 %exitcond, label %for.end, label %for.body
>> +
>> +for.end: ; preds =
>> %for.body, %entry
>> + %reduction.0.lcssa = phi i16 [ 0, %entry ], [ %add14,
>> %for.body ]
>> + ret i16 %reduction.0.lcssa
>> +
>> +; This loop is too large to be partially unrolled (size=16)
>> +
>> +; CHECK-LABEL: @test1
>> +; CHECK: br
>> +; CHECK: br
>> +; CHECK: br
>> +; CHECK: br
>> +; CHECK-NOT: br
>> +
>> +; CHECK-NOUNRL-LABEL: @test1
>> +; CHECK-NOUNRL: br
>> +; CHECK-NOUNRL: br
>> +; CHECK-NOUNRL: br
>> +; CHECK-NOUNRL: br
>> +; CHECK-NOUNRL-NOT: br
>> +}
>> +
>> attributes #0 = { nounwind uwtable }
>> 
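The IR of @test1 may be easier to follow next to a source-level equivalent. The C++ below is a reconstruction from the IR, not the file the test was generated from. The name test1 and the parameter types follow the IR signature. The multipliers 3 and 5 are read off the constants 12884901888 (3 * 2^32) and 21474836480 (5 * 2^32), which together with the "ashr exact ..., 32" amount to sign-extending a 32-bit i*3 and i*5. The null-pointer assert is an addition for the sketch only.

```cpp
#include <cassert>
#include <cstdint>

// Reconstructed equivalent of @test1: a 16-bit reduction over
// arr[i], arr[i*3] and arr[i*5] for i in [0, n).
std::uint16_t test1(const std::uint16_t *arr, std::uint32_t n) {
  assert(arr != nullptr || n == 0); // sketch-only precondition
  std::uint16_t reduction = 0;
  for (std::uint32_t i = 0; i != n; ++i) {
    // The IR forms these indices in 32 bits and sign-extends them to 64.
    std::int32_t i3 = static_cast<std::int32_t>(i * 3u);
    std::int32_t i5 = static_cast<std::int32_t>(i * 5u);
    // i16 adds wrap modulo 2^16, hence the narrowing cast.
    reduction = static_cast<std::uint16_t>(reduction + arr[i] + arr[i3] +
                                           arr[i5]);
  }
  return reduction;
}
```

The caller has to supply an array that is valid up to index 5*(n-1), since the loop reads well past arr[n-1].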
>> 
>> 
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> 
>> 
>> 
>> 
>> 
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>> 
>> 
>> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
