[llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit unrolling.

Wed May 7 16:08:51 PDT 2014

Per IRC, here is the ll file of the un-unrolled Bubblesort. Experimentally, on Haswell, I see us return to approximately our original performance at a micro-op threshold of 44 (dropping from 44 to 43 costs around 15-20%). This testing was all done on my desktop, which is not necessarily quiescent, and the test is somewhat short, so I am not sure whether there is a bit of a tail beyond that accounting for the last couple of percentage points.
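
As a side note on reading the timings in this thread: the runs are only 10-30 ms long and `time` reports whole milliseconds, so a single tick is already several percent. A minimal sketch of the arithmetic, assuming integer milliseconds; the helper name slowdownPercent is made up for illustration and is not part of LLVM or the test suite:

```cpp
// Hypothetical helper (not LLVM code): percentage slowdown of a regressed
// run relative to a baseline run, truncated toward zero.
// A result of 100 means the regressed run took twice as long.
// baseline_ms must be non-zero.
constexpr int slowdownPercent(int baseline_ms, int regressed_ms) {
  return (regressed_ms - baseline_ms) * 100 / baseline_ms;
}
```

Fed the user times quoted below, slowdownPercent(11, 27) is 145 for the regressed runs, and slowdownPercent(11, 14) is 27 for the r207940 run at threshold 28 with the branch limit raised.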

Numbers were obtained as follows: 
../bad/bin/opt  -O3 -x86-partial-unrolling-threshold=43  -x86-partial-max-branches=40000 -S < Bubblesort.ll >> BS43.ll
../bad/bin/opt  -O3 -x86-partial-unrolling-threshold=44  -x86-partial-max-branches=40000 -S < Bubblesort.ll >> BS44.ll
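
The two invocations above differ only in -x86-partial-unrolling-threshold. To show how a one-unit change can flip the unroller's decision, here is a simplified model of the size check from the r207940 diff quoted below. The parameter names follow that diff; the function exceedsUnrollBudget and the loop size used in the example are assumptions for illustration, not LLVM code and not measured from Bubblesort.ll:

```cpp
#include <cstdint>

// Simplified model (not LLVM code) of the condition r207940 installs in
// LoopUnroll::runOnLoop. Returns true when unrolling by `Count` is over
// budget, i.e. when the pass takes the "reduce the count or give up" path.
//   LoopSize         - estimated size of one loop iteration
//   Count            - requested unroll count
//   TripCount        - known trip count, or 0 if unknown
//   Threshold        - budget for complete unrolling
//   PartialThreshold - budget for partial unrolling
constexpr bool exceedsUnrollBudget(uint64_t LoopSize, unsigned Count,
                                   unsigned TripCount, unsigned Threshold,
                                   unsigned PartialThreshold) {
  return TripCount != 1 &&
         (LoopSize * Count > Threshold ||
          // Count != TripCount means this would be a partial unroll.
          (Count != TripCount && LoopSize * Count > PartialThreshold));
}
```

For a hypothetical loop of size 22 with an unknown trip count (TripCount == 0) and a requested count of 2, exceedsUnrollBudget(22, 2, 0, 150, 43) is true while exceedsUnrollBudget(22, 2, 0, 150, 44) is false. The 150 is taken from the "150 instructions" figure in the commit log.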

Louis

On May 6, 2014, at 10:03 AM, Louis Gerbarg <lgg at apple.com> wrote:

> 
> On May 6, 2014, at 9:41 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
>> ----- Original Message -----
>>> From: "Louis Gerbarg" <lgg at apple.com>
>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Benjamin Kramer" <benny.kra at gmail.com>
>>> Sent: Tuesday, May 6, 2014 11:27:44 AM
>>> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit
>>> unrolling.
>>> 
>>> 
>>> 
>>> 
>>> On May 6, 2014, at 12:26 AM, Hal Finkel < hfinkel at anl.gov > wrote:
>>> 
>>> 
>>> ----- Original Message -----
>>> 
>>> 
>>> From: "Louis Gerbarg" < lgg at apple.com >
>>> To: "Benjamin Kramer" < benny.kra at gmail.com >
>>> Cc: "llvm-commits" < llvm-commits at cs.uiuc.edu >
>>> Sent: Monday, May 5, 2014 9:21:32 PM
>>> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial
>>> unrolling, use the PartialThreshold to limit
>>> unrolling.
>>> 
>>> I am seeing what appears to be significant regressions on some of the
>>> nightly tests from this patch. For example the following benches all
>>> show 50-100% slowdowns when I apply the patch:
>>> 
>>> SingleSource/Benchmarks/Stanford/Bubblesort
>>> SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog
>>> MultiSource/Benchmarks/FreeBench/neural/neural
>>> 
>>> To be clear, on what system is that?
>>> 
>>> 
>>> I’ve confirmed the regressions personally on Haswell. Looking at our
>>> buildbots it also appears to be occurring on Sandybridge and Penryn
>>> (though with some variation, the bubblesort regression is ~100% on
>>> Haswell, it is 120% on Penryn).
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Generally speaking, if you use -mllvm
>>> -x86-partial-unrolling-threshold=N, does increasing the value of N
>>> improve things for you? The current values for the unrolling were
>>> taken from the optimization manual (I'm guessing the value is
>>> something like 28 for your system; see
>>> X86TTI::getUnrollingPreferences in
>>> lib/Target/X86/X86TargetTransformInfo.cpp), but were not actually
>>> tuned experimentally. Perhaps this could use some improvement.
>>> 
>>> 
>>> 
>>> 
>>> No. In fact, the regressed numbers are all stable and appear to occur
>>> at any value of -x86-partial-unrolling-threshold. I also see the
>>> regressions even without r207940 if I pass
>>> -x86-partial-unrolling-threshold. Some quick and dirty numbers from
>>> my Haswell system:
>> 
>> That's interesting. Thanks for helping with this! My hypothesis is that either:
>> 1. None of those are large enough (try 60000 and see if it does anything).
>> 2. You're hitting the branch cutoff: try setting -x86-partial-max-branches=40000 (or something else really large) and then try again.
> 
> #2 appears to be the main issue, though an unrolling threshold of 28 also appears low enough to cause some regressions once the branch cutoff is taken care of, at least when r207940 is in the mix.
> 
> Louis
> 
> WITHOUT r207940 and with just larger -x86-partial-unrolling-thresholds:
> bash-3.2$  ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=0  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.032s
> user	0m0.028s
> sys	0m0.001s
> bash-3.2$  ../good/bin/clang -O3   Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.014s
> user	0m0.011s
> sys	0m0.001s
> bash-3.2$  ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=60000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.030s
> user	0m0.027s
> sys	0m0.001s
> bash-3.2$  ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=600000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.031s
> user	0m0.027s
> sys	0m0.001s
> 
> 
> WITHOUT r207940 and with -x86-partial-max-branches:
> bash-3.2$  ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=28 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.015s
> user	0m0.011s
> sys	0m0.001s
> bash-3.2$  ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=0 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.014s
> user	0m0.011s
> sys	0m0.001s
> 
> 
> WITH r207940 and with -x86-partial-max-branches:
> bash-3.2$  ../bad/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=0 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.030s
> user	0m0.027s
> sys	0m0.001s
> bash-3.2$  ../bad/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=28 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.017s
> user	0m0.014s
> sys	0m0.001s
> bash-3.2$  ../bad/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=28 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.016s
> user	0m0.013s
> sys	0m0.001s
> bash-3.2$  ../bad/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=40 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.017s
> user	0m0.013s
> sys	0m0.001s
> bash-3.2$  ../bad/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=60 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.015s
> user	0m0.011s
> sys	0m0.001s
> bash-3.2$  ../bad/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=60 -mllvm -x86-partial-max-branches=40000  Bubblesort.c && time ./a.out > /dev/null
> 
> real	0m0.014s
> user	0m0.011s
> sys	0m0.001s
> 
>>> 
>>> 
>>> BASELINE:
>>> 
>>> bash-3.2$ ../good/bin/clang -O3 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.021s
>>> user 0m0.011s
>>> sys 0m0.001s
>>> 
>>> 
>>> WITHOUT r207940:
>>> bash-3.2$ ../good/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=0 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.032s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../good/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=18 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../good/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=28 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../good/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=40 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../good/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=60 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../good/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=600 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> 
>>> 
>>> WITH r207940:
>>> bash-3.2$ ../bad/bin/clang -O3 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=0 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=18 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=28 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=40 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=60 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.031s
>>> user 0m0.028s
>>> sys 0m0.001s
>>> bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=600 Bubblesort.c && time ./a.out > /dev/null
>>> 
>>> real 0m0.030s
>>> user 0m0.027s
>>> sys 0m0.001s
>>> bash-3.2$
>>> 
>>> 
>>> Louis
>>> 
>>> 
>>> -Hal
>>> 
>>> 
>>> 
>>> 
>>> Louis
>>> 
>>> On May 5, 2014, at 3:09 AM, Benjamin Kramer < benny.kra at gmail.com >
>>> wrote:
>>> 
>>> 
>>> 
>>> 
>>> On 05.05.2014, at 01:18, Nadav Rotem < nrotem at apple.com > wrote:
>>> 
>>> 
>>> 
>>> Hi Ben,
>>> 
>>> Thanks for working on this. Overall it sounds like a good change
>>> and unrolling 8 times sounds way too high, even for small loops.
>>> Did you get a chance to measure the performance difference of
>>> this patch?
>>> 
>>> I didn't find any significant runtime change in the test suite or
>>> when trying some of the synthetic benchmarks that were showing
>>> extreme unrolling. Code size is a bit better though.
>>> 
>>> I initially observed this behavior when looking into the vectorizer
>>> ( http://llvm.org/bugs/show_bug.cgi?id=14985 ) For the trivial
>>> loop in the test case we used to unroll 2x in the loop vectorizer
>>> (that's a good thing) and then up to another 8x in the loop
>>> unroller, when we're targeting core2 or higher. I asked Hal and he
>>> agreed that we were unrolling too much.
>>> 
>>> I guess it makes sense to actually use the threshold derived from
>>> the processor manuals to drive unrolling instead of assuming that
>>> more unrolling is better :)
>>> 
>>> - Ben
>>> 
>>> 
>>> 
>>> 
>>> Thanks,
>>> Nadav
>>> 
>>> 
>>> On May 4, 2014, at 12:12 PM, Benjamin Kramer
>>> < benny.kra at googlemail.com > wrote:
>>> 
>>> 
>>> 
>>> Author: d0k
>>> Date: Sun May 4 14:12:38 2014
>>> New Revision: 207940
>>> 
>>> URL: http://llvm.org/viewvc/llvm-project?rev=207940&view=rev
>>> Log:
>>> LoopUnroll: If we're doing partial unrolling, use the
>>> PartialThreshold to limit unrolling.
>>> 
>>> Otherwise we use the same threshold as for complete unrolling,
>>> which is
>>> way too high. This made us unroll any loop smaller than 150
>>> instructions
>>> by 8 times, but only if someone specified -march=core2 or better,
>>> which happens to be the default on darwin.
>>> 
>>> Modified:
>>> llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
>>> llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
>>> 
>>> Modified: llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp?rev=207940&r1=207939&r2=207940&view=diff
>>> ==============================================================================
>>> --- llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp (original)
>>> +++ llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp Sun May 4 14:12:38 2014
>>> @@ -238,9 +238,12 @@ bool LoopUnroll::runOnLoop(Loop *L, LPPa
>>> return false;
>>> }
>>> uint64_t Size = (uint64_t)LoopSize*Count;
>>> - if (TripCount != 1 && Size > Threshold) {
>>> - DEBUG(dbgs() << " Too large to fully unroll with count: " << Count
>>> - << " because size: " << Size << ">" << Threshold << "\n");
>>> + if (TripCount != 1 &&
>>> + (Size > Threshold || (Count != TripCount && Size > PartialThreshold))) {
>>> + if (Size > Threshold)
>>> + DEBUG(dbgs() << " Too large to fully unroll with count: " << Count
>>> + << " because size: " << Size << ">" << Threshold << "\n");
>>> +
>>> bool AllowPartial = UserAllowPartial ? CurrentAllowPartial : UP.Partial;
>>> if (!AllowPartial && !(Runtime && TripCount == 0)) {
>>> DEBUG(dbgs() << " will not try to unroll partially because "
>>> 
>>> Modified: llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll?rev=207940&r1=207939&r2=207940&view=diff
>>> ==============================================================================
>>> --- llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll (original)
>>> +++ llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll Sun May 4 14:12:38 2014
>>> @@ -76,5 +76,52 @@ for.end:
>>> ret void
>>> }
>>> 
>>> +define zeroext i16 @test1(i16* nocapture readonly %arr, i32 %n) #0 {
>>> +entry:
>>> + %cmp25 = icmp eq i32 %n, 0
>>> + br i1 %cmp25, label %for.end, label %for.body
>>> +
>>> +for.body: ; preds = %entry, %for.body
>>> + %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
>>> + %reduction.026 = phi i16 [ %add14, %for.body ], [ 0, %entry ]
>>> + %arrayidx = getelementptr inbounds i16* %arr, i64 %indvars.iv
>>> + %0 = load i16* %arrayidx, align 2
>>> + %add = add i16 %0, %reduction.026
>>> + %sext = mul i64 %indvars.iv, 12884901888
>>> + %idxprom3 = ashr exact i64 %sext, 32
>>> + %arrayidx4 = getelementptr inbounds i16* %arr, i64 %idxprom3
>>> + %1 = load i16* %arrayidx4, align 2
>>> + %add7 = add i16 %add, %1
>>> + %sext28 = mul i64 %indvars.iv, 21474836480
>>> + %idxprom10 = ashr exact i64 %sext28, 32
>>> + %arrayidx11 = getelementptr inbounds i16* %arr, i64 %idxprom10
>>> + %2 = load i16* %arrayidx11, align 2
>>> + %add14 = add i16 %add7, %2
>>> + %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>> + %lftr.wideiv = trunc i64 %indvars.iv.next to i32
>>> + %exitcond = icmp eq i32 %lftr.wideiv, %n
>>> + br i1 %exitcond, label %for.end, label %for.body
>>> +
>>> +for.end: ; preds = %for.body, %entry
>>> + %reduction.0.lcssa = phi i16 [ 0, %entry ], [ %add14, %for.body ]
>>> + ret i16 %reduction.0.lcssa
>>> +
>>> +; This loop is too large to be partially unrolled (size=16)
>>> +
>>> +; CHECK-LABEL: @test1
>>> +; CHECK: br
>>> +; CHECK: br
>>> +; CHECK: br
>>> +; CHECK: br
>>> +; CHECK-NOT: br
>>> +
>>> +; CHECK-NOUNRL-LABEL: @test1
>>> +; CHECK-NOUNRL: br
>>> +; CHECK-NOUNRL: br
>>> +; CHECK-NOUNRL: br
>>> +; CHECK-NOUNRL: br
>>> +; CHECK-NOUNRL-NOT: br
>>> +}
>>> +
>>> attributes #0 = { nounwind uwtable }
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>> 
>> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Bubblesort.ll
Type: application/octet-stream
Size: 9982 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140507/28943751/attachment.obj>