[llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit unrolling.

Tue May 6 09:27:44 PDT 2014

On May 6, 2014, at 12:26 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Louis Gerbarg" <lgg at apple.com>
>> To: "Benjamin Kramer" <benny.kra at gmail.com>
>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>
>> Sent: Monday, May 5, 2014 9:21:32 PM
>> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit unrolling.
>> 
>> I am seeing what appear to be significant regressions on some of the
>> nightly tests from this patch. For example, the following benchmarks all
>> show 50-100% slowdowns when I apply the patch:
>> 
>> SingleSource/Benchmarks/Stanford/Bubblesort
>> SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog
>> MultiSource/Benchmarks/FreeBench/neural/neural
> 
> To be clear, on what system is that?

I’ve confirmed the regressions personally on Haswell. Looking at our buildbots, they also appear to occur on Sandy Bridge and Penryn, though with some variation: the Bubblesort regression is ~100% on Haswell and 120% on Penryn.

> Generally speaking, if you use -mllvm -x86-partial-unrolling-threshold=N, does increasing the value of N improve things for you? The current unrolling values were taken from the optimization manual (I'm guessing it is some value like 28 for your system; see X86TTI::getUnrollingPreferences in lib/Target/X86/X86TargetTransformInfo.cpp), but were not actually tuned experimentally. Perhaps this could use some improvement.
> 

No. In fact, the regressed numbers are all stable and appear to occur at any value of -x86-partial-unrolling-threshold. I also see the regressions even without r207940 if I pass -x86-partial-unrolling-threshold. Some quick and dirty numbers from my Haswell system:

BASELINE:
bash-3.2$ ../good/bin/clang -O3  Bubblesort.c && time ./a.out > /dev/null

real	0m0.021s
user	0m0.011s
sys	0m0.001s

WITHOUT r207940:
bash-3.2$ ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=0  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.032s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=18  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=28  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=40  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=60  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s

bash-3.2$ ../good/bin/clang -O3  -mllvm -x86-partial-unrolling-threshold=600  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s

WITH r207940:
bash-3.2$ ../bad/bin/clang -O3  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=0  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=18  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=28  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=40  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=60  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.031s
user	0m0.028s
sys	0m0.001s
bash-3.2$ ../bad/bin/clang -O3 -mllvm -x86-partial-unrolling-threshold=600  Bubblesort.c && time ./a.out > /dev/null 

real	0m0.030s
user	0m0.027s
sys	0m0.001s
bash-3.2$ 
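
For reference, here is a minimal sketch of the size check that r207940 adds, pulled out into a standalone predicate so it is easy to experiment with. The function names are hypothetical (in LLVM the check sits inline in LoopUnroll::runOnLoop), and the 150 and 28 used in the note below are just the numbers mentioned in this thread, not values verified against a particular CPU model.

```cpp
#include <cstdint>

// Hypothetical helper: estimated size of the loop body after unrolling
// Count times, widened to 64 bits as in the patch.
constexpr uint64_t unrolledSize(unsigned LoopSize, unsigned Count) {
  return static_cast<uint64_t>(LoopSize) * Count;
}

// Hypothetical helper mirroring the condition from the r207940 diff.
// Returns true when the requested Count is over budget:
//   - over Threshold in any case, or
//   - over PartialThreshold when Count != TripCount (a partial unroll).
// Assumption: TripCount == 0 stands for an unknown trip count.
constexpr bool exceedsUnrollBudget(unsigned TripCount, unsigned Count,
                                   unsigned LoopSize, unsigned Threshold,
                                   unsigned PartialThreshold) {
  return TripCount != 1 &&
         (unrolledSize(LoopSize, Count) > Threshold ||
          (Count != TripCount &&
           unrolledSize(LoopSize, Count) > PartialThreshold));
}
```

With Threshold=150 and PartialThreshold=28, a 16-instruction loop with an unknown trip count and Count=8 has Size=128. The old check (Size > Threshold alone) lets that through, while exceedsUnrollBudget(0, 8, 16, 150, 28) is true, so the pass takes the branch shown in the diff instead of unrolling by 8 outright.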

Louis

> -Hal
> 
>> 
>> Louis
>> 
>> On May 5, 2014, at 3:09 AM, Benjamin Kramer <benny.kra at gmail.com>
>> wrote:
>> 
>>> 
>>> On 05.05.2014, at 01:18, Nadav Rotem <nrotem at apple.com> wrote:
>>> 
>>>> Hi Ben,
>>>> 
>>>> Thanks for working on this. Overall it sounds like a good change
>>>> and unrolling 8 times sounds way too high, even for small loops.
>>>> Did you get a chance to measure the performance difference of
>>>> this patch?
>>> 
>>> I didn't find any significant runtime change in the test suite or
>>> when trying some of the synthetic benchmarks that were showing
>>> extreme unrolling. Code size is a bit better though.
>>> 
>>> I initially observed this behavior when looking into the vectorizer
>>> ( http://llvm.org/bugs/show_bug.cgi?id=14985 ). For the trivial
>>> loop in the test case we used to unroll 2x in the loop vectorizer
>>> (that's a good thing) and then up to another 8x in the loop
>>> unroller, when we're targeting core2 or higher. I asked Hal and he
>>> agreed that we were unrolling too much.
>>> 
>>> I guess it makes sense to actually use the threshold derived from
>>> the processor manuals to drive unrolling instead of assuming that
>>> more unrolling is better :)
>>> 
>>> - Ben
>>> 
>>>> 
>>>> Thanks,
>>>> Nadav
>>>> 
>>>> 
>>>> On May 4, 2014, at 12:12 PM, Benjamin Kramer
>>>> <benny.kra at googlemail.com> wrote:
>>>> 
>>>>> Author: d0k
>>>>> Date: Sun May  4 14:12:38 2014
>>>>> New Revision: 207940
>>>>> 
>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=207940&view=rev
>>>>> Log:
>>>>> LoopUnroll: If we're doing partial unrolling, use the
>>>>> PartialThreshold to limit unrolling.
>>>>> 
>>>>> Otherwise we use the same threshold as for complete unrolling, which is
>>>>> way too high. This made us unroll any loop smaller than 150 instructions
>>>>> by 8 times, but only if someone specified -march=core2 or better,
>>>>> which happens to be the default on darwin.
>>>>> 
>>>>> Modified:
>>>>> llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
>>>>> llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
>>>>> 
>>>>> Modified: llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
>>>>> URL:
>>>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp?rev=207940&r1=207939&r2=207940&view=diff
>>>>> ==============================================================================
>>>>> --- llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp (original)
>>>>> +++ llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp Sun May  4 14:12:38 2014
>>>>> @@ -238,9 +238,12 @@ bool LoopUnroll::runOnLoop(Loop *L, LPPa
>>>>>    return false;
>>>>>  }
>>>>>  uint64_t Size = (uint64_t)LoopSize*Count;
>>>>> -    if (TripCount != 1 && Size > Threshold) {
>>>>> -      DEBUG(dbgs() << "  Too large to fully unroll with count: " << Count
>>>>> -            << " because size: " << Size << ">" << Threshold << "\n");
>>>>> +    if (TripCount != 1 &&
>>>>> +        (Size > Threshold || (Count != TripCount && Size > PartialThreshold))) {
>>>>> +      if (Size > Threshold)
>>>>> +        DEBUG(dbgs() << "  Too large to fully unroll with count: " << Count
>>>>> +                     << " because size: " << Size << ">" << Threshold << "\n");
>>>>> +
>>>>>    bool AllowPartial = UserAllowPartial ? CurrentAllowPartial : UP.Partial;
>>>>>    if (!AllowPartial && !(Runtime && TripCount == 0)) {
>>>>>      DEBUG(dbgs() << "  will not try to unroll partially because "
>>>>> 
>>>>> Modified: llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
>>>>> URL:
>>>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll?rev=207940&r1=207939&r2=207940&view=diff
>>>>> ==============================================================================
>>>>> --- llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll (original)
>>>>> +++ llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll Sun May  4 14:12:38 2014
>>>>> @@ -76,5 +76,52 @@ for.end:
>>>>> ret void
>>>>> }
>>>>> 
>>>>> +define zeroext i16 @test1(i16* nocapture readonly %arr, i32 %n) #0 {
>>>>> +entry:
>>>>> +  %cmp25 = icmp eq i32 %n, 0
>>>>> +  br i1 %cmp25, label %for.end, label %for.body
>>>>> +
>>>>> +for.body:                                         ; preds = %entry, %for.body
>>>>> +  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
>>>>> +  %reduction.026 = phi i16 [ %add14, %for.body ], [ 0, %entry ]
>>>>> +  %arrayidx = getelementptr inbounds i16* %arr, i64 %indvars.iv
>>>>> +  %0 = load i16* %arrayidx, align 2
>>>>> +  %add = add i16 %0, %reduction.026
>>>>> +  %sext = mul i64 %indvars.iv, 12884901888
>>>>> +  %idxprom3 = ashr exact i64 %sext, 32
>>>>> +  %arrayidx4 = getelementptr inbounds i16* %arr, i64 %idxprom3
>>>>> +  %1 = load i16* %arrayidx4, align 2
>>>>> +  %add7 = add i16 %add, %1
>>>>> +  %sext28 = mul i64 %indvars.iv, 21474836480
>>>>> +  %idxprom10 = ashr exact i64 %sext28, 32
>>>>> +  %arrayidx11 = getelementptr inbounds i16* %arr, i64 %idxprom10
>>>>> +  %2 = load i16* %arrayidx11, align 2
>>>>> +  %add14 = add i16 %add7, %2
>>>>> +  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>>>> +  %lftr.wideiv = trunc i64 %indvars.iv.next to i32
>>>>> +  %exitcond = icmp eq i32 %lftr.wideiv, %n
>>>>> +  br i1 %exitcond, label %for.end, label %for.body
>>>>> +
>>>>> +for.end:                                          ; preds = %for.body, %entry
>>>>> +  %reduction.0.lcssa = phi i16 [ 0, %entry ], [ %add14, %for.body ]
>>>>> +  ret i16 %reduction.0.lcssa
>>>>> +
>>>>> +; This loop is too large to be partially unrolled (size=16)
>>>>> +
>>>>> +; CHECK-LABEL: @test1
>>>>> +; CHECK: br
>>>>> +; CHECK: br
>>>>> +; CHECK: br
>>>>> +; CHECK: br
>>>>> +; CHECK-NOT: br
>>>>> +
>>>>> +; CHECK-NOUNRL-LABEL: @test1
>>>>> +; CHECK-NOUNRL: br
>>>>> +; CHECK-NOUNRL: br
>>>>> +; CHECK-NOUNRL: br
>>>>> +; CHECK-NOUNRL: br
>>>>> +; CHECK-NOUNRL-NOT: br
>>>>> +}
>>>>> +
>>>>> attributes #0 = { nounwind uwtable }
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>> 
>>> 
>>> 
>> 
>> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
