[llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit unrolling.

Tue May 6 09:41:12 PDT 2014

----- Original Message -----
> From: "Louis Gerbarg" <lgg at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Benjamin Kramer" <benny.kra at gmail.com>
> Sent: Tuesday, May 6, 2014 11:27:44 AM
> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial unrolling, use the PartialThreshold to limit
> unrolling.
> 
> 
> 
> 
> On May 6, 2014, at 12:26 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> 
> ----- Original Message -----
> 
> 
> From: "Louis Gerbarg" < lgg at apple.com >
> To: "Benjamin Kramer" < benny.kra at gmail.com >
> Cc: "llvm-commits" < llvm-commits at cs.uiuc.edu >
> Sent: Monday, May 5, 2014 9:21:32 PM
> Subject: Re: [llvm] r207940 - LoopUnroll: If we're doing partial
> unrolling, use the PartialThreshold to limit
> unrolling.
> 
> I am seeing what appear to be significant regressions on some of the
> nightly tests from this patch. For example, the following benchmarks
> all show 50-100% slowdowns when I apply the patch:
> 
> SingleSource/Benchmarks/Stanford/Bubblesort
> SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog/dynprog
> MultiSource/Benchmarks/FreeBench/neural/neural
> 
> To be clear, on what system is that?
> 
> 
> I’ve confirmed the regressions personally on Haswell. Looking at our
> buildbots, they also appear to occur on Sandybridge and Penryn,
> though with some variation: the Bubblesort regression is ~100% on
> Haswell and 120% on Penryn.
> 
> 
> 
> 
> 
> Generally speaking, if you use -mllvm
> -x86-partial-unrolling-threshold=N as you increase the value of N
> does that improve things for you? The current values for the
> unrolling were taken from the optimization manual (I'm guessing
> the value is something like 28 for your system; see
> X86TTI::getUnrollingPreferences in
> lib/Target/X86/X86TargetTransformInfo.cpp), but were not actually
> tuned experimentally. Perhaps this could use some improvement.
> 
> 
> 
> 
> No. In fact, the regressed numbers are all stable and appear to occur
> at any value of -x86-partial-unrolling-threshold. I also see the
> regressions even without r207940 if I pass
> -x86-partial-unrolling-threshold. Some quick and dirty numbers from
> my Haswell system:

That's interesting. Thanks for helping with this! My hypothesis is that either:
 1. None of those threshold values is large enough (try 60000 and see if it does anything).
 2. You're hitting the branch cutoff: try setting -x86-partial-max-branches=40000 (or something else really large) and then try again.
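
For reference, here is a minimal standalone sketch of the size check that r207940 changed, modeled on the diff quoted further down. The function names tooLargeBefore and tooLargeAfter are hypothetical (the real code is inline in LoopUnroll::runOnLoop), and the numbers in the note after the block are only illustrative.

```cpp
#include <cstdint>

// Hypothetical model of the size check in LoopUnroll::runOnLoop.
// A return value of true means the requested Count is treated as too
// large, so the pass enters the block that decides about partial unrolling.

// Before r207940: only the full-unroll Threshold is consulted.
bool tooLargeBefore(uint64_t LoopSize, uint64_t Count, uint64_t TripCount,
                    uint64_t Threshold) {
  uint64_t Size = LoopSize * Count;
  return TripCount != 1 && Size > Threshold;
}

// After r207940: when the unroll is partial (Count != TripCount), the
// unrolled size is also compared against PartialThreshold.
bool tooLargeAfter(uint64_t LoopSize, uint64_t Count, uint64_t TripCount,
                   uint64_t Threshold, uint64_t PartialThreshold) {
  uint64_t Size = LoopSize * Count;
  return TripCount != 1 &&
         (Size > Threshold ||
          (Count != TripCount && Size > PartialThreshold));
}
```

Assuming a loop of size 16, a count of 8, an unknown trip count (0), a Threshold of 150 and a PartialThreshold of 28, tooLargeBefore returns false and tooLargeAfter returns true. With a PartialThreshold at or above the Threshold, the two checks agree, which is why a very large value is worth trying.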

 -Hal

> 
> 
> BASELINE:
> 
> bash-3.2$ ../good/bin/clang -O3 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.021s
> user 0m0.011s
> sys 0m0.001s
> 
> 
> WITHOUT r207940:
> bash-3.2$ ../good/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=0 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.032s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../good/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=18 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../good/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=28 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../good/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=40 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../good/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=60 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> 
> 
> 
> 
> bash-3.2$ ../good/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=600 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> 
> 
> WITH r207940:
> bash-3.2$ ../bad/bin/clang -O3 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../bad/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=0 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../bad/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=18 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../bad/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=28 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../bad/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=40 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$ ../bad/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=60 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.031s
> user 0m0.028s
> sys 0m0.001s
> bash-3.2$ ../bad/bin/clang -O3 -mllvm
> -x86-partial-unrolling-threshold=600 Bubblesort.c && time ./a.out >
> /dev/null
> 
> 
> real 0m0.030s
> user 0m0.027s
> sys 0m0.001s
> bash-3.2$
> 
> 
> Louis
> 
> 
> -Hal
> 
> 
> 
> 
> Louis
> 
> On May 5, 2014, at 3:09 AM, Benjamin Kramer < benny.kra at gmail.com >
> wrote:
> 
> 
> 
> 
> On 05.05.2014, at 01:18, Nadav Rotem < nrotem at apple.com > wrote:
> 
> 
> 
> Hi Ben,
> 
> Thanks for working on this. Overall it sounds like a good change
> and unrolling 8 times sounds way too high, even for small loops.
> Did you get a chance to measure the performance difference of
> this patch?
> 
> I didn't find any significant runtime change in the test suite or
> when trying some of the synthetic benchmarks that were showing
> extreme unrolling. Code size is a bit better though.
> 
> I initially observed this behavior when looking into the vectorizer
> ( http://llvm.org/bugs/show_bug.cgi?id=14985 ) For the trivial
> loop in the test case we used to unroll 2x in the loop vectorizer
> (that's a good thing) and then up to another 8x in the loop
> unroller, when we're targeting core2 or higher. I asked Hal and he
> agreed that we were unrolling too much.
> 
> I guess it makes sense to actually use the threshold derived from
> the processor manuals to drive unrolling instead of assuming that
> more unrolling is better :)
> 
> - Ben
> 
> 
> 
> 
> Thanks,
> Nadav
> 
> 
> On May 4, 2014, at 12:12 PM, Benjamin Kramer
> < benny.kra at googlemail.com > wrote:
> 
> 
> 
> Author: d0k
> Date: Sun May 4 14:12:38 2014
> New Revision: 207940
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=207940&view=rev
> Log:
> LoopUnroll: If we're doing partial unrolling, use the
> PartialThreshold to limit unrolling.
> 
> Otherwise we use the same threshold as for complete unrolling,
> which is
> way too high. This made us unroll any loop smaller than 150
> instructions
> by 8 times, but only if someone specified -march=core2 or better,
> which happens to be the default on darwin.
> 
> Modified:
> llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
> llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
> 
> Modified: llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp?rev=207940&r1=207939&r2=207940&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp (original)
> +++ llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp Sun May 4 14:12:38 2014
> @@ -238,9 +238,12 @@ bool LoopUnroll::runOnLoop(Loop *L, LPPa
> return false;
> }
> uint64_t Size = (uint64_t)LoopSize*Count;
> - if (TripCount != 1 && Size > Threshold) {
> - DEBUG(dbgs() << " Too large to fully unroll with count: " << Count
> - << " because size: " << Size << ">" << Threshold << "\n");
> + if (TripCount != 1 &&
> + (Size > Threshold || (Count != TripCount && Size > PartialThreshold))) {
> + if (Size > Threshold)
> + DEBUG(dbgs() << " Too large to fully unroll with count: " << Count
> + << " because size: " << Size << ">" << Threshold << "\n");
> +
> bool AllowPartial = UserAllowPartial ? CurrentAllowPartial : UP.Partial;
> if (!AllowPartial && !(Runtime && TripCount == 0)) {
> DEBUG(dbgs() << " will not try to unroll partially because "
> 
> Modified: llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll
> URL:
> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll?rev=207940&r1=207939&r2=207940&view=diff
> ==============================================================================
> --- llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll (original)
> +++ llvm/trunk/test/Transforms/LoopUnroll/X86/partial.ll Sun May 4 14:12:38 2014
> @@ -76,5 +76,52 @@ for.end:
> ret void
> }
> 
> +define zeroext i16 @test1(i16* nocapture readonly %arr, i32 %n) #0 {
> +entry:
> + %cmp25 = icmp eq i32 %n, 0
> + br i1 %cmp25, label %for.end, label %for.body
> +
> +for.body: ; preds = %entry, %for.body
> + %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
> + %reduction.026 = phi i16 [ %add14, %for.body ], [ 0, %entry ]
> + %arrayidx = getelementptr inbounds i16* %arr, i64 %indvars.iv
> + %0 = load i16* %arrayidx, align 2
> + %add = add i16 %0, %reduction.026
> + %sext = mul i64 %indvars.iv, 12884901888
> + %idxprom3 = ashr exact i64 %sext, 32
> + %arrayidx4 = getelementptr inbounds i16* %arr, i64 %idxprom3
> + %1 = load i16* %arrayidx4, align 2
> + %add7 = add i16 %add, %1
> + %sext28 = mul i64 %indvars.iv, 21474836480
> + %idxprom10 = ashr exact i64 %sext28, 32
> + %arrayidx11 = getelementptr inbounds i16* %arr, i64 %idxprom10
> + %2 = load i16* %arrayidx11, align 2
> + %add14 = add i16 %add7, %2
> + %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
> + %lftr.wideiv = trunc i64 %indvars.iv.next to i32
> + %exitcond = icmp eq i32 %lftr.wideiv, %n
> + br i1 %exitcond, label %for.end, label %for.body
> +
> +for.end: ; preds = %for.body, %entry
> + %reduction.0.lcssa = phi i16 [ 0, %entry ], [ %add14, %for.body ]
> + ret i16 %reduction.0.lcssa
> +
> +; This loop is too large to be partially unrolled (size=16)
> +
> +; CHECK-LABEL: @test1
> +; CHECK: br
> +; CHECK: br
> +; CHECK: br
> +; CHECK: br
> +; CHECK-NOT: br
> +
> +; CHECK-NOUNRL-LABEL: @test1
> +; CHECK-NOUNRL: br
> +; CHECK-NOUNRL: br
> +; CHECK-NOUNRL: br
> +; CHECK-NOUNRL: br
> +; CHECK-NOUNRL-NOT: br
> +}
> +
> attributes #0 = { nounwind uwtable }
> 
> 
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
> 
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory