[PATCH] Unrolling improvements (target indep. and for x86)

Mon Mar 3 07:40:03 PST 2014

On Sat, Mar 1, 2014 at 3:58 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Diego Novillo" <dnovillo at google.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Chandler Carruth" <chandlerc at google.com>, "Nadav Rotem"
>> <nrotem at apple.com>
>> Sent: Friday, February 28, 2014 7:01:02 AM
>> Subject: Re: [PATCH] Unrolling improvements (target indep. and for x86)
>>
>> Hal,
>>
>> I've tested the partial unrolling patch with our internal benchmarks.
>> It gives a nice ~5% improvement on one of them while keeping code
>> size
>> almost identical.  The other benchmarks are unaffected.  On SPEC
>> 2006,
>> the effects are in the noise, so all in all I would say that it is
>> safe to include.
>>
>> One thing I did not do here is to make sure the machine I was
>> building
>> on was the same flavour of x86 as the machine running the benchmark
>> (I
>> think they were, however).  I did not use any -mtune options and
>> allowed ::getUnrollingPreferences to choose its max ops factor by
>> itself.
>>
>> Question on the patch:
>>
>> +  // Scan the loop: don't unroll loops with calls, and count
>> potential branches.
>> +  // FIXME: This branch check is not quite right because we should
>> be counting
>> +  // the number of branches after unrolling, not before unrolling.
>>
>> Isn't this a matter of estimating how many back branches you removed
>> and account for the body expansion with branches that will be added
>> by
>> it?
>
> Yes; the correct way to do this is to pass the limit to the unroller; we can't really estimate it here (except for being over conservative). The docs say that the limit is on the number of "taken" branches in the buffer, and so we really want to do some tree depth calculation. I'm unsure how much that matters for such small loops.
>
>>
>> How about allowing this by default on x86?  I'd like to experiment
>> with larger unroll factors for known-to-be-hot loops.
>
> You mean allowing this for all target cpus?

Not sure if it makes sense on all other CPUs.  I've only tested on x86
and the effect has been positive (not huge, but I have not seen any
regressions).

In terms of the patches, the GEP changes are the most obscure to me.
The patch has a few whitespace problems and I'd love to see more
comments there.

On the partial unrolling patch, could you clarify in the help message
that the threshold is number of operations? It's unclear at first.

In the late unroller patch:

--- a/lib/Transforms/Scalar/LoopUnrollPass.cpp
+++ b/lib/Transforms/Scalar/LoopUnrollPass.cpp
@@ -61,6 +61,7 @@ namespace {
                          (UnrollAllowPartial.getNumOccurrences() > 0);
       UserRuntime = (R != -1) || (UnrollRuntime.getNumOccurrences() > 0);
       UserCount = (C != -1) || (UnrollCount.getNumOccurrences() > 0);
+      UserRuntime = (R != -1) || (UnrollRuntime.getNumOccurrences() > 0);

Isn't this unnecessary?  UserRuntime is set identically two lines above.

Thanks for the improvements!


Diego.