[PATCH] Unrolling improvements (target indep. and for x86)

Hal Finkel hfinkel at anl.gov
Mon Mar 3 14:54:55 PST 2014


----- Original Message -----
> From: "Diego Novillo" <dnovillo at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Chandler Carruth" <chandlerc at google.com>, "Nadav Rotem"
> <nrotem at apple.com>
> Sent: Monday, March 3, 2014 9:40:03 AM
> Subject: Re: [PATCH] Unrolling improvements (target indep. and for x86)
> 
> On Sat, Mar 1, 2014 at 3:58 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > ----- Original Message -----
> >> From: "Diego Novillo" <dnovillo at google.com>
> >> To: "Hal Finkel" <hfinkel at anl.gov>
> >> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Chandler Carruth"
> >> <chandlerc at google.com>, "Nadav Rotem"
> >> <nrotem at apple.com>
> >> Sent: Friday, February 28, 2014 7:01:02 AM
> >> Subject: Re: [PATCH] Unrolling improvements (target indep. and for
> >> x86)
> >>
> >> Hal,
> >>
> >> I've tested the partial unrolling patch with our internal benchmarks.
> >> It gives a nice ~5% improvement on one of them while keeping code size
> >> almost identical.  The other benchmarks are unaffected.  On SPEC 2006,
> >> the effects are in the noise, so all in all I would say that it is
> >> safe to include.
> >>
> >> One thing I did not do here is to make sure the machine I was building
> >> on was the same flavour of x86 as the machine running the benchmark (I
> >> think they were, however).  I did not use any -mtune options and
> >> allowed ::getUnrollingPreferences to choose its max ops factor by
> >> itself.
> >>
> >> Question on the patch:
> >>
> >> +  // Scan the loop: don't unroll loops with calls, and count potential branches.
> >> +  // FIXME: This branch check is not quite right because we should be counting
> >> +  // the number of branches after unrolling, not before unrolling.
> >>
> >> Isn't this a matter of estimating how many back branches you removed
> >> and accounting for the body expansion with branches that will be
> >> added by it?
> >
> > Yes; the correct way to do this is to pass the limit to the
> > unroller; we can't really estimate it here (except by being overly
> > conservative). The docs say that the limit is on the number of
> > "taken" branches in the buffer, and so we really want to do some
> > tree-depth calculation. I'm unsure how much that matters for such
> > small loops.
> >
> >>
> >> How about allowing this by default on x86?  I'd like to experiment
> >> with larger unroll factors for known-to-be-hot loops.
> >
> > You mean allowing this for all target cpus?
> 
> Not sure if it makes sense on all other CPUs.  I've only tested on x86
> and the effect has been positive (not huge, but I have not seen any
> regressions).

I meant for all x86 CPUs ;) Currently, the logic tries to enable this only for CPUs known to have some kind of loop uop buffer. Do you mean enabling it for all x86 CPUs?

> 
> In terms of the patches, the GEP changes are the most obscure to me.
> The patch has a few whitespace problems and I'd love to see more
> comments there.

The core problem is demonstrated by the test case:

+  %.sum9 = or i64 %index, 2
+  %2 = getelementptr double* %b, i64 %.sum9
+  %3 = bitcast double* %2 to <2 x double>*
+  %wide.load8 = load <2 x double>* %3, align 8

The question is: how many operations are contributed by the 'or' and the 'getelementptr' instructions? The previous heuristic says that GEPs with all-constant indices are free (they will be folded into their uses), and that all others have a cost. That is not good enough to correctly estimate the number of uops for the loop in the test case on x86: here, both the 'or' (which is really a canonicalized add) and the GEP are folded into the load's addressing mode. Whether this folding happens can be accurately determined by querying TTI for the legality of the corresponding addressing mode, and that's what the patch does. Unfortunately, this gets somewhat complicated because recognizing adds is non-trivial (sometimes they're canonicalized as 'or', sometimes as 'xor', and sometimes as 'add'). The code that does so, as noted in the comment, is a fairly straightforward adaptation of the logic in ScalarEvolution that recognizes these various forms of 'add'.
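
To make the TTI query concrete, here is a rough sketch of the kind of check involved (my paraphrase, not the code in the patch; the helper name gepFoldsIntoUse is made up, and the isLegalAddressingMode signature shown may not match the tree exactly):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: a GEP whose index is 'IV + ConstOffset' (even when the add has been
// canonicalized as 'or' or 'xor') contributes no uops if the target can fold
// base + EltSize*IV + EltSize*ConstOffset into a single addressing mode.
static bool gepFoldsIntoUse(const GetElementPtrInst *GEP, int64_t ConstOffset,
                            const DataLayout &DL,
                            const TargetTransformInfo &TTI) {
  Type *EltTy = cast<PointerType>(GEP->getType())->getElementType();
  int64_t EltSize = (int64_t) DL.getTypeAllocSize(EltTy);
  // Base register + scaled index register + constant displacement.
  return TTI.isLegalAddressingMode(EltTy, /*BaseGV=*/nullptr,
                                   /*BaseOffset=*/ConstOffset * EltSize,
                                   /*HasBaseReg=*/true,
                                   /*Scale=*/EltSize);
}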

I've added some additional comments (and fixed up some of the others) which hopefully make it clearer what is going on. What do you think?

> 
> On the partial unrolling patch, could you clarify in the help message
> that the threshold is the number of operations? It's unclear at first.
> 
> In the late unroller patch:
> 
> --- a/lib/Transforms/Scalar/LoopUnrollPass.cpp
> +++ b/lib/Transforms/Scalar/LoopUnrollPass.cpp
> @@ -61,6 +61,7 @@ namespace {
>                          (UnrollAllowPartial.getNumOccurrences() > 0);
>        UserRuntime = (R != -1) || (UnrollRuntime.getNumOccurrences() > 0);
>        UserCount = (C != -1) || (UnrollCount.getNumOccurrences() > 0);
> +      UserRuntime = (R != -1) || (UnrollRuntime.getNumOccurrences() > 0);
> 
> Isn't this unnecessary?  UserRuntime is set identically two lines
> above.

Yes; this was a merging error.

Also, and I had completely forgotten about this, the unrolling patch contains a behavioral change for partial unrolling: the new behavior prefers to choose an unrolling factor that is a power of 2. If that, however, would decrease the unrolling factor by more than a factor of 2, then we revert to the original non-power-of-2 unrolling factor. My original motivation for this was to help with basic-block vectorization. I'm not sure it matters as much now (the SLP vectorizer matters more and likely benefits less from this, and the partial unrolling now takes place after vectorization). However, Chandler had made a similar change to the loop vectorizer (to prefer power-of-2 unrolling factors) because it helps with X86 addressing modes, so I wonder whether it is worth keeping this part regardless. Chandler?
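
Roughly, the adjustment looks like the following sketch (my reading of the behavior just described, not the patch text; the helper name preferPowerOf2Count is made up, and the assumption that the power-of-2 candidate must still divide the trip count, which is what can make it fall more than 2x below the original count, is mine):

// Illustrative sketch of the power-of-2 preference (my reading, not the patch
// text). Assumption (mine): the power-of-2 candidate must still divide the
// trip count evenly; that constraint is what can push it well below the
// original count and trigger the revert.
static unsigned preferPowerOf2Count(unsigned Count, unsigned TripCount) {
  unsigned P2 = 1;
  while (P2 * 2 <= Count && TripCount % (P2 * 2) == 0)
    P2 *= 2;                       // largest acceptable power-of-2 factor
  // Revert to the original count if choosing a power of 2 would shrink the
  // unroll factor by more than a factor of 2.
  return (2 * P2 >= Count) ? P2 : Count;
}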

Rebased and updated patches attached.

Thanks again,
Hal

> 
> Thanks for the improvements!
> 
> 
> Diego.
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gep-add-cost-v2.patch
Type: text/x-patch
Size: 7641 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140303/f1fde263/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86-partial-unrolling-v2.patch
Type: text/x-patch
Size: 4453 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140303/f1fde263/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: complex-unrolling-late-v2.patch
Type: text/x-patch
Size: 3822 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140303/f1fde263/attachment-0002.bin>

