[llvm-commits] [LLVMdev] [PATCH] BasicBlock Autovectorization Pass

Hal Finkel hfinkel at anl.gov
Tue Jan 24 16:41:32 PST 2012


On Tue, 2012-01-24 at 16:08 -0600, Sebastian Pop wrote:
> On Mon, Jan 23, 2012 at 10:13 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > On Tue, 2012-01-17 at 13:25 -0600, Sebastian Pop wrote:
> >> Hi,
> >>
> >> On Fri, Dec 30, 2011 at 3:09 AM, Tobias Grosser <tobias at grosser.es> wrote:
> >> > As it seems my intuition is wrong, I am very eager to see and understand
> >> > an example where a search limit of 4000 is really needed.
> >> >
> >>
> >> To get the ball rolling again, I attached a testcase that can be tuned
> >> to understand the impact on compile time for different sizes of a
> >> basic block.  One can also set the number of iterations in the loop to
> >> 1 to test the vectorizer without any enclosing loop.
> >>
> >> Hal, could you please report the compile times with/without the
> >> vectorizer for different basic block sizes?
> >
> > I've looked at your test case, and I am pleased to report a negligible
> > compile-time increase! Also, there is no vectorization of the main
> 
> Good!
> 
> > loop :) Here's why: (as you know) the main part of the loop is
> > essentially one long dependency chain, and so there is nothing to
> > vectorize there. The only vectorization opportunities come from
> > unrolling the loop. Using the default thresholds, the loop will not even
> > partially unroll (because the body is too large). As a result,
> > essentially nothing happens.
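
To illustrate why a long dependency chain defeats the pairing search: in
code like the first line below, every operation consumes the previous
result, so there are no independent operations to fuse, whereas the second
line has two independent computations that form a candidate pair (a
hypothetical C-style example, not taken from the attached testcase):

  t1 = a + b; t2 = t1 + c; t3 = t2 + d;  /* one long dependency chain      */
  u1 = a + b; u2 = c + d;                /* independent: a candidate pair  */
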
> >
> > I've prepared a reduced version of your test case (attached). Using
> > -unroll-threshold=300 (along with -unroll-allow-partial), I can make the
> > loop unroll partially (the reduced loop size is 110, so this allows
> > unrolling 2 iterations). Once this is done, the vectorizer finds
> > candidate pairs and vectorizes [as a practical matter, you need -basicaa
> > too].
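
For reference, an invocation along these lines should exercise that
combination (this assumes the posted patch registers the pass as
-bb-vectorize, so adjust the flag name if it differs, and reduced.ll
stands in for the attached reduced test case):

  opt -basicaa -loop-unroll -unroll-allow-partial -unroll-threshold=300 \
      -bb-vectorize -S reduced.ll -o reduced-vect.ll
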
> >
> > I think that even this is probably too big for a regression test. I
> > don't think that the basic structure really adds anything over existing
> > tests (although I need to make sure that alias-analysis use is otherwise
> > covered), but I'll copy-and-paste a small portion into a regression test
> > to cover the search limit logic (which is currently uncovered). We
> > should probably discuss different situations that we'd like to see
> > covered in the regression suite (perhaps post-commit).
> >
> > Thanks for working on this! I'll post an updated patch for review
> > shortly.
> 
> Thanks for the new patch.
> 
> I will send you some more comments on the patch as I advance
> through testing: I found some interesting benchmarks in which
> enabling vectorization degrades performance by 80% on ARM.
> I will prepare a reduced testcase and try to find out the reason.
> As a first guess, I would say that this comes from vectorizing
> code inside a loop and the overhead of transfers between scalar and
> vector registers.

This is good; as has been pointed out, we'll need to develop a
vectorization cost model for this kind of thing to really be successful,
and so we should start thinking about that.

The pass, as implemented, has a semi-implicit cost model which assumes
that a permutation followed by another vector operation is free, that
scalar-to-vector transfers are free, and that vectorizing a memory
operation is just as beneficial as vectorizing an arithmetic operation.
Depending on the target system, these assumptions may all be false
(although on some systems they do hold).
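
To make that concrete, an explicit, target-parameterized model might look
something like the sketch below. This is purely illustrative: none of these
names exist in the patch, and the weights would have to come from per-target
tuning.

  // Hypothetical sketch of an explicit cost model for deciding whether
  // fusing a candidate pair of scalar instructions is profitable.
  struct VectorCostModel {
    unsigned PermuteCost;        // shuffle feeding another vector op
    unsigned ScalarToVectorCost; // moving a scalar into a vector lane
    unsigned MemOpBenefit;       // merging two scalar memory ops
    unsigned ArithOpBenefit;     // merging two scalar arithmetic ops

    // True if the expected benefit of fusing outweighs the overhead.
    bool isProfitable(unsigned NumPermutes, unsigned NumScalarInserts,
                      unsigned NumMemPairs, unsigned NumArithPairs) const {
      unsigned Cost = NumPermutes * PermuteCost +
                      NumScalarInserts * ScalarToVectorCost;
      unsigned Benefit = NumMemPairs * MemOpBenefit +
                         NumArithPairs * ArithOpBenefit;
      return Benefit > Cost;
    }
  };

The current pass effectively behaves as if PermuteCost and
ScalarToVectorCost were zero and MemOpBenefit equaled ArithOpBenefit,
which is exactly the set of assumptions described above; on a target where
scalar-to-vector-lane transfers are expensive, that could account for the
kind of regression Sebastian describes.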

If you can generate a test case that would be great, I'd like to look at
it.

> 
> I would not like to hold you back from committing the patch just because
> of performance issues: let's address any further improvements once
> the patch has landed on ToT.

Sounds good to me.

Thanks again,
Hal

> 
> Thanks again,
> Sebastian
> --
> Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory



