[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Hal Finkel
hfinkel at anl.gov
Thu Nov 10 14:24:44 PST 2011
On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:
> On 11/08/2011 11:29 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
>
> [A lot of performance results skipped]
>
> OK. As expected part of the speedup is because of unrolling, however it
> also shows that vectorization itself improves performance. There is
> still a lot of room for improvement, but it seems to be a good start.
>
> >>>> If I understood correctly it seems your vectorizer has quadratic
> >>>> complexity which may cause large slowdowns. Do you think it may be
> >>>> useful/possible to make it linear by introducing a constant upper bound
> >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and
> >>>> most of the vectorization opportunities are close by (in some sense),
> >>>> such that we get most of the speedup by locking at a subset of the problem.
> >>>
> >>> Yes, I agree. That makes a lot of sense.
> >>>
> >>> What would be even better is if the loop unroller would intermix
> >>> statements from the loops where possible instead of leaving it to the
> >>> vectorizer to do all of the grouping after the fact. That, I fear, is a
> >>> whole other project.
> >>
> >> First, I do not understand the 'even better' here. To me it would be
> >> great if on one hand the vectorizer could be constrained, such that the
> >> compile time is predictable. And on the other hand, we could make the
> >> loop unroller create code in a way such that the constrained vectorizer
> >> still performs the relevant transformations.
> >
> > I was not clear; I meant that imposing a cut off is likely to work, but
> > it would work even better if the loop unroller produced code
> > more-conducive to vectorization.
>
> Is such a cutoff already integrated and can be enabled by default.
No, but I'll add one.
> I
> believe it would be great if we could show that the compile time
> increase is limited by a low constant (maybe 100%), while still showing
> an overall improvement in run time.
>
> >> Also, what do you mean with 'if the loop unroller would intermix
> >> statements from the loops where possible'. Are you referring to the
> >> grouped unrolling as shown in my the last mail?
> >
> > Yes.
> >
> > Also, the code necessary to take advantage of grouped unrolling already
> > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> > causes the vectorizer to stop searching for instruction pairings after
> > the first use of an instruction.
>
> Nice. I just tried one of the very basic examples from the Polly test
> suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create
> any vector instructions. I also played a little with the vectorizer
> options, but was not able to get this example vectorized. Is this
> because the chain is too short?
You are correct. It will vectorize with:
opt -O3 -vectorize -bb-vectorize-req-chain-depth=2
>
> >>> One problem with the current implementation is that it relies on
> >>> GetPointerBaseWithConstantOffset to determine if two loads or stores
> >>> share the same base address. This fails with partially-unrolled loops
> >>> because it cannot "undo" all of the additions to the offset induction
> >>> variable in order to understand that some of the loads and stores are
> >>> really adjacent in memory. This is something that I think can be
> >>> improved within the vectorizer itself, and I'm planning on doing some
> >>> work on this in the future.
> >> Here you may also want to look into ScalarEvolution. Basically two loads
> >> access adjacent memory if the difference of the scalar evolution of the
> >> two load addresses is equal to sizeof(element_type). ScalarEvolution
> >> should be a lot more general than GetPointerBaseWithConstantOffset().
> >
> > Thanks! That sounds great; I'll have to look at that.
>
> Talking about this I looked again into ScalarEvolution.
>
> To analyze a load, you would do:
>
> LoadInst *Load = ...
> Value *Pointer = Load->getPointer();
> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
> const SCEVUnknown *PointerBase =
> dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
>
> if (!PointerBase)
> return 'Analysis failed'
>
> const Value *BaseValue = PointerBase->getValue();
>
> You get the offset between two load addresses with SE->getMinusSCEV().
> The size of an element is SE->getSizeOfExpr().
This is great! Thanks!
-Hal
>
> Hope that helps
> Tobi
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
More information about the llvm-dev
mailing list