[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Hal Finkel hfinkel at anl.gov
Fri Nov 11 14:36:38 PST 2011


On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:
> On 11/08/2011 11:29 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
> 
> [A lot of performance results skipped]
> 
> OK. As expected, part of the speedup is because of unrolling; however, it 
> also shows that vectorization itself improves performance. There is 
> still a lot of room for improvement, but it seems to be a good start.
> 
> >>>> If I understood correctly, it seems your vectorizer has quadratic
> >>>> complexity, which may cause large slowdowns. Do you think it may be
> >>>> useful/possible to make it linear by introducing a constant upper bound
> >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and
> >>>> most of the vectorization opportunities are close by (in some sense),
> >>>> such that we get most of the speedup by looking at a subset of the problem.
> >>>
> >>> Yes, I agree. That makes a lot of sense.
> >>>
> >>> What would be even better is if the loop unroller would intermix
> >>> statements from the loops where possible instead of leaving it to the
> >>> vectorizer to do all of the grouping after the fact. That, I fear, is a
> >>> whole other project.
> >>
> >> First, I do not understand the 'even better' here. To me it would be
> >> great if, on the one hand, the vectorizer could be constrained such
> >> that the compile time is predictable, and, on the other hand, we could
> >> make the loop unroller create code in a way such that the constrained
> >> vectorizer still performs the relevant transformations.
> >
> > I was not clear; I meant that imposing a cutoff is likely to work, but
> > it would work even better if the loop unroller produced code
> > more conducive to vectorization.
> 
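To make concrete what "code more conducive to vectorization" might mean, here is a rough C++ sketch comparing a plain sequential unroll with one where the unroller intermixes the statements from the unrolled iterations. The loop, the unroll factor of 4, and the assumption that n is a multiple of 4 are all just for illustration; this is not what the unroller emits today.

// Hypothetical example: a simple element-wise add, unrolled by 4.
void vadd_sequential(float *a, const float *b, const float *c, int n) {
  for (int i = 0; i < n; i += 4) {
    // Plain sequential unrolling: each original iteration stays together,
    // so pairable loads/adds/stores end up far apart in the block.
    a[i + 0] = b[i + 0] + c[i + 0];
    a[i + 1] = b[i + 1] + c[i + 1];
    a[i + 2] = b[i + 2] + c[i + 2];
    a[i + 3] = b[i + 3] + c[i + 3];
  }
}

void vadd_grouped(float *a, const float *b, const float *c, int n) {
  for (int i = 0; i < n; i += 4) {
    // Grouped ("intermixed") unrolling: like operations from the four
    // unrolled iterations are emitted next to each other.
    float b0 = b[i + 0], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];
    float c0 = c[i + 0], c1 = c[i + 1], c2 = c[i + 2], c3 = c[i + 3];
    float s0 = b0 + c0, s1 = b1 + c1, s2 = b2 + c2, s3 = b3 + c3;
    a[i + 0] = s0; a[i + 1] = s1; a[i + 2] = s2; a[i + 3] = s3;
  }
}

In the grouped form the loads from b sit next to each other, as do the loads from c, the adds, and the stores, so even a vectorizer that only examines nearby instructions can find the pairs.
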
> Is such a cutoff already integrated, and can it be enabled by default? I 
> believe it would be great if we could show that the compile-time 
> increase is limited by a low constant factor (maybe 100%), while still 
> showing an overall improvement in run time.
> 
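For reference, the kind of bound being discussed would look roughly like the following. This is a sketch only; findPairsBounded, canPair, and SearchLimit are placeholder names, not the patch's actual interface.

#include <cstddef>
#include <utility>
#include <vector>

// Bounded pairing search: instead of comparing every instruction against
// every later one (quadratic), give up on a candidate after SearchLimit
// steps of the inner search.
template <typename Inst>
std::vector<std::pair<std::size_t, std::size_t> >
findPairsBounded(const std::vector<Inst> &Insts,
                 bool (*canPair)(const Inst &, const Inst &),
                 std::size_t SearchLimit = 100) {
  std::vector<std::pair<std::size_t, std::size_t> > Pairs;
  for (std::size_t I = 0, E = Insts.size(); I != E; ++I) {
    std::size_t Steps = 0;
    for (std::size_t J = I + 1; J != E && Steps < SearchLimit; ++J, ++Steps) {
      if (canPair(Insts[I], Insts[J])) {
        Pairs.push_back(std::make_pair(I, J));
        break; // take the nearest partner and move on
      }
    }
  }
  return Pairs;
}

With a fixed SearchLimit the total work is linear in the number of instructions, at the cost of missing pairs that are farther apart than the limit.
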
> >> Also, what do you mean with 'if the loop unroller would intermix
> >> statements from the loops where possible'. Are you referring to the
> >> grouped unrolling as shown in my last mail?
> >
> > Yes.
> >
> > Also, the code necessary to take advantage of grouped unrolling already
> > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> > causes the vectorizer to stop searching for instruction pairings after
> > the first use of an instruction.
> 
> Nice. I just tried one of the very basic examples from the Polly test 
> suite (simple_vec.ll). Running opt -O3 -vectorize on it does not create 
> any vector instructions. I also played a little with the vectorizer 
> options, but was not able to get this example vectorized. Is this 
> because the chain is too short?
> 
> >>> One problem with the current implementation is that it relies on
> >>> GetPointerBaseWithConstantOffset to determine if two loads or stores
> >>> share the same base address. This fails with partially-unrolled loops
> >>> because it cannot "undo" all of the additions to the offset induction
> >>> variable in order to understand that some of the loads and stores are
> >>> really adjacent in memory. This is something that I think can be
> >>> improved within the vectorizer itself, and I'm planning on doing some
> >>> work on this in the future.
> >> Here you may also want to look into ScalarEvolution. Basically, two
> >> loads access adjacent memory if the difference between the scalar
> >> evolutions of the two load addresses is equal to sizeof(element_type).
> >> ScalarEvolution should be a lot more general than
> >> GetPointerBaseWithConstantOffset().
> >
> > Thanks! That sounds great; I'll have to look at that.
> 
> Talking about this, I looked again into ScalarEvolution.
> 
> To analyze a load, you would do:
> 
> LoadInst *Load = ...
> Value *Pointer = Load->getPointerOperand();
> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
> const SCEVUnknown *PointerBase =
> 	dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
> 
> if (!PointerBase)
>    return 'Analysis failed'
> 
> const Value *BaseValue = PointerBase->getValue();
> 
> You get the offset between two load addresses with SE->getMinusSCEV(). 
> The size of an element is SE->getSizeOfExpr().
> 
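Putting those pieces together, an adjacency check along these lines might look roughly like the following. It is an untested sketch; loadsAreAdjacent is just a placeholder name, and the exact getSizeOfExpr signature should be checked against the current tree.

#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Instructions.h"

using namespace llvm;

// Returns true if LoadB reads the element immediately following the one
// read by LoadA (same base object, addresses differing by one element).
static bool loadsAreAdjacent(ScalarEvolution *SE, LoadInst *LoadA,
                             LoadInst *LoadB) {
  const SCEV *A = SE->getSCEV(LoadA->getPointerOperand());
  const SCEV *B = SE->getSCEV(LoadB->getPointerOperand());

  // Both addresses must be analyzable and share the same base value.
  const SCEVUnknown *BaseA = dyn_cast<SCEVUnknown>(SE->getPointerBase(A));
  const SCEVUnknown *BaseB = dyn_cast<SCEVUnknown>(SE->getPointerBase(B));
  if (!BaseA || !BaseB || BaseA->getValue() != BaseB->getValue())
    return false;

  // Adjacent iff (B - A) folds to sizeof(element type). SCEV expressions
  // are uniqued, so pointer equality suffices when both sides fold to the
  // same constant.
  const SCEV *Diff = SE->getMinusSCEV(B, A);
  const SCEV *ElemSize = SE->getSizeOfExpr(LoadA->getType());
  return Diff == ElemSize;
}
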

The AliasAnalysis class has a set of interfaces that can be used to
keep the analysis up to date even when the underlying IR is modified.
Does ScalarEvolution have a similar capability?

Thanks again,
Hal

> Hope that helps
> Tobi

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory



