[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Thu Nov 10 14:07:11 PST 2011

On 11/08/2011 11:29 PM, Hal Finkel wrote:
> On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
>> On 11/08/2011 03:36 PM, Hal Finkel wrote:
>>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
>>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:

[A lot of performance results skipped]

OK. As expected part of the speedup is because of unrolling, however it 
also shows that vectorization itself improves performance. There is 
still a lot of room for improvement, but it seems to be a good start.

>>>> If I understood correctly it seems your vectorizer has quadratic
>>>> complexity which may cause large slowdowns. Do you think it may be
>>>> useful/possible to make it linear by introducing a constant upper bound
>>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and
>>>> most of the vectorization opportunities are close by (in some sense),
>>>> such that we get most of the speedup by locking at a subset of the problem.
>>>
>>> Yes, I agree. That makes a lot of sense.
>>>
>>> What would be even better is if the loop unroller would intermix
>>> statements from the loops where possible instead of leaving it to the
>>> vectorizer to do all of the grouping after the fact. That, I fear, is a
>>> whole other project.
>>
>> First, I do not understand the 'even better' here. To me it would be
>> great if on one hand the vectorizer could be constrained, such that the
>> compile time is predictable. And on the other hand, we could make the
>> loop unroller create code in a way such that the constrained vectorizer
>> still performs the relevant transformations.
>
> I was not clear; I meant that imposing a cut off is likely to work, but
> it would work even better if the loop unroller produced code
> more-conducive to vectorization.

Is such a cutoff already integrated and can be enabled by default. I 
believe it would be great if we could show that the compile time 
increase is limited by a low constant (maybe 100%), while still showing 
an overall improvement in run time.

>> Also, what do you mean with 'if the loop unroller would intermix
>> statements from the loops where possible'. Are you referring to the
>> grouped unrolling as shown in my the last mail?
>
> Yes.
>
> Also, the code necessary to take advantage of grouped unrolling already
> exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> causes the vectorizer to stop searching for instruction pairings after
> the first use of an instruction.

Nice. I just tried one of the very basic examples from the Polly test 
suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create 
any vector instructions. I also played a little with the vectorizer 
options, but was not able to get this example vectorized. Is this 
because the chain is too short?

>>> One problem with the current implementation is that it relies on
>>> GetPointerBaseWithConstantOffset to determine if two loads or stores
>>> share the same base address. This fails with partially-unrolled loops
>>> because it cannot "undo" all of the additions to the offset induction
>>> variable in order to understand that some of the loads and stores are
>>> really adjacent in memory. This is something that I think can be
>>> improved within the vectorizer itself, and I'm planning on doing some
>>> work on this in the future.
>> Here you may also want to look into ScalarEvolution. Basically two loads
>> access adjacent memory if the difference of the scalar evolution of the
>> two load addresses is equal to sizeof(element_type). ScalarEvolution
>> should be a lot more general than GetPointerBaseWithConstantOffset().
>
> Thanks! That sounds great; I'll have to look at that.

Talking about this I looked again into ScalarEvolution.

To analyze a load, you would do:

LoadInst *Load = ...
Value *Pointer = Load->getPointer();
const SCEV *PointerSCEV = SE->getSCEV(Pointer);
const SCEVUnknown *PointerBase =
	dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));

if (!PointerBase)
   return 'Analysis failed'

const Value *BaseValue = PointerBase->getValue();

You get the offset between two load addresses with SE->getMinusSCEV(). 
The size of an element is SE->getSizeOfExpr().

Hope that helps
Tobi
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: simple_vec.ll
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111110/15790d03/attachment.ksh>