[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
Tobias Grosser
grosser at fim.uni-passau.de
Sat Jan 8 10:26:20 PST 2011
On 01/07/2011 12:36 AM, ether zhhb wrote:
> Hi Tobi,
>
>
>>> 2. Allow some generic parallelism information to live outside a
>>> specific autopar framework, so this information can benefit more
>>> passes in LLVM. For example, the X86 and PTX backends could use it
>>> to perform target-specific auto-vectorization.
>>
>> What other types of parallelism are you expecting? We currently support
>> thread-level parallelism (as in OpenMP) and vector-level parallelism (as
>> in LLVM-IR vectors). At least for X86 I do not see any need for
>> target-specific auto-vectorization, as LLVM-IR vectors are lowered
>> extremely well to x86 SIMD instructions. I suppose this is the same for
>> all CPU targets. I still need to look into GPU targets.
>>
> I just think the vector units in different targets may have different
> widths, so the best unroll count of a loop for vectorization is not
> known in high-level optimization passes.
I believe we can obtain this information from the target data. If it is
not yet available there, target data should be extended, since high-level
loop nest transformations also need to know the vector width, and ideally
even the number of registers, if we want to support effective register
tiling.
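To illustrate, a high-level pass could then derive the unroll factor for
vectorization from such a query. This is only a sketch; target data does
not expose a SIMD register width today, so the query would be the part
that needs to be added:

  // Minimal sketch (not existing API): derive the vectorization unroll
  // factor once target data exposes the SIMD register width.
  unsigned pickVectorFactor(unsigned VectorRegisterBits,
                            unsigned ElementBits) {
    // e.g. 128-bit SSE registers, 32-bit floats -> unroll by 4
    return VectorRegisterBits / ElementBits;
  }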
>> It has however the drawback that instead of just doing code generation
>> once after Polly, we do sequential code generation -> reparsing/analysis
>> -> parallel code generation. Furthermore, the infrastructure needs to
>> pass along all the information needed for efficient parallelisation,
>> which includes at least the access strides, the alignment and the
>> privatized variables. Recomputing this information using scalar
>> evolution might be difficult, as Polly may introduce loop induction
>> variables using e.g. ceil/floor divisions.
>
> To overcome this, we can encode the "hard to recover" information as
> metadata while generating sequential code. All the later "Polyhedral
> Parallelism Analysis" pass then needs to do is read this information
> from the metadata and reanalyze only the information that is easy to
> recover. The process becomes: sequential code generation and metadata
> annotation -> read metadata (plus some cheap reparsing/analysis) ->
> parallel code generation.
I believe this is a reasonable amount of work, though in terms of
vectorization for Polly I _currently_ see limited benefits. The main
advantage is, as Renato pointed out, that we could create a very
lightweight vectorizer by taking advantage of the existing loop passes.
In terms of OpenMP code generation, this might also be a good way.
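To make the annotation side concrete, it could look roughly like the
sketch below. The "polly.loop" metadata kind and its contents are
invented here, and the metadata APIs have changed between LLVM versions:

  // Sketch: attach the hard-to-recover facts (stride, alignment) to
  // the back-branch of a generated sequential loop. All metadata names
  // are invented for illustration.
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Metadata.h"
  #include "llvm/IR/Type.h"
  #include <cstdint>

  using namespace llvm;

  static void annotateParallelLoop(BranchInst *BackBranch,
                                   uint64_t Stride, uint64_t Alignment) {
    LLVMContext &Ctx = BackBranch->getContext();
    Type *I64 = Type::getInt64Ty(Ctx);
    Metadata *Ops[] = {
        MDString::get(Ctx, "parallel"),
        ConstantAsMetadata::get(ConstantInt::get(I64, Stride)),
        ConstantAsMetadata::get(ConstantInt::get(I64, Alignment))};
    BackBranch->setMetadata("polly.loop", MDNode::get(Ctx, Ops));
  }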
> The bigger picture is:
> 1. Define a common interface for "Parallelism Analysis" or
> "LoopDependenceAnalysis", just like AliasAnalysis.
> 2. Then we can have different implementations of it. For example, we
> may have a "SCEVParallelismAnalysis", which computes the parallelism
> information based on SCEV, and a "PolyhedralParallelismAnalysis",
> which reads the "hard to recover" information from metadata,
> recomputes the cheap information, and provides all of it via the
> common "Parallelism Analysis" interface.
> 3. The auto-vectorization and parallelization codegen passes can then
> just ask the common "Parallelism Analysis" interface for the
> necessary information.
A reasonable approach.
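As a rough sketch, such an interface could look like the one below. All
names are illustrative only, and the analysis-group plumbing that
AliasAnalysis uses is omitted:

  // Sketch of a common parallelism-analysis interface; names invented.
  namespace llvm {
  class Loop;

  class ParallelismAnalysis {
  public:
    virtual ~ParallelismAnalysis() {}

    // True if the iterations of L may safely execute in parallel.
    virtual bool isParallel(const Loop *L) = 0;

    // Largest safe vectorization factor for L (1 = not vectorizable).
    virtual unsigned getVectorizationFactor(const Loop *L) = 0;
  };

  // Implementations as proposed above:
  //   SCEVParallelismAnalysis       - derives facts from scalar evolution
  //   PolyhedralParallelismAnalysis - reads Polly-emitted metadata
  } // namespace llvm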
> The new approach may also make the current work on OpenMP support
> easier. Instead of generating the subfunction directly from the clast
> and inserting a new function in a region pass (it seems we can only
> insert new functions in a ModulePass or CallGraphSCC pass), we can
> extract the body of the parallel for-loop into a new function with
> the existing CodeExtractor in LLVM.
I agree we need to improve the implementation of the OpenMP support. The
reason I did not propose an integrated framework yet is that I still need
to understand OpenMP a little bit better. I hope that after the basic
OpenMP support in Polly is finished, we can move to an LLVM-integrated
approach. As we already have a working implementation and test cases to
compare against, this will probably be an easier shift.
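For reference, the CodeExtractor-based outlining could look roughly like
this (a sketch only; the CodeExtractor interface has changed considerably
across LLVM versions):

  // Sketch: outline the blocks of a parallel loop body into a new
  // function that an OpenMP runtime call can then invoke.
  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/Function.h"
  #include "llvm/Transforms/Utils/CodeExtractor.h"

  using namespace llvm;

  static Function *extractLoopBody(ArrayRef<BasicBlock *> BodyBlocks,
                                   DominatorTree &DT) {
    CodeExtractor Extractor(BodyBlocks, &DT);
    CodeExtractorAnalysisCache CEAC(*BodyBlocks.front()->getParent());
    return Extractor.extractCodeRegion(CEAC); // the new subfunction
  }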
Maybe we can start in the OpenMP area by first introducing some generic
OpenMP intrinsics, and later generating those automatically based on
metadata annotations.
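As a first idea of the direction (everything below is made up; a real
design would register proper intrinsics instead of plain declarations),
such a generic parallel-for entry point could be declared like this:

  // Sketch: declare an invented generic parallel-for function that a
  // metadata-driven pass could emit and a later pass could lower to a
  // concrete runtime such as libgomp. Nothing here exists in LLVM.
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  static FunctionCallee getGenericParallelFor(Module &M) {
    LLVMContext &Ctx = M.getContext();
    Type *I64 = Type::getInt64Ty(Ctx);
    // void genomp.parallel_for(i64 lb, i64 ub, i64 stride, body)
    FunctionType *BodyTy =
        FunctionType::get(Type::getVoidTy(Ctx), {I64}, false);
    FunctionType *FTy = FunctionType::get(
        Type::getVoidTy(Ctx),
        {I64, I64, I64, PointerType::getUnqual(BodyTy)}, false);
    return M.getOrInsertFunction("genomp.parallel_for", FTy);
  }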
Cheers
Tobi