[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
Tobias Grosser
grosser at fim.uni-passau.de
Sat Jan 8 10:26:20 PST 2011
On 01/07/2011 12:36 AM, ether zhhb wrote:
> Hi Tobi,
>
>
>>> 2. Allow some generic parallelism information to live outside a
>>> specific autopar framework, so this information can benefit more
>>> passes in LLVM. For example, the X86 and PTX backends could use it
>>> to perform target-specific auto-vectorization.
>>
>> What other types of parallelism are you expecting? We currently support
>> thread-level parallelism (as in OpenMP) and vector-level parallelism (as
>> in LLVM-IR vectors). At least for X86 I do not see any need for
>> target-specific auto-vectorization, as LLVM-IR vectors are lowered
>> extremely well to x86 SIMD instructions. I suppose this is the same for
>> all CPU targets. I still need to look into GPU targets.
>>
> I just think the vector units in different targets may have different
> widths, so the best unroll count of a loop for vectorization is not
> known in high-level optimization passes.
I believe we can obtain this information from the target data. If it is
not yet available there, target data should be extended, since high-level
loop nest transformations also need to know the vector width, and ideally
even the number of registers, if we want to support effective register
tiling.
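To illustrate, a high-level pass could then derive the unroll factor for
vectorization from such a query. This is only a sketch; target data does
not expose a SIMD register width today, so the query would be the part
that needs to be added:

  // Minimal sketch (not existing API): derive the vectorization unroll
  // factor once target data exposes the SIMD register width.
  unsigned pickVectorFactor(unsigned VectorRegisterBits,
                            unsigned ElementBits) {
    // e.g. 128-bit SSE registers, 32-bit floats -> unroll by 4
    return VectorRegisterBits / ElementBits;
  }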
>> It has however the drawback that instead of just doing code generation
>> once after Polly, we do sequential code generation -> reparsing/analysis
>> -> parallel code generation. Furthermore, the infrastructure needs to
>> pass along all the information needed for efficient parallelisation,
>> which includes at least the access strides, the alignment and the
>> privatized variables. Recomputing this information using scalar
>> evolution might be difficult, as Polly may introduce loop induction
>> variables using e.g. ceil/floor divisions.
>
> To overcome this, we can encode the "hard to recover" information as
> metadata while generating sequential code. All the later "Polyhedral
> Parallelism Analysis" pass then needs to do is read this information
> from the metadata and reanalyze only the information that is easy to
> recover. The process becomes: sequential code generation and metadata
> annotation -> read metadata (plus some cheap reparsing/analysis) ->
> parallel code generation.
I believe this is a reasonable amount of work, though in terms of
vectorization for Polly I _currently_ see limited benefits. The main
advantage is, as Renato pointed out, that we could create a very
lightweight vectorizer by taking advantage of the existing loop passes.
In terms of OpenMP code generation, this might also be a good way.
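To make the annotation side concrete, it could look roughly like the
sketch below. The "polly.loop" metadata kind and its contents are
invented here, and the metadata APIs have changed between LLVM versions:

  // Sketch: attach the hard-to-recover facts (stride, alignment) to
  // the back-branch of a generated sequential loop. All metadata names
  // are invented for illustration.
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Metadata.h"
  #include "llvm/IR/Type.h"
  #include <cstdint>

  using namespace llvm;

  static void annotateParallelLoop(BranchInst *BackBranch,
                                   uint64_t Stride, uint64_t Alignment) {
    LLVMContext &Ctx = BackBranch->getContext();
    Type *I64 = Type::getInt64Ty(Ctx);
    Metadata *Ops[] = {
        MDString::get(Ctx, "parallel"),
        ConstantAsMetadata::get(ConstantInt::get(I64, Stride)),
        ConstantAsMetadata::get(ConstantInt::get(I64, Alignment))};
    BackBranch->setMetadata("polly.loop", MDNode::get(Ctx, Ops));
  }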
> The bigger picture is:
> 1. Define a common interface for "Parallelism Analysis" or
> "LoopDependenceAnalysis", just like AliasAnalysis.
> 2. Then we can have different implementations of it. For example, we
> may have a "SCEVParallelismAnalysis", which computes the parallelism
> information based on SCEV, and a "PolyhedralParallelismAnalysis",
> which reads the "hard to recover" information from metadata,
> recomputes the cheap information, and provides all of it via the
> common "Parallelism Analysis" interface.
> 3. The auto-vectorization and parallelization codegen passes can then
> just ask the common "Parallelism Analysis" interface for the
> necessary information.
A reasonable approach.
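As a rough sketch, such an interface could look like the one below. All
names are illustrative only, and the analysis-group plumbing that
AliasAnalysis uses is omitted:

  // Sketch of a common parallelism-analysis interface; names invented.
  namespace llvm {
  class Loop;

  class ParallelismAnalysis {
  public:
    virtual ~ParallelismAnalysis() {}

    // True if the iterations of L may safely execute in parallel.
    virtual bool isParallel(const Loop *L) = 0;

    // Largest safe vectorization factor for L (1 = not vectorizable).
    virtual unsigned getVectorizationFactor(const Loop *L) = 0;
  };

  // Implementations as proposed above:
  //   SCEVParallelismAnalysis       - derives facts from scalar evolution
  //   PolyhedralParallelismAnalysis - reads Polly-emitted metadata
  } // namespace llvm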
> The new approach may also make the current work on OpenMP support
> easier. Instead of generating the subfunction directly from the clast
> and inserting a new function in a region pass (it seems we can only
> insert new functions in a ModulePass or CallGraphSCC pass), we can
> extract the body of the parallel for-loop into a new function with
> the existing CodeExtractor in LLVM.
I agree we need to improve the implementation of the OpenMP support. The
reason I did not propose an integrated framework yet is that I still need
to understand OpenMP a little bit better. I hope that after the basic
OpenMP support in Polly is finished, we can move to an LLVM-integrated
approach. As we already have a working implementation and test cases to
compare against, this will probably be an easier shift.
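For reference, the CodeExtractor-based outlining could look roughly like
this (a sketch only; the CodeExtractor interface has changed considerably
across LLVM versions):

  // Sketch: outline the blocks of a parallel loop body into a new
  // function that an OpenMP runtime call can then invoke.
  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/Function.h"
  #include "llvm/Transforms/Utils/CodeExtractor.h"

  using namespace llvm;

  static Function *extractLoopBody(ArrayRef<BasicBlock *> BodyBlocks,
                                   DominatorTree &DT) {
    CodeExtractor Extractor(BodyBlocks, &DT);
    CodeExtractorAnalysisCache CEAC(*BodyBlocks.front()->getParent());
    return Extractor.extractCodeRegion(CEAC); // the new subfunction
  }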
Maybe we can start in the OpenMP area by first introducing some generic
OpenMP intrinsics, and later generating those automatically based on
metadata annotations.
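As a first idea of the direction (everything below is made up; a real
design would register proper intrinsics instead of plain declarations),
such a generic parallel-for entry point could be declared like this:

  // Sketch: declare an invented generic parallel-for function that a
  // metadata-driven pass could emit and a later pass could lower to a
  // concrete runtime such as libgomp. Nothing here exists in LLVM.
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  static FunctionCallee getGenericParallelFor(Module &M) {
    LLVMContext &Ctx = M.getContext();
    Type *I64 = Type::getInt64Ty(Ctx);
    // void genomp.parallel_for(i64 lb, i64 ub, i64 stride, body)
    FunctionType *BodyTy =
        FunctionType::get(Type::getVoidTy(Ctx), {I64}, false);
    FunctionType *FTy = FunctionType::get(
        Type::getVoidTy(Ctx),
        {I64, I64, I64, PointerType::getUnqual(BodyTy)}, false);
    return M.getOrInsertFunction("genomp.parallel_for", FTy);
  }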
Cheers
Tobi