[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

ether zhhb etherzhhb at gmail.com
Thu Jan 6 21:36:07 PST 2011


Hi Tobi,


>> 2. Allow generic parallelism information to live outside a specific
>> autopar framework, so this information can benefit more passes in
>> LLVM. For example, the X86 and PTX backends could use this information
>> to perform target-specific auto-vectorization.
>
> What other types of parallelism are you expecting? We currently support
> thread-level parallelism (as in OpenMP) and vector-level parallelism (as
> in LLVM-IR vectors). At least for X86 I do not see any reason for
> target-specific auto-vectorization as LLVM-IR vectors are lowered
> extremely well to x86 SIMD instructions. I suppose this is the same for
> all CPU targets. I still need to look into GPU targets.
>
I just think the vector units in different targets may have different
widths, so the best unroll count of a loop for vectorization is not
known in the high-level optimization passes.
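
For example (just a toy illustration in plain C++, not any LLVM API;
128 and 256 bits are the usual SSE/AVX register widths):

#include <cstdio>

// The profitable vectorization factor is a function of the target's
// SIMD register width, which only the backend really knows.
unsigned vectorFactor(unsigned SIMDRegisterBits, unsigned ElementBits) {
  return SIMDRegisterBits / ElementBits;
}

int main() {
  // For 32-bit floats: SSE (128-bit) wants VF = 4, AVX (256-bit) VF = 8.
  std::printf("SSE: %u, AVX: %u\n",
              vectorFactor(128, 32), vectorFactor(256, 32));
  return 0;
}

So a high-level pass that hard-codes one unroll count will be suboptimal
for at least one of these targets.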


> It has however the drawback that instead of just doing code generation
> once after Polly, we do sequential code generation -> reparsing/analysis
> -> parallel code generation. Furthermore, the infrastructure needs to
> pass all the information needed for efficient parallelisation, which is
> at least the access strides, the alignment and privatized variables.
> Recomputing this information using scalar evolution might be difficult
> as Polly may introduce loop IVs using e.g. ceil/floor divisions.

To overcome this, we can encode this kind of "hard to recover"
information as metadata while generating the sequential code; all the
later "Polyhedral Parallelism Analysis" pass needs to do is read this
information back from the metadata and recompute only the remaining
information, which is easy to recover. So the process becomes:
sequential code generation and metadata annotation -> read metadata
(and perform some cheap reparsing/analysis) -> parallel code generation
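
Something like this, for example (just a sketch; the metadata kind name
"polly.loop.parallel" and the stride encoding are invented for
illustration, written against today's in-tree metadata API):

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Type.h"
using namespace llvm;

// While emitting the sequential loop, tag its back-edge branch with the
// stride, which would be hard to recover once Polly has rewritten the
// IVs with ceil/floor divisions.
void annotateParallelLoop(BranchInst *BackEdge, int64_t Stride) {
  LLVMContext &Ctx = BackEdge->getContext();
  Metadata *Ops[] = {
      MDString::get(Ctx, "stride"),
      ConstantAsMetadata::get(
          ConstantInt::get(Type::getInt64Ty(Ctx), Stride))};
  BackEdge->setMetadata("polly.loop.parallel", MDNode::get(Ctx, Ops));
}

// The later analysis just reads the annotation back instead of trying
// to recompute it with scalar evolution.
bool getAnnotatedStride(BranchInst *BackEdge, int64_t &Stride) {
  MDNode *MD = BackEdge->getMetadata("polly.loop.parallel");
  if (!MD)
    return false;
  auto *CM = cast<ConstantAsMetadata>(MD->getOperand(1));
  Stride = cast<ConstantInt>(CM->getValue())->getSExtValue();
  return true;
}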

The bigger picture is:
1. Define a common interface for "ParallelismAnalysis" or
"LoopDependenceAnalysis", just like AliasAnalysis (a rough sketch of
such an interface follows below).
2. Then we can have different implementations of the analysis.
    For example, we may have a "SCEVParallelismAnalysis", which
computes the parallelism information based on SCEV.
    And we can also have a "PolyhedralParallelismAnalysis", which
reads the "hard to recover" information from metadata, recomputes the
cheap information, and provides all of it via the common
"ParallelismAnalysis" interface.
3. The auto-vectorization and parallelization codegen passes can then
just ask the common "ParallelismAnalysis" interface for the necessary
information.
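
To make the idea concrete, the interface could look roughly like this
(all the class names here are hypothetical; nothing like this exists in
the tree yet, and the method bodies are stubs):

#include "llvm/Analysis/LoopInfo.h"
#include <cstdint>
using namespace llvm;

// Common query interface the codegen passes program against, in the
// same spirit as AliasAnalysis.
struct ParallelismAnalysis {
  virtual ~ParallelismAnalysis() {}
  // Can the iterations of L be executed in parallel?
  virtual bool isParallel(const Loop *L) = 0;
  // Constant access stride within L, if known (useful for vectorization).
  virtual bool getStride(const Loop *L, int64_t &Stride) = 0;
};

// Implementation 1: recompute everything from scalar evolution.
struct SCEVParallelismAnalysis : ParallelismAnalysis {
  bool isParallel(const Loop *L) override {
    return false; // would run a SCEV-based dependence check here
  }
  bool getStride(const Loop *L, int64_t &Stride) override {
    return false; // would derive the stride from the SCEV of the IV
  }
};

// Implementation 2: read the "hard to recover" facts from Polly's
// metadata and only recompute what is cheap.
struct PolyhedralParallelismAnalysis : ParallelismAnalysis {
  bool isParallel(const Loop *L) override {
    return false; // would look up the loop's parallelism metadata
  }
  bool getStride(const Loop *L, int64_t &Stride) override {
    return false; // would read the stride operand of the metadata
  }
};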

The new approach may also make the current work on OpenMP support
easier. Instead of generating the subfunction directly from the clast
and inserting a new function in a region pass (it seems that we can
only insert new functions in a ModulePass or CallGraphSCC pass), we
can extract the body of the parallel for loop into a new function with
the existing CodeExtractor in LLVM.
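
Roughly like this (a sketch against the current CodeExtractor
interface; the exact constructor and entry points have changed between
LLVM versions):

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/Transforms/Utils/CodeExtractor.h"
using namespace llvm;

// Outline all blocks of a parallel loop into a fresh function; live-in
// values become arguments of the new function, so a later pass can wrap
// it into the OpenMP runtime calls.
Function *outlineParallelLoop(Loop *L, DominatorTree &DT) {
  SmallVector<BasicBlock *, 8> Blocks(L->block_begin(), L->block_end());
  CodeExtractor Extractor(Blocks, &DT);
  CodeExtractorAnalysisCache CEAC(*Blocks.front()->getParent());
  return Extractor.extractCodeRegion(CEAC);
}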

best regards
ether


