[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

Tobias Grosser grosser at fim.uni-passau.de
Thu Jan 6 07:16:57 PST 2011


On 01/06/2011 03:38 AM, ether zhhb wrote:
> Hi,
>
> I just had a detailed look at the code of Polly[1]; it seems that Polly
> is starting to support some basic auto-parallelization features.

This is true. However, it is still work in progress. I hope we can soon
show some interesting results.

> I have some ideas to improve the current auto-vectorization
> and parallelization approach in Polly.

> The main idea is that we separate the transform passes and the codegen
> passes for auto-parallelization and vectorization (Graphite[2] for gcc
> seems to take a similar approach for auto-vectorization).

This is just historical. Sebastian is planning to couple them a lot
more closely. Graphite and auto-parallelization/vectorization are not
yet playing well together, even though Sebastian has made huge progress
with the Graphite<->Autovec interaction.

> That means Polly (or any similar framework) should perform the
> necessary code transformations, then just generate sequential code and
> provide the necessary parallelism information (this information could
> be encoded as metadata, just like TBAA); later passes can then
> generate the parallel code from that parallelism information.

I also thought about this approach.
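To illustrate the idea: a loop that Polly has proven parallel could
carry a TBAA-style annotation, e.g. (the metadata kind names here are
purely made up):

  for.body:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
    %ptr = getelementptr float* %A, i64 %iv
    store float 1.0, float* %ptr, !parallel.mem !0
    %iv.next = add i64 %iv, 1
    %exitcond = icmp eq i64 %iv.next, %n
    br i1 %exitcond, label %for.end, label %for.body, !parallel.loop !0

  !0 = metadata !{metadata !"polly.loop.parallel"}

A later pass that sees !parallel.loop on the back edge and !parallel.mem
on all memory accesses in the body would be free to emit threaded or
vectorized code.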

> The main benefits of separating the transform passes and the codegen
> passes are:
> 1. It decouples the autopar framework from the various parallel
> runtime environments, so we can keep both the autopar framework and
> the code generation pass for a specific parallel runtime environment
> simple. And we can support more parallel runtime environments easily.

Yes, as soon as we start to support more targets we should move some of
the logic out into separate functions/files/passes. This could be done
either inside Polly or as a generic LLVM pass, depending on which is
more suitable.

> 2. It allows generic parallelism information to live outside the
> specific autopar framework, so this information can benefit more
> passes in llvm. For example, the X86 and PTX backends could use this
> information to perform target-specific auto-vectorization.

What other types of parallelism are you expecting? We currently support 
thread-level parallelism (as in OpenMP) and vector-level parallelism
(as in LLVM-IR vectors). At least for X86 I do not see any reason for
target-specific auto-vectorization, as LLVM-IR vectors are lowered 
extremely well to x86 SIMD instructions. I suppose this is the same for 
all CPU targets. I still need to look into GPU targets.
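For example, a generic LLVM-IR vector operation such as

  define <4 x float> @vec_add(<4 x float> %a, <4 x float> %b) {
    %sum = fadd <4 x float> %a, %b   ; generic 4-wide SIMD add
    ret <4 x float> %sum
  }

is lowered by the x86 backend to a single addps instruction whenever
SSE is available, without any target-specific vectorizer pass.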

> Implementation consideration:
>
> We may define some kind of generic "parallelism analysis", or a
> generic version of the "LoopDependenceAnalysis" interface just like
> AliasAnalysis, or we can encode the parallelism information as
> metadata. Combining both should be OK, too.

Thanks, ether, for starting to reason about this.

My current plan is as follows:

Create stable support for OpenMP and LLVM-IR vector generation inside 
Polly. This is not very difficult, as the needed analyses can easily
be performed in the polyhedral framework. Code generation is not 
difficult either. SIMD vector support is almost complete and OpenMP
code generation is about halfway done. Based on that infrastructure we
can take advantage of both thread-level and SIMD-level parallelism,
which covers the two most common targets.
The next step will be to evaluate loop transformations that enable 
thread- and vector-level parallelism, as is done e.g. in Pluto[1].
I am pretty confident we can show good improvements. Stay tuned. ;-)
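To sketch what the OpenMP code generation produces: the loop body is
extracted into a subfunction and handed to the libgomp runtime. The
subfunction name below is made up; the GOMP entry points are the real
libgomp interface:

  ; %omp.data points to a struct holding the values used in the body
  call void @GOMP_parallel_start(void (i8*)* @loop.body.subfn,
                                 i8* %omp.data, i32 0)
  call void @loop.body.subfn(i8* %omp.data)  ; master thread joins in
  call void @GOMP_parallel_end()

  declare void @GOMP_parallel_start(void (i8*)*, i8*, i32)
  declare void @GOMP_parallel_end()

Passing 0 as the thread count lets the runtime pick the default.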

Regarding your proposal, I hope to move some of the code into more 
generic frameworks. I am thinking e.g. about introducing OpenMP 
intrinsics to support different OpenMP libraries like libgomp and 
mpc[2]. LLVM-IR vector instructions, however, are already generic SIMD
instructions, so I do not see any reason to create target-specific
auto-vectorizer passes.
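Such an intrinsic could look roughly like this (name and signature are
purely a sketch, nothing of this exists yet); a late lowering pass
would then map it onto libgomp or mpc calls:

  declare void @llvm.parallel.region(void (i8*)*, i8*)

  ; lowered later to GOMP_parallel_start/end or the mpc equivalent
  call void @llvm.parallel.region(void (i8*)* @loop.body.subfn,
                                  i8* %omp.data)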

The remaining question is how/where to generate parallel/vector
code.

The current approach is to do this directly at Polly code generation 
time; the solution you proposed would be to generate sequential
code inside Polly, annotate it with metadata, reparse it and finally
create OpenMP/vector code.

The second approach would have the advantage that multiple LLVM passes
could use the information and generate code from it, and that several
analyses could create the needed metadata.

It has, however, the drawback that instead of doing code generation
just once after Polly, we do sequential code generation ->
reparsing/analysis -> parallel code generation. Furthermore, the
infrastructure needs to pass along all the information needed for
efficient parallelization, which is at least the access strides, the
alignment and the privatized variables. Recomputing this information
using scalar evolution might be difficult, as Polly may introduce
loop induction variables using e.g. ceil/floor divisions.
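For example, after tiling, a loop bound may become floor((n - 1) / 64),
which in IR involves a division (a sketch, assuming n >= 1):

  %tmp = add i64 %n, -1
  %ub  = sdiv i64 %tmp, 64   ; floor division introduced by tiling
  ; the loop then runs from 0 to %ub; scalar evolution does not model
  ; such divisions as affine expressions, so the strides and alignment
  ; Polly had already proven are hard to rediscover afterwards.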

As Polly, based on the current approach, will soon support
target-agnostic auto-vectorization and OpenMP code generation, I plan
to focus now on polyhedral loop nest optimizations that enable
efficient vectorization and thread-level parallelism. As these
transformations are performed on the abstract polyhedral description,
no further changes to the code generation are needed. In case a larger
framework is shown to be useful, I will definitely support it. For the
moment, however, I am fine with the existing lightweight approach.

Cheers
Tobi

[1] http://pluto-compiler.sourceforge.net/
[2] http://mpc.sf.net


