[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
Tobias Grosser
grosser at fim.uni-passau.de
Thu Jan 6 07:16:57 PST 2011
On 01/06/2011 03:38 AM, ether zhhb wrote:
> Hi,
>
> I just had a detailed look at the code of Polly[1]; it seems that Polly
> is starting to support some basic auto-parallelization.
This is true. However, it is still work in progress. I hope we can soon
show some interesting results.
> I have some ideas to improve the current auto-vectorization
> and parallelization approach in Polly.
> The main idea is to separate the transform passes and the codegen passes
> for auto-parallelization and vectorization (Graphite[2] for GCC seems
> to take a similar approach for auto-vectorization).
This is just historical. Sebastian is planning to couple them a lot
more closely. Graphite and auto-parallelization/vectorization are not yet
playing well together, even though Sebastian has made huge progress with
the Graphite<->Autovec interaction.
> That means Polly (or another similar framework) should perform the
> necessary code transformations, then just generate sequential code and
> provide the necessary parallelism information (this information could
> be encoded as metadata, just like TBAA); later passes can then generate
> parallel code based on that information.
I also thought about this approach.
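To illustrate, such annotations could look similar to TBAA metadata. A
minimal sketch (the !parallel kind and the string are made up, nothing
like this exists yet):

  for.body:
    %val = load float* %src, !parallel !0
    store float %val, float* %dst, !parallel !0
    ; ... rest of the loop body ...
    br i1 %exitcond, label %for.end, label %for.body

  !0 = metadata !{metadata !"polly.loop.parallel"}

A later pass that understands this metadata could then emit OpenMP or
vector code without redoing the dependence analysis.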
> The main benefits of separating the transform passes and codegen passes are:
> 1. It decouples the autopar framework from the various parallel runtime
> environments, so we can keep both the autopar framework and the code
> generation pass for a specific parallel runtime environment simple. And
> we can support more parallel runtime environments easily.
Yes, as soon as we start to support more targets we should move some of
the logic out into separate functions/files/passes. This could either be
done inside Polly or as a generic LLVM pass, depending on which is more
suitable.
> 2. It allows generic parallelism information to live outside a specific
> autopar framework, so this information can benefit more passes in
> LLVM. For example, the X86 and PTX backends could use this information
> to perform target-specific auto-vectorization.
What other types of parallelism are you expecting? We currently support
thread level parallelism (as in OpenMP) and vector level parallelism (as
in LLVM-IR vectors). At least for X86 I do not see any reason for
target-specific auto-vectorization, as LLVM-IR vectors are lowered
extremely well to x86 SIMD instructions. I suppose this is the same for
all CPU targets. I still need to look into GPU targets.
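For example, a plain LLVM-IR vector operation like

  define <4 x float> @vadd(<4 x float> %a, <4 x float> %b) {
    %sum = fadd <4 x float> %a, %b
    ret <4 x float> %sum
  }

is lowered by the x86 backend to a single SSE 'addps' instruction (on
targets with SSE), so there is nothing target specific left for the
vectorizer to do.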
> Implementation consideration:
>
> We may define some kind of generic "parallelism analysis", or a generic
> version of the "LoopDependenceAnalysis" interface just like
> AliasAnalysis; or we could encode the parallelism information as
> metadata. Combining both should be fine, too.
Thanks ether for starting to reason about this.
My current plan is as follows:
Create stable support for OpenMP and LLVM-IR vector generation inside
Polly. This is not very difficult, as the needed analyses can easily be
performed in the polyhedral framework. Code generation is not difficult
either. SIMD vector support is almost complete and OpenMP code
generation is also halfway through. Based on that infrastructure we can
take advantage of both thread level and SIMD level parallelism, which
covers the two most common targets.
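To sketch what the OpenMP code generation produces for libgomp
(simplified; the subfunction name is made up, but GOMP_parallel_start
and GOMP_parallel_end are actual libgomp entry points):

  declare void @GOMP_parallel_start(void (i8*)*, i8*, i32)
  declare void @GOMP_parallel_end()

  ; The loop body is extracted into a subfunction that each thread
  ; executes on its share of the iteration space.
  define internal void @loop.subfn(i8* %data) {
  entry:
    ; ... compute this thread's chunk and run the loop body ...
    ret void
  }

  define void @parallel.loop(i8* %shared) {
  entry:
    ; Spawn the team (0 = let the runtime pick the thread count),
    ; join in as the master thread, then wait for the team.
    call void @GOMP_parallel_start(void (i8*)* @loop.subfn, i8* %shared, i32 0)
    call void @loop.subfn(i8* %shared)
    call void @GOMP_parallel_end()
    ret void
  }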
The next step will be to evaluate loop transformations that enable
thread and vector level parallelism, as is done e.g. in Pluto[1].
I am pretty confident we can show good improvements. Stay tuned. ;-)
Regarding your proposal, I hope to move some of the code into more
generic frameworks. I am thinking e.g. about introducing OpenMP
intrinsics to support different OpenMP libraries like libgomp and
mpc[2]. LLVM-IR vector instructions, however, are generic SIMD
instructions, so I do not see any reason to create target-specific
auto-vectorizer passes.
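Such intrinsics could look like this (the names are entirely made up,
just to show the idea):

  declare void @llvm.omp.parallel.start(void (i8*)*, i8*, i32)
  declare void @llvm.omp.parallel.end()

Polly would emit calls to these, and a separate lowering pass would
translate them either to the GOMP_* entry points of libgomp or to the
corresponding mpc runtime calls.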
The remaining question is how/where to generate parallel/vector code.
The current approach is to do this directly at Polly code generation
time; the solution you proposed would be to generate sequential code
inside Polly, annotate it with metadata, reparse it, and finally create
OpenMP/vector code.
The second approach would have the advantage that multiple LLVM passes
could use the information and generate code from it, and that several
analyses could create the needed metadata.
It has, however, the drawback that instead of just doing code generation
once after Polly, we do sequential code generation -> reparsing/analysis
-> parallel code generation. Furthermore, the infrastructure needs to
pass along all the information needed for efficient parallelization,
which is at least the access strides, the alignment, and the privatized
variables. Recomputing this information using scalar evolution might be
difficult, as Polly may introduce loop induction variables using e.g.
ceil/floor divisions.
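To give an idea of the problem, after a transformation Polly may emit
loop bounds like this (just a sketch):

  define void @tiled(i64 %n) {
  entry:
    %ub = sdiv i64 %n, 4         ; a floor division introduced by Polly
    br label %loop
  loop:
    %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
    ; ... loop body ...
    %iv.next = add i64 %iv, 1
    %done = icmp sge i64 %iv.next, %ub
    br i1 %done, label %exit, label %loop
  exit:
    ret void
  }

Scalar evolution has no expression for signed division, so %ub becomes
an unknown, and the stride and trip count information Polly had at
transformation time is hard to recover.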
As Polly - based on the current approach - will soon support
target-agnostic auto-vectorization and OpenMP code generation, I plan to
now focus on polyhedral loop nest optimizations that enable efficient
vectorization and thread level parallelism. As these transformations are
performed on the abstract polyhedral description, no further changes to
the code generation are needed. In case a larger framework is shown to
be useful, I will definitely support this. For the moment, however, I am
fine with the existing lightweight approach.
Cheers
Tobi
[1] http://pluto-compiler.sourceforge.net/
[2] http://mpc.sf.net