[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

Thu Jan 6 07:59:10 PST 2011

On 6 January 2011 15:16, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
>> The main idea is, we separate the transform passes and codegen passes
>> for auto-parallelization and vectorization (Graphite[2] for gcc seems
>> to be taking a similar approach for auto-vectorization).

I agree with Ether.

A two-stage vectorization would allow you to use the simple
loop-unroller already in place to generate vector/mp intrinsics, and,
if more parallelism is required, use the expensive Poly framework to
skew loops and remove dependencies, so the loop-unroller and other
cheap bits can do their job where they couldn't before.

So, in essence, this is a three-stage job: the optional heavy-duty
Poly analysis, the cheap loop-optimizer and the mp/vector
transformation pass. The best feature of having the three is being
able to choose the level of vectorization you want and to re-use the
current loop analysis in the scheme.
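
To make the skewing step concrete, here is a rough sketch in plain C
(not Polly output; the function names and array sizes are made up for
illustration). The original nest carries a dependence on both loops.
Skewing it and interchanging the loops gives a wavefront order in which
the inner loop has no loop-carried dependence, which is the kind of loop
the cheap stages could then handle.

```c
#define N 6
#define M 5

/* Original nest: each cell depends on its upper and left neighbours, so
   neither the i loop nor the j loop can run in parallel as written. */
void wavefront_original(int a[N][M]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Skewed nest: walk the anti-diagonals d = i + j. Every cell on one
   diagonal reads only cells from the previous diagonal, so the inner
   loop carries no dependence between its iterations. */
void wavefront_skewed(int a[N][M]) {
    for (int d = 2; d <= (N - 1) + (M - 1); d++) {
        /* Clamp i so that both i and j = d - i stay inside the interior. */
        int i_lo = d - (M - 1) > 1 ? d - (M - 1) : 1;
        int i_hi = d - 1 < N - 1 ? d - 1 : N - 1;
        for (int i = i_lo; i <= i_hi; i++) {
            int j = d - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }
}

/* Hypothetical self-check: run both versions on identical input and
   return 1 if every element matches, 0 otherwise. */
int wavefronts_agree(void) {
    int a[N][M], b[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = b[i][j] = i + 2 * j + 1;
    wavefront_original(a);
    wavefront_skewed(b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}
```

The skewed version does the same work in a different order; the gain is
only that the inner loop is now free of dependencies, so a later pass
can unroll, vectorize or thread it.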

> What other types of parallelism are you expecting? We currently support
> thread level parallelism (as in OpenMP) and vector level parallelism (as
> in LLVM-IR vectors). At least for X86 I do not see any reason for
> target specific auto-vectorization as LLVM-IR vectors are lowered
> extremely well to x86 SIMD instructions. I suppose this is the same for
> all CPU targets. I still need to look into GPU targets.

I'd suggest trying to transform sequential instructions into vector
instructions (in the third stage) when that is proven to be correct.

So, when Poly skews a loop, and the loop analysis unrolls it to, say,
4 calls to the same instruction, metadata binding them together can
hint to the third stage to turn them into a single vector operation
with the same semantics.
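
As a rough illustration of that last step, the sketch below shows the
same loop body twice: once as the unroller might leave it (four scalar
adds) and once as the third stage could rewrite it (one vector add). It
uses the GCC/Clang vector_size extension as a stand-in for LLVM-IR
vectors; the function names are hypothetical and no metadata is
modelled here.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: four 32-bit ints handled as one value. */
typedef int v4si __attribute__((vector_size(16)));

/* After unrolling by 4: four independent scalar adds per iteration. */
void add_unrolled(const int *a, const int *b, int *c, int n) {
    assert(n % 4 == 0); /* sketch only: no remainder loop */
    for (int i = 0; i < n; i += 4) {
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}

/* After the vector stage: the four adds become one vector add.
   memcpy is used for the loads and stores so that no alignment is
   assumed for the input pointers. */
void add_vectorized(const int *a, const int *b, int *c, int n) {
    assert(n % 4 == 0); /* sketch only: no remainder loop */
    for (int i = 0; i < n; i += 4) {
        v4si va, vb, vc;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb; /* element-wise add on all four lanes */
        memcpy(c + i, &vc, sizeof vc);
    }
}
```

The two versions only agree when c does not overlap a or b at a
shifted offset (for example c == a + 1), which is why the rewrite has
to be proven correct before the third stage applies it.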

> LLVM-IR vector instructions however are generic SIMD
> instructions so I do not see any reason to create target specific
> auto vectorizer passes.

If you're assuming the original code is using intrinsics, that is
correct. But if you want to generate the vector code from Poly, then
you need to add that support, too.

ARM also has good vector instruction selection (on Cortex-A* with
NEON), so you also get that for free. ;)

cheers,
--renato