# [LLVMdev] Vectorization: Next Steps

Tobias Grosser tobias at grosser.es
Tue Feb 7 07:37:34 PST 2012

On 02/06/2012 10:02 PM, Preston Briggs wrote:
> On Mon, Feb 6, 2012 at 12:13 PM, Sebastian Pop <spop at codeaurora.org
> <mailto:spop at codeaurora.org>> wrote:
>> [many things, but I'm only going to focus on one of them] Would
>> you consider using Polly http://polly.grosser.es to avoid writing
>> this code?
>
> My impression is that Polly (and polyhedral analysis generally)
> doesn't do want I want. But I'm happy to talk about it 'cause I
> might be wrong.

That depends what you want. ;-)

Polly may offer good possibilities (and maybe even code) or it may not
help at all. The best is to discuss this stuff.

> If I look at all the ways I know how to sort in parallel, they all
> look roughly like this simple counting sort:
>
> for (unsigned i = 0; i < buckets; i++)
>   count[i] = 0;
>
> for (unsigned i = 0; i < n; i++)
>   count[src[i]]++;
>
> start[0] = 0;
> for (unsigned i = 1; i < buckets; i++)
>   start[i] = start[i - 1] + count[i - 1];
>
> #pragma assert parallel
> for (unsigned i = 0; i < n; i++) {
>   unsigned loc = int_fetch_add(start + src[i], 1);

Should this be:

>   dst[loc] = src[i];
> }
>
>
> The 1st loop is trivially parallel.  I think Polly would recognize
> this and do good things.

This case is trivial.

But keep in mind that unsigned loop ivs and integers can cause modulo
wrapping which is not trivial to handle - Both Polly, but also any other
tool, need to take into account additional (possibly non-uniform)
dependences that may appear in addition to the dependences for
non-wrapping integers. (In LLVM even signed integers can have modulo
wrapping, so actually we need to always be careful here).

isl can actually represent modulo constraints pretty well.

> The 2nd loop has a race condition that can be handled by using an
> atomic increment provided by the architecture, if the compiler knows
> about such things. I don't think Polly would help here.

> for (unsigned i = 0; i < n; i++)
>   count[src[i]]++;

I do not see why it should be difficult to teach polyhedral tools to
model an increment and to understand that it can be done atomically.
Though, if you just want to do this specific transformation, Polly is

The big selling point of polyhedral transformations is uniform handling
of several transformations. In case you want additional loop
transformations to expose such patterns Polly might be helpful.

Another reason may be GPU code generation. Here you often use techniques
like tiling to create the thread groups. Within the polyhedral model
this is a very simple transformation. On LLVM-IR that may become complex.

> The 3rd loop is a linear recurrence that can be rewritten and solved
> in parallel, if the compiler knows about such things. I don't think
> Polly would help here.

Polly has not perform this transformation yet. But it may still be
helpful as it abstracts away the low level details of LLVM-IR and allows
you to focus on the higher level problem.

> start[0] = 0;
> for (unsigned i = 1; i < buckets; i++)
>   start[i] = start[i - 1] + count[i - 1];

The previous code will be translated to:

Statements: Init[]; Body[i] : 1 <= i < buckets

Writes: Init[] -> start[0]; Body[i] -> start[i];
Reades: Body[i] -> start[i-1]; Body[i] -> count[i-1];

performed by body

The idea is to go away from low level pattern matching, but to use a
high level approach to implement the transformations proposed by you.
For code that does not fit into the polyhedral model, this is obviously
not possible, but the examples you have given so far can be represented
easily.

> The 4th loop relies on the programmer to use both an explicit
> assertion and an explicit fetch-and-add intrinsic. I think both of
> these are outside of Polly's scope.

We do not yet use this within Polly, but I think asking the programmer
to provide additional knowledge (when available) is more than reasonable.

> #pragma assert parallel
> for (unsigned i = 0; i < n; i++) {
>   unsigned loc = int_fetch_add(start[src[i]], 1);
>   dst[loc] = src[i];
> }

As the int_fetch_add is side effect free, it is fully
polyhedral. It can be implemented with the relevant LLVM
atomicrmw instruction [1]. Polly does not yet allow atomicrmw
instructions, but adding support for them should not be too difficult.

If polly get's the additional #pragma information, it can remove all
dependences and it can execute the code in parallel. This is also not
implemented, but seems doable.

> This is the sort of code I want to handle -- parallel code written
> by knowledgable programmers who're counting on the compiler to
> recognize parallel loops, recognize and rewrite recurrences and
> reductions, and insert synchronization where helpful, using a
> variety of directives to clarify their intent.

Sounds very much like what we want to handle with Polly. This does
not mean we handle all of this already, but we already have some high
level optimizations based on the Pluto algorithm [2] and more
importantly, we already are able to abstract away a lot of low level
details. So I definitely expect significant overlap. Let's see where we
can reuse each others work.

(I am e.g. very interested in array delinearization or in how you pass
loop pragma-annotations to LLVM-IR)  :-)

Cheers
Tobi

[1] http://llvm.org/docs/LangRef.html#i_atomicrmw
[2] http://pluto-compiler.sourceforge.net/