[LLVMdev] Vectorization: Next Steps

Tue Feb 7 10:04:44 PST 2012

>> for (unsigned i = 0; i < buckets; i++)
>>  count[i] = 0;
>>
>> for (unsigned i = 0; i < n; i++)
>>   count[src[i]]++;
>>
>> start[0] = 0;
>> for (unsigned i = 1; i < buckets; i++)
>>   start[i] = start[i - 1] + count[i - 1];
>>
>> #pragma assert parallel
>> for (unsigned i = 0; i < n; i++) {
>>   unsigned loc = int_fetch_add(start + src[i], 1);
>>   dst[loc] = src[i];
>> }

> Should this be:
>
>     unsigned loc = int_fetch_add(start[src[i]], 1);

Our intrinsic wants a pointer, so either int_fetch_add(start + src[i], 1)
or int_fetch_add(&start[src[i]], 1) wil work.

>> The 1st loop is trivially parallel.  I think Polly would recognize
>> this and do good things.
>
> This case is trivial.
>
> But keep in mind that unsigned loop ivs and integers can cause modulo
> wrapping which is not trivial to handle - Both Polly, but also any other
> tool, need to take into account additional (possibly non-uniform)
> dependences that may appear in addition to the dependences for non-wrapping
> integers. (In LLVM even signed integers can have modulo wrapping, so
> actually we need to always be careful here).

Sure.  In this case, it would affect the code that determines the
number of iterations.
If the loop terminates, there's no dependence.

>> The 2nd loop has a race condition that can be handled by using an
>> atomic increment provided by the architecture, if the compiler knows
>> about such things. I don't think Polly would help here.
>
>> for (unsigned i = 0; i < n; i++)
>>   count[src[i]]++;
>
> I do not see why it should be difficult to teach polyhedral tools to
> model an increment and to understand that it can be done atomically.

When I read about polyhedral frame works, the first thing I see is
the restriction to affine subscripts. For loops 2 and 4, this seems to be
the stumbling block.

> The big selling point of polyhedral transformations is uniform handling
> of several transformations.

Right.  In olden times, people use a bunch of independent loop xforms,
composed in different ways to achieve different effects.  Then came the
unimodular xforms which could achieve the effect of combining several
xforms (but not all). The came the polyhedral approach, which could do
quite a bit more, including the actual dependence analysis, as long as
your problem fit in the framework.

I think the sorting examples don't fit within the framework.
Indeed, I think it'd be tough to express a sort (something easy, like a
vector of unsigned values) in a way that Polly would parallelize all the loops.

> Another reason may be GPU code generation. Here you often use techniques
> like tiling to create the thread groups. Within the polyhedral model this is
> a very simple transformation. On LLVM-IR that may become complex.

Yes, of course Polly is good for many things, but I use reductions and
recurrences every day; I must have a good way to handle them.

>> The 3rd loop is a linear recurrence that can be rewritten and solved
>> in parallel, if the compiler knows about such things. I don't think
>> Polly would help here.
>
> Polly has not perform this transformation yet. But it may still be helpful
> as it abstracts away the low level details of LLVM-IR and allows you to
> focus on the higher level problem.

That's the way I think of the dependence graph. It gets me above the IR
and lets me study the flow of dependences.

>> start[0] = 0;
>> for (unsigned i = 1; i < buckets; i++)
>>   start[i] = start[i - 1] + count[i - 1];
>
> The previous code will be translated to:
>
> Statements: Init[]; Body[i] : 1 <= i < buckets
>
> Writes: Init[] -> start[0]; Body[i] -> start[i];
> Reades: Body[i] -> start[i-1]; Body[i] -> count[i-1];
>
> When we additionally extract the information about the operation performed
> by body
>
> The idea is to go away from low level pattern matching, but to use a
> high level approach to implement the transformations proposed by you.
> For code that does not fit into the polyhedral model, this is obviously
> not possible, but the examples you have given so far can be represented
> easily.
>
>> The 4th loop relies on the programmer to use both an explicit
>> assertion and an explicit fetch-and-add intrinsic. I think both of
>> these are outside of Polly's scope.
>
> We do not yet use this within Polly, but I think asking the programmer
> to provide additional knowledge (when available) is more than reasonable.

Yep.  My doubts arose because nobody ever mentions any way to take
directives into account when analyzing loops with the polyhedral approach.

>> #pragma assert parallel
>> for (unsigned i = 0; i < n; i++) {
>>   unsigned loc = int_fetch_add(&start[src[i]], 1);
>>  dst[loc] = src[i];
>> }
>
> As the int_fetch_add is side effect free, it is fully
> polyhedral. It can be implemented with the relevant LLVM
> atomicrmw instruction [1]. Polly does not yet allow atomicrmw instructions,
> but adding support for them should not be too difficult.

The non-affine subscript is a problem, isn't it?
And do we really consider the int_fetch_add side-effect free?
After all, memory is modified.

> If polly get's the additional #pragma information, it can remove all
> dependences and it can execute the code in parallel. This is also not
> implemented, but seems doable.

But is it?  Nobody seems to write about it. Makes me wonder why not.

>> This is the sort of code I want to handle -- parallel code written
>> by knowledgable programmers who're counting on the compiler to
>> recognize parallel loops, recognize and rewrite recurrences and
>> reductions, and insert synchronization where helpful, using a
>> variety of directives to clarify their intent.
>
> Sounds very much like what we want to handle with Polly. This does
> not mean we handle all of this already, but we already have some high level
> optimizations based on the Pluto algorithm [2] and more importantly, we
> already are able to abstract away a lot of low level details. So I
> definitely expect significant overlap. Let's see where we can reuse each
> others work.

Sounds good.

> (I am e.g. very interested in array delinearization or in how you pass loop
> pragma-annotations to LLVM-IR)

I'm a long ways from doing these things,
so don't have much to say yet.
It's just the direction I'm headed.

Preston