[LLVMdev] Vectorization: Next Steps

Wed Feb 8 02:23:11 PST 2012

On 02/07/2012 07:04 PM, Preston Briggs wrote:
>>> The 1st loop is trivially parallel.  I think Polly would recognize
>>> this and do good things.
>>
>> This case is trivial.
>>
>> But keep in mind that unsigned loop ivs and integers can cause modulo
>> wrapping which is not trivial to handle - Both Polly, but also any other
>> tool, need to take into account additional (possibly% non-uniform)
>> dependences that may appear in addition to the dependences for non-wrapping
>> integers. (In LLVM even signed integers can have modulo wrapping, so
>> actually we need to always be careful here).
>
> Sure.  In this case, it would affect the code that determines the
> number of iterations.
> If the loop terminates, there's no dependence.

This becomes interesting if you have an upper bound that is
a + b. Keep in mind that you need to calculate this module an some 
integer value that is derived from the type size. This becomes tricky 
when fusing loops.

>>> The 2nd loop has a race condition that can be handled by using an
>>> atomic increment provided by the architecture, if the compiler knows
>>> about such things. I don't think Polly would help here.
>>
>>> for (unsigned i = 0; i<  n; i++)
>>>    count[src[i]]++;
>>
>> I do not see why it should be difficult to teach polyhedral tools to
>> model an increment and to understand that it can be done atomically.
>
> When I read about polyhedral frame works, the first thing I see is
> the restriction to affine subscripts. For loops 2 and 4, this seems to be
> the stumbling block.

You need affine subscripts to get exact dependence information (meaning 
to know statically which statement instance has written the value read 
by another statement instance). As this information is not available at 
compile time, neither a polyhedral framework nor any other framework can 
derive it statically. In such a case you can over approximate. For the 
example above this would be a read of count[p], where p is an unknown 
value. In case you have additional information you may get away with 
less over approximation. E.g. you could take into account that p is even 
or that you know src[i] is a bijective function.

Just because you can represent affine subscripts very detailed, does not 
mean non affine subscripts cannot be worked with at all. They just 
provide less information.

>> The big selling point of polyhedral transformations is uniform handling
>> of several transformations.
>
> Right.  In olden times, people use a bunch of independent loop xforms,
> composed in different ways to achieve different effects.  Then came the
> unimodular xforms which could achieve the effect of combining several
> xforms (but not all). The came the polyhedral approach, which could do
> quite a bit more, including the actual dependence analysis, as long as
> your problem fit in the framework.
 >
> I think the sorting examples don't fit within the framework.

Actually they are pretty close. At least from the ones that I have seen.

> Indeed, I think it'd be tough to express a sort (something easy, like a
> vector of unsigned values) in a way that Polly would parallelize all the loops.

Knowing if Polly would parallelize some code is hard to judge without 
seeing the code, but as Polly just recently has grown the first complex 
transformations I would doubt it. The more interesting question is how 
much of the stuff in Polly can be reused, even though Polly does not do
this transformation (and may never do it). Maybe you can use the 
RegionInfo infrastructure from LLVM which was developed for Polly, some 
of the canonicalization passes, the dependency analysis or you can steal 
ideas from the OpenMP code generation.

As there is a lot of overlap, we should make sure to share code. I just 
realized that I would love to use an independent, generic OpenMP code 
generation in Polly. Let me know if you start work in this area

>> Another reason may be GPU code generation. Here you often use techniques
>> like tiling to create the thread groups. Within the polyhedral model this is
>> a very simple transformation. On LLVM-IR that may become complex.
>
> Yes, of course Polly is good for many things, but I use reductions and
> recurrences every day; I must have a good way to handle them.

So how does such a representation look like?

 From my point of view reductions are not very special. Basically a 
sequence of statement iterations that work on the same data element. If 
the operation applied is associative some nice optimizations can be 
performed and some dependences can be removed.

I need to read the paper more thoroughly to get the exact definition and 
constraints of (bounded) recurrences, but from looking at your code I 
believe it can be detected and matched with Polly. The transformation 
you apply do not seem to be a pure scheduling transformation, but rather 
introduces new statements and may even create non affine loops. This is 
more complex to handle and maybe not even possible or clever to do in a 
polyhedral framework. I will try to have a close look into the paper to 
understand what transformations you are doing.

>>> The 3rd loop is a linear recurrence that can be rewritten and solved
>>> in parallel, if the compiler knows about such things. I don't think
>>> Polly would help here.
>>
>> Polly has not perform this transformation yet. But it may still be helpful
>> as it abstracts away the low level details of LLVM-IR and allows you to
>> focus on the higher level problem.
>
> That's the way I think of the dependence graph. It gets me above the IR
> and lets me study the flow of dependences.

OK. So let me know a little bit more about your dependence graph. What 
is a statement in your graph? An LLVM-IR instruction? Or do you group 
them somehow?

>>> start[0] = 0;
>>> for (unsigned i = 1; i<  buckets; i++)
>>>    start[i] = start[i - 1] + count[i - 1];
>>
>> The previous code will be translated to:
>>
>> Statements: Init[]; Body[i] : 1<= i<  buckets
>>
>> Writes: Init[] ->  start[0]; Body[i] ->  start[i];
>> Reades: Body[i] ->  start[i-1]; Body[i] ->  count[i-1];
>>
>> When we additionally extract the information about the operation performed
>> by body
>>
>> The idea is to go away from low level pattern matching, but to use a
>> high level approach to implement the transformations proposed by you.
>> For code that does not fit into the polyhedral model, this is obviously
>> not possible, but the examples you have given so far can be represented
>> easily.
>>
>>> The 4th loop relies on the programmer to use both an explicit
>>> assertion and an explicit fetch-and-add intrinsic. I think both of
>>> these are outside of Polly's scope.
>>
>> We do not yet use this within Polly, but I think asking the programmer
>> to provide additional knowledge (when available) is more than reasonable.
>
> Yep.  My doubts arose because nobody ever mentions any way to take
> directives into account when analyzing loops with the polyhedral approach.

It is not directly obvious how you would pass this information to 
LLVM-IR. But for source to source tools people commonly use annotations 
to provide information about live-in, live-out values, the sizes of 
arrays, ... It makes perfect sense to state the absence of dependences.

>>> #pragma assert parallel
>>> for (unsigned i = 0; i<  n; i++) {
>>>    unsigned loc = int_fetch_add(&start[src[i]], 1);
>>>   dst[loc] = src[i];
>>> }
>>
>> As the int_fetch_add is side effect free, it is fully
>> polyhedral. It can be implemented with the relevant LLVM
>> atomicrmw instruction [1]. Polly does not yet allow atomicrmw instructions,
>> but adding support for them should not be too difficult.
>
> The non-affine subscript is a problem, isn't it?

See above. It just limits how detailed we can analyze the problem. As 
the user stated that there are no dependences in the loop, there is  no 
need for exact dependences. The loop can be parallelized even with a 
conservative over approximation of the dependences.

> And do we really consider the int_fetch_add side-effect free?
> After all, memory is modified.

If you implement it with the atomicrmw instruction only the memory that 
is pointed to is modified. side-effect free does not mean nothing is 
touched at all, but that we know about the side effects.

>> If polly get's the additional #pragma information, it can remove all
>> dependences and it can execute the code in parallel. This is also not
>> implemented, but seems doable.
>
> But is it?  Nobody seems to write about it. Makes me wonder why not.

The problem is not removing the dependences. This is simple. In a source 
to source tool, we could do it easily. Passing such information from a C 
loop to a LLVM-IR loop is more difficult (With or without Polly). There 
is no explicit LLVM-IR so you may need to add meta-data to the loop 
header or the loop induction variable. That can work, but is a little 
difficult to maintain through transformations.

>>> This is the sort of code I want to handle -- parallel code written
>>> by knowledgable programmers who're counting on the compiler to
>>> recognize parallel loops, recognize and rewrite recurrences and
>>> reductions, and insert synchronization where helpful, using a
>>> variety of directives to clarify their intent.
>>
>> Sounds very much like what we want to handle with Polly. This does
>> not mean we handle all of this already, but we already have some high level
>> optimizations based on the Pluto algorithm [2] and more importantly, we
>> already are able to abstract away a lot of low level details. So I
>> definitely expect significant overlap. Let's see where we can reuse each
>> others work.
>
> Sounds good.
>
>> (I am e.g. very interested in array delinearization or in how you pass loop
>> pragma-annotations to LLVM-IR)
>
> I'm a long ways from doing these things,
> so don't have much to say yet.
> It's just the direction I'm headed.

Just keep me up to date.

Tobi