[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Tobias Grosser
tobias at grosser.es
Tue Aug 20 23:15:59 PDT 2013
On 08/19/2013 09:43 AM, Star Tan wrote:
> At 2013-08-17 23:22:32,"Tobias Grosser" <tobias at grosser.es> wrote:
>> On 08/17/2013 12:08 AM, Star Tan wrote:
>>> At 2013-08-16 22:32:30,"Tobias Grosser" <tobias at grosser.es> wrote:
>>>>>
>>>>> Yes, I have changed the original code to the form you suggested:
>>>>> for (i
>>>>>   for (j
>>>>>     ...
>>>>>       x=1
>>>>
>>>> Sorry, I meant
>>>> x[0] +=
>>>>
>>>
>>>
>>> It is interesting that Polly runs much faster if we change "x=0" to "X[0]=0" or "X[0]+=0" in the nested loops.
>>
>> Indeed, this is interesting.
>
> Experiments show that this is because the resulting data dependences are different in the two cases.
>
> 1) When the loop statement is an array operation such as "X[0]+=0", clang generates a load and a store instruction within the same basic block, like this:
> for (i=0;i<n;i++)
>   for (j=0;j<n;j++)
>     for (k=0;k<n;k++){
>       load %a, memory(A,0)
>       add %a, 1
>       store %a, memory(A,0)
>     }
>
> Such data dependences can be analysed easily by the polly-dependence pass.
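>
> (For reference, a minimal C version of this case might look as
> follows; function and variable names are illustrative only:)
>
>   void sum(int n, int A[]) {
>     for (int i = 0; i < n; i++)
>       for (int j = 0; j < n; j++)
>         for (int k = 0; k < n; k++)
>           A[0] += 1;   /* load, add, store stay in one basic block */
>   }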
>
> 2) When the loop statement is a scalar operation such as "x+=1", clang generates a series of scalar operations and PHI nodes like this:
>
> x0 = 0
> for (i=0; i<n; i++) {
>   x1 = phi(x0, x6)
>   for (j=0; j<n; j++) {
>     x2 = phi(x1, x5)
>     for (k=0; k<n; k++) {
>       x3 = phi(x2, x4)
>       x4 = x3 + 1
>     }
>     x5 = phi(x2, x4)
>   }
>   x6 = phi(x1, x5)
> }
> x7 = phi(x0, x6)
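>
> (For reference, a minimal C source that yields this PHI chain once
> mem2reg promotes x to a register; names are illustrative only:)
>
>   int count(int n) {
>     int x = 0;
>     for (int i = 0; i < n; i++)
>       for (int j = 0; j < n; j++)
>         for (int k = 0; k < n; k++)
>           x += 1;   /* promoted scalar: one phi per loop entry/exit */
>     return x;
>   }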
>
> As we can see, the data dependences become much more complex as the depth of the loop nest increases, which slows down the polly-dependence analysis. Unfortunately, I have not found a way to simplify such data dependences. Do you have any idea?
See below.
>>> First, if the nested loop contains only the single statement "X[0]+=0", like this:
>> [...]
>>
>>> The compile times of the polly-dependence analysis for the three cases are:
>>> loop with "X[0]+=1": 0.130s
>>> loop with "X[0] =1": 0.053s
>>> loop with "x=1":     1.037s
>>
>> OK. So it seems that when reductions are converted to PHI nodes, we see
>> much slower compile times. This is possibly due to the increased number
>> of statements as well as the increased number of data locations modelled.
>>
>> I see two directions we can approach this problem with:
>>
>> 1) Optimize isl to work faster
>>
>> We can see if there are some optimization opportunities in isl to make
>> the code faster
>
> As shown in the example above, the key problem is the complexity of the application's data dependences, so I do not think it is time to turn to isl yet.
Agreed.
>> 2) Generate a less complicated SCoP
>>
>> If possible, detect that the SCoP can be modelled with less statements
>> and less memory locations and do so.
>
> Do you mean we should detect smaller SCoPs, or represent the same SCoP with fewer statements and fewer memory locations?
> For the sake of execution performance, we should detect large SCoPs to enable more Polly optimizations.
I mean we should detect the same SCoP, but we should check whether there
is an easy way to recognize that those PHI nodes basically just model a
reduction, so that we can represent this reduction with fewer SCoP
statements (which in turn yields fewer dependences).
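As a rough sketch of the idea (illustrative only, this is not existing
Polly functionality): once the PHI chain is recognized as a reduction
over a single scalar, the SCoP could model the loop nest as if the
scalar were a single memory cell, i.e. exactly the fast "X[0]+=1" case
from above:

   /* Conceptual model only: treating the reduction scalar as one
      memory location yields one statement and one data location,
      so the dependences stay as simple as in the array case. */
   int sum(int n) {
     int x_cell[1];   /* stand-in for the promoted scalar x */
     x_cell[0] = 0;
     for (int i = 0; i < n; i++)
       for (int j = 0; j < n; j++)
         for (int k = 0; k < n; k++)
           x_cell[0] += 1;
     return x_cell[0];
   }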
Sorry, I am pretty busy for the next couple of days. Please feel free to
look into this yourself. Maybe you have an idea.
Tobi