[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Tobias Grosser tobias at grosser.es
Tue Aug 20 23:15:59 PDT 2013


On 08/19/2013 09:43 AM, Star Tan wrote:
> At 2013-08-17 23:22:32,"Tobias Grosser" <tobias at grosser.es> wrote:
>> On 08/17/2013 12:08 AM, Star Tan wrote:
>>> At 2013-08-16 22:32:30,"Tobias Grosser" <tobias at grosser.es> wrote:
>>>>>
>>>>> Yes, I have changed the original code to the form you suggested:
>>>>>      for (i
>>>>>          for (j
>>>>>              ...
>>>>>                   x=1
>>>>
>>>> Sorry, I meant
>>>>                    x[0] +=
>>>>
>>>
>>>
>>> It is interesting that Polly runs much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.
>>
>> Indeed, this is interesting.
>
> Experiments show that this is because the resulting data dependences are different.
>
> 1) When the loop statement is an array operation such as "X[0]+=1", clang generates a load and a store instruction within the same basic block, like this:
>      for (i=0; i<n; i++)
>          for (j=0; j<n; j++)
>              for (k=0; k<n; k++) {
>                  load  %a, memory(A,0)
>                  add   %a, 1
>                  store %a, memory(A,0)
>              }
>
> Such a data dependence can easily be analysed by the polly-dependence pass.
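>
> For illustration (a hand-written sketch, not actual polly-dependence
> output; S names the single statement in the loop body), the only flow
> dependence is from each iteration to the lexicographically next one:
>
> 	S[i][j][k]     -> S[i][j][k+1]     (k < n-1)
> 	S[i][j][n-1]   -> S[i][j+1][0]     (j < n-1)
> 	S[i][n-1][n-1] -> S[i+1][0][0]     (i < n-1)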
>
> 2) When the loop statement is a scalar operation such as "x+=1", clang generates a series of scalar operations and PHI nodes, like this:
>
> 	x0 = 0
> 	for (i=0; i<n; i++) {
> 		x1 = phi(x0, x6)
> 		for (j=0; j<n; j++) {
> 			x2 = phi(x1, x5)
> 			for (k=0; k<n; k++) {
> 				x3 = phi(x2, x4)
> 				x4 = x3 + 1
> 			}
> 			x5 = phi(x2, x4)
> 		}
> 		x6 = phi(x1, x5)
> 	}
> 	x7 = phi(x0, x6)
>
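> To make these dependences concrete (again a hand-written sketch,
> assuming n > 0 and writing xN[i][j][k] for the value xN takes in
> iteration (i,j,k)), the innermost level alone already carries:
>
> 	x2[i][j]      -> x3[i][j][0]       (loop entry)
> 	x4[i][j][k-1] -> x3[i][j][k]       (k > 0, loop-carried)
> 	x3[i][j][k]   -> x4[i][j][k]
> 	x4[i][j][n-1] -> x5[i][j]          (loop exit)
>
> and every additional loop level contributes two more PHI statements
> with their own entry and exit dependences.
>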
> As we can see, the data dependences become very complex as the depth of the loop nest increases, which in turn slows down the polly-dependence analysis. Unfortunately, I have not found a way to simplify such data dependences. Do you have any idea?

See below.

>>> First, if the nested loop only contains the statement "X[0]+=1", like this:
>> [...]
>>
>>> The compile times of the polly-dependence analysis for the three cases are:
>>> loop with "X[0]+=1": 0.130s
>>> loop with "X[0] =1": 0.053s
>>> loop with "x=1":     1.037s
>>
>> OK. So it seems that in cases where reductions are converted to PHI
>> nodes, we see much slower compiles. This is possibly due to the
>> increased number of statements as well as the increased number of
>> data locations modelled.
>>
>> I see two directions we can approach this problem with:
>>
>> 1) Optimize isl to work faster
>>
>> We can see if there are some optimization opportunities in isl to make
>> the code faster
>
> As shown in the example above, the key problem is the complexity of the data dependences, so I do not think it is time to turn to ISL now.

Agreed.

>> 2) Generate a less complicated SCoP
>>
>> If possible, detect that the SCoP can be modelled with fewer
>> statements and fewer memory locations, and do so.
>
> Do you mean we should detect smaller SCoPs, or represent the same SCoP with fewer statements and fewer memory locations?
> For the sake of execution performance, we should detect large SCoPs to enable more Polly optimizations.

I mean we should detect the same SCoP, but we should check if there is 
an easy way to understand that those PHI nodes basically just model a 
reduction, and that we can represent this reduction with fewer SCoP 
statements (which in turn yields fewer dependences).
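
As a minimal sketch of that idea (hand-written; x_red is a made-up name 
for a virtual reduction location, not an existing Polly interface): once 
the PHI web is recognized as a reduction over a single value, the whole 
nest could be modelled as if it were the array version again:

	/* conceptual SCoP after reduction detection: one statement,
	   one memory location, one simple self-dependence */
	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			for (k = 0; k < n; k++)
				S: x_red[0] += 1;

Instead of 2*depth + 2 scalar definitions, each modelled as a separate 
statement, the dependence problem handed to isl would then be the same 
simple one as in the "X[0]+=1" case.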

Sorry, I am pretty busy for the next couple of days. Please feel free 
to look into this yourself. Maybe you have an idea.

Tobi


