[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Star Tan tanmx_star at yeah.net
Mon Aug 19 09:43:25 PDT 2013


At 2013-08-17 23:22:32,"Tobias Grosser" <tobias at grosser.es> wrote:
>On 08/17/2013 12:08 AM, Star Tan wrote:
>> At 2013-08-16 22:32:30,"Tobias Grosser" <tobias at grosser.es> wrote:
>>>>
>>>> Yes, I have changed the original code to the form you suggested:
>>>>     for (i
>>>>         for (j
>>>>             ...
>>>>                  x=1
>>>
>>> Sorry, I meant
>>>                   x[0] +=
>>>
>>
>>
>> It is interesting that Polly would run much faster if we change the "x=1" to "X[0]=1" or "X[0]+=1" in the nested loops.
>
>Indeed, this is interesting.

Experiments show that this is because the two forms lead to very different dependence results.

1) When the loop statement is an array operation such as "X[0]+=1", clang generates a load and a store instruction within the same basic block, like this:
	for (i=0; i<n; i++)
		for (j=0; j<n; j++)
			for (k=0; k<n; k++) {
				load  %a, memory(A,0)
				add   %a, 1
				store %a, memory(A,0)
			}

Such a data dependence can be easily analysed by the polly-dependence pass.
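
For reference, a minimal C version of this case could look as follows. This is a sketch reconstructed from the discussion, not necessarily the exact benchmark used for the measurements; the function name, element type, and loop bounds are my assumptions:

	/* Triple loop nest whose body is an array reduction. X[0] is
	 * loaded, incremented, and stored back on every iteration, so
	 * every access refers to one statically known memory location. */
	void array_reduction(int n, float X[1]) {
		int i, j, k;
		for (i = 0; i < n; i++)
			for (j = 0; j < n; j++)
				for (k = 0; k < n; k++)
					X[0] += 1;
	}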

2) When the loop statement is a scalar operation such as "x+=1", clang generates a series of scalar operations and PHI nodes like this:

	x0=0
	for (i=0; i<n; i++) {
		x1 = phi(x0, x6)
		for (j=0; j<n; j++) {
			x2 = phi(x1, x5)
			for (k=0; k<n; k++) {
				x3 = phi(x2, x4)
				x4 = x3 + 1
			}
			x5 = phi(x2, x4)
		}
		x6 = phi(x1, x5)
	}
	x7 = phi(x0, x6)

As we can see, the data dependences become very complex as the depth of the loop nest increases, which slows down the polly-dependence analysis. Unfortunately, I have not found a way to simplify such data dependences. Do you have any ideas?
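
For comparison, the C source that produces this PHI chain would be roughly the following (again a sketch under the same assumptions; the only change is that the accumulator is a local scalar, which the mem2reg pass promotes into the PHI nodes shown above):

	/* Scalar-reduction variant. Because x is kept in a register
	 * rather than in memory, SSA construction introduces one pair
	 * of PHI nodes per loop level. Returning x keeps it alive. */
	int scalar_reduction(int n) {
		int i, j, k, x = 0;
		for (i = 0; i < n; i++)
			for (j = 0; j < n; j++)
				for (k = 0; k < n; k++)
					x += 1;
		return x;
	}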


>> First, if the nested loop only contains a statement "X[0]+=1", like this:
>[...]
>
>> The compile time of polly-dependence analysis for the three cases would be:
>> loop with "X[0]+=1": 0.130s
>> loop with "X[0]=1":  0.053s
>> loop with "x=1":     1.037s
>
>OK. So it seems in case reductions are converted to PHI nodes, we see a 
>lot slower compiles. This is possibly due to the increased number of 
>statements as well as the increased number of data-locations modelled.
>
>I see two directions we can approach this problem with:
>
>1) Optimize isl to work faster
>
>We can see if there are some optimization opportunities in isl to make 
>the code faster

As shown in the example above, the key problem is the complexity of the data dependences in the application itself, so I do not think it is time to turn to isl yet.

>
>2) Generate a less complicated SCoP
>
>If possible, detect that the SCoP can be modelled with less statements 
>and less memory locations and do so.

Do you mean we should detect smaller SCoPs, or represent the same SCoP with fewer statements and fewer memory locations?
For the sake of execution performance, we should detect large SCoPs to enable more Polly optimizations.

>I did not look into either of the two, so those are basically open 
>questions (that may not be easy to solve). If you are interested, you
>may want to take a look.

Sure, I would like to investigate whether they can indeed reduce Polly's compile time.

Thanks for your suggestions!
Star Tan