[polly] r233675 - Do not scale tile loops

Tue Mar 31 08:57:57 PDT 2015

On 03/31/2015 09:52 AM, Tobias Grosser wrote:
> Author: grosser
> Date: Tue Mar 31 02:52:36 2015
> New Revision: 233675
>
> URL: http://llvm.org/viewvc/llvm-project?rev=233675&view=rev
> Log:
> Do not scale tile loops
>
> We now generate tile loops as:
>
>   for (int c1 = 0; c1 <= 47; c1 += 1)
>     for (int c2 = 0; c2 <= 47; c2 += 1)
>       for (int c3 = 0; c3 <= 31; c3 += 1)
>         for (int c4 = 0; c4 <= 31; c4 += 4)
>           #pragma simd
>           for (int c5 = c4; c5 <= c4 + 3; c5 += 1)
>             Stmt_for_body3(32 * c1 + c3, 32 * c2 + c5);
>
> instead of
>
>   for (int c1 = 0; c1 <= 1535; c1 += 32)
>     for (int c2 = 0; c2 <= 1535; c2 += 32)
>       for (int c3 = 0; c3 <= 31; c3 += 1)
>         for (int c4 = 0; c4 <= 31; c4 += 4)
>           #pragma simd
>           for (int c5 = c4; c5 <= c4 + 3; c5 += 1)
>             Stmt_for_body3(c1 + c3, c2 + c5);
>
> Run-time performance-wise this makes little difference, but this gives a large
> reduction in compile time (10-30% on 17 LNT benchmarks). Apparently the isl
> AST generator is not yet very efficient in generating the latter.
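
To check that the two loop nests above really execute the statement on the same (i, j) pairs in the same order, here is a minimal sketch in C. The names Trace, record, run_unscaled, run_scaled and same_schedule are hypothetical helpers made up for this illustration, and record is only a stand-in for Stmt_for_body3; none of this is Polly or isl code.

```c
#include <stdint.h>

/* Order-sensitive running hash of the (i, j) pairs a loop nest visits.
   Hypothetical helper type, not part of Polly or isl. */
typedef struct {
  uint64_t hash;
  uint64_t count;
} Trace;

/* Stand-in for Stmt_for_body3: only records the indices it is called with. */
void record(Trace *t, int i, int j) {
  t->hash = t->hash * 1099511628211ULL + (uint64_t)i * 1536u + (uint64_t)j;
  t->count += 1;
}

/* New form: c1 and c2 count tiles, the factor 32 moves into the statement. */
Trace run_unscaled(void) {
  Trace t = {14695981039346656037ULL, 0};
  for (int c1 = 0; c1 <= 47; c1 += 1)
    for (int c2 = 0; c2 <= 47; c2 += 1)
      for (int c3 = 0; c3 <= 31; c3 += 1)
        for (int c4 = 0; c4 <= 31; c4 += 4)
          for (int c5 = c4; c5 <= c4 + 3; c5 += 1)
            record(&t, 32 * c1 + c3, 32 * c2 + c5);
  return t;
}

/* Old form: c1 and c2 step by the tile size and hold the tile origin. */
Trace run_scaled(void) {
  Trace t = {14695981039346656037ULL, 0};
  for (int c1 = 0; c1 <= 1535; c1 += 32)
    for (int c2 = 0; c2 <= 1535; c2 += 32)
      for (int c3 = 0; c3 <= 31; c3 += 1)
        for (int c4 = 0; c4 <= 31; c4 += 4)
          for (int c5 = c4; c5 <= c4 + 3; c5 += 1)
            record(&t, c1 + c3, c2 + c5);
  return t;
}

/* Returns 1 if both nests visit the same number of points and produce
   the same order-sensitive hash, 0 otherwise. */
int same_schedule(void) {
  Trace a = run_unscaled();
  Trace b = run_scaled();
  return a.count == b.count && a.hash == b.hash;
}
```

Calling same_schedule() from a small main should return 1, since the old tile iterators are just 32 times the new ones. The only difference is where the multiplication by the tile size happens, which fits the observation that run-time performance barely changes.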

On our LNT system the improvement is a lot smaller, but still clearly visible:

http://llvm.org/perf/db_default/v4/nts/25530?num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=median&MW_confidence_lv=0.01&compare_to=25507&submit=Update

And here is -O3 -polly compared with -O3:

Before:

http://llvm.org/perf/db_default/v4/nts/25507?num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=median&MW_confidence_lv=0.01&compare_to=25525&submit=Update

After:

http://llvm.org/perf/db_default/v4/nts/25530?num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=median&MW_confidence_lv=0.01&compare_to=25525&submit=Update

In the big picture the overhead of Polly is only 3-5%, and individual
loop kernels commonly compile in 1-2 seconds at most. For those
individual loop kernels, however, we still see a compile-time increase
of 2-8x. This is a combination of isl using the slow imath library, our
code versioning (which means more code to compile), and a bug in LICM.

There is still some low(er)-hanging fruit to pick, which should
especially benefit the single-loop-only benchmark programs.

Cheers,
Tobias