Hi,<br><br>Here's another case, different at a high level but similar at a low level. When a Fortran allocatable array is defined in a module, its actual dimensions are kept in an internal structure. The loads that read these dimensions confuse Polly on any use of the array.<br>
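<br>To make the difference concrete, here is a minimal sketch of the two index computations that appear in the IR diff in this message, written as plain arithmetic. The constants 4096, 64 and -4160 and the two load addresses are taken from that diff; the descriptor dictionary, its field names, and the assumption that the two loaded values are the strides 4096 and 64 are hypothetical and only serve the illustration.<br>

```python
# Sketch only: models the index arithmetic from the IR diff, not real compiler output.

N = 64  # trip count of each loop, from the "icmp eq ..., 64" exit conditions


def index_constant(i, j, k):
    """Working case: the strides are compile-time constants."""
    return 4096 * i + 64 * j + k


def index_from_descriptor(i, j, k, descriptor):
    """Failing case: the strides are re-read from memory inside the loop nest."""
    stride_outer = descriptor["stride_outer"]    # models the load from 47280713800
    stride_middle = descriptor["stride_middle"]  # models the load from 47280713776
    return stride_outer * (i + 1) - 4160 + stride_middle * (j + 1) + k


# Hypothetical descriptor contents for a 64x64x64 array (an assumption).
DESCRIPTOR = {"stride_outer": 4096, "stride_middle": 64}


def forms_agree(descriptor=DESCRIPTOR):
    """Check that both forms address the same element for every iteration."""
    return all(
        index_constant(i, j, k) == index_from_descriptor(i, j, k, descriptor)
        for i in range(N)
        for j in range(N)
        for k in range(N)
    )
```

With those assumed stride values the two forms are identical, since 4096(i+1) - 4160 + 64(j+1) + k = 4096i + 64j + k. The access only becomes affine once the loaded values are known constants, which is exactly what the analysis cannot see at this point.<br>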

<br>Attachments:<br>1) Sample Fortran source code (compile with and without -DMODULE to get the failing and working versions, respectively).<br>2) LLVM IR for both cases right before Polly analysis (initialization loop "array = 2").<br>

<br>Below is a diff for a quick look:<br><br>marcusmae@M17xR4:~/forge/kernelgen/tests/behavior/module_array$ diff -u array.loop.9.ll module_array.loop.9.ll <br>--- array.loop.9.ll    2013-01-04 01:37:40.312259953 +0100<br>+++ module_array.loop.9.ll    2013-01-04 01:37:50.036259544 +0100<br>

@@ -12,34 +12,39 @@<br>   br label %"10.cloned"<br> <br> "16.cloned":                                      ; preds = %"15.cloned"<br>-  %1 = add i64 %indvar3, 1<br>-  %exitcond12 = icmp eq i64 %1, 64<br>

-  br i1 %exitcond12, label %"17.exitStub", label %"10.cloned"<br>+  %1 = add i64 %indvar1, 1<br>+  %exitcond9 = icmp eq i64 %1, 64<br>+  br i1 %exitcond9, label %"17.exitStub", label %"10.cloned"<br>

 <br> "10.cloned":                                      ; preds = %"Loop Function Root.split", %"16.cloned"<br>-  %indvar3 = phi i64 [ 0, %"Loop Function Root.split" ], [ %1, %"16.cloned" ]<br>

-  %2 = mul i64 %indvar3, 4096<br>+  %indvar1 = phi i64 [ 0, %"Loop Function Root.split" ], [ %1, %"16.cloned" ]<br>+  %indvar.next2 = add i64 %indvar1, 1<br>+  %2 = load i64* inttoptr (i64 47280713800 to i64*), align 8<br>

+  %3 = mul i64 %2, %indvar.next2<br>+  %4 = add i64 %3, -4160<br>   br label %"12.cloned"<br> <br> "15.cloned":                                      ; preds = %"14.cloned"<br>-  %3 = add i64 %indvar1, 1<br>

-  %exitcond = icmp eq i64 %3, 64<br>+  %5 = add i64 %indvar3, 1<br>+  %exitcond = icmp eq i64 %5, 64<br>   br i1 %exitcond, label %"16.cloned", label %"12.cloned"<br> <br> "12.cloned":                                      ; preds = %"10.cloned", %"15.cloned"<br>

-  %indvar1 = phi i64 [ 0, %"10.cloned" ], [ %3, %"15.cloned" ]<br>-  %4 = mul i64 %indvar1, 64<br>-  %5 = add i64 %2, %4<br>+  %indvar3 = phi i64 [ 0, %"10.cloned" ], [ %5, %"15.cloned" ]<br>

+  %indvar.next4 = add i64 %indvar3, 1<br>+  %6 = load i64* inttoptr (i64 47280713776 to i64*), align 16<br>+  %7 = mul i64 %6, %indvar.next4<br>+  %8 = add i64 %4, %7<br>   br label %"14.cloned"<br> <br> "14.cloned":                                      ; preds = %"12.cloned", %"14.cloned"<br>

-  %indvar = phi i64 [ 0, %"12.cloned" ], [ %7, %"14.cloned" ]<br>-  %6 = add i64 %5, %indvar<br>-  %scevgep = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %6<br>+  %indvar = phi i64 [ 0, %"12.cloned" ], [ %10, %"14.cloned" ]<br>

+  %9 = add i64 %8, %indvar<br>+  %scevgep = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %9<br>   store float 2.000000e+00, float* %scevgep, align 4<br>-  %7 = add i64 %indvar, 1<br>-  %exitcond9 = icmp eq i64 %7, 64<br>

-  br i1 %exitcond9, label %"15.cloned", label %"14.cloned"<br>+  %10 = add i64 %indvar, 1<br>+  %exitcond7 = icmp eq i64 %10, 64<br>+  br i1 %exitcond7, label %"15.cloned", label %"14.cloned"<br>

 <br> "17.exitStub":                                    ; preds = %"16.cloned"<br>   ret void<br><br><div class="gmail_quote">2013/1/2 Dmitry Mikushin <span dir="ltr"><<a href="mailto:dmitry@kernelgen.org" target="_blank">dmitry@kernelgen.org</a>></span><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Duncan & Tobi,<br><br>Thanks a lot for your interest, and for pointing out differences in GIMPLE I missed.<br><br>

Attached is a simplified test case. Is it good enough?<br><br>Tobi, regarding runtime alias analysis: in KernelGen we already do it, along with runtime value substitution. For example:<br>

<br><span style="font-family:courier new,monospace"><------------------ __kernelgen_main_loop_17: compile started ---------------------><br>    Integer args substituted:<br>        offset = 32, ptrValue = 47248855040<br>


        offset = 40, ptrValue = 47246749696<br>        offset = 48, ptrValue = 47247802368<br>        offset = 16, value = 64<br>        offset = 20, value = 64<br>        offset = 24, value = 64<br>MemoryAccess to pointer: float* inttoptr (i64 47246749696 to float*)<br>


    { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47246749696 to float*)[4096i0 + 64i1 + i2] }<br>        allocSize: 4 storeSize: 4<br>    replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >= 47246749696 + 16384i0 + 256i1 + 4i2 and o0 <= 47246749699 + 16384i0 + 256i1 + 4i2 }<br>


MemoryAccess to pointer: float* inttoptr (i64 47247802368 to float*)<br>    { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47247802368 to float*)[4096i0 + 64i1 + i2] }<br>        allocSize: 4 storeSize: 4<br>    replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >= 47247802368 + 16384i0 + 256i1 + 4i2 and o0 <= 47247802371 + 16384i0 + 256i1 + 4i2 }<br>


MemoryAccess to pointer: float* inttoptr (i64 47248855040 to float*)<br>    { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47248855040 to float*)[4096i0 + 64i1 + i2] }<br>        allocSize: 4 storeSize: 4<br>    replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >= 47248855040 + 16384i0 + 256i1 + 4i2 and o0 <= 47248855043 + 16384i0 + 256i1 + 4i2 }<br>


<br>    Number of good nested parallel loops: 3<br>    Average size of loops: 64 64 64<br><br><------------------------------ Scop: end -----------------------------------><br><br><------------------------------ Scop: start ---------------------------------><br>


<------------------- Cloog AST of Scop -------------------><br>for (c2=0;c2<=63;c2++) {<br>  for (c4=0;c4<=63;c4++) {<br>    for (c6=0;c6<=63;c6++) {<br>      Stmt__12_cloned_(c2,c4,c6);<br>    }<br>  }<br>


}<br><---------------------------------------------------------><br>    Context:<br>    {  :  }<br>    Statements {<br>        Stmt__12_cloned_<br>            Domain :=<br>                { Stmt__12_cloned_[i0, i1, i2] : i0 >= 0 and i0 <= 63 and i1 >= 0 and i1 <= 63 and i2 >= 0 and i2 <= 63 };<br>


            Scattering :=<br>                { Stmt__12_cloned_[i0, i1, i2] -> scattering[0, i0, 0, i1, 0, i2, 0] };<br>            ReadAccess := <br>                { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >= 47246749696 + 16384i0 + 256i1 + 4i2 and o0 <= 47246749699 + 16384i0 + 256i1 + 4i2 };<br>


            ReadAccess := <br>                { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >= 47247802368 + 16384i0 + 256i1 + 4i2 and o0 <= 47247802371 + 16384i0 + 256i1 + 4i2 };<br>            WriteAccess := <br>


                { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >= 47248855040 + 16384i0 + 256i1 + 4i2 and o0 <= 47248855043 + 16384i0 + 256i1 + 4i2 };<br>    }<br><------------------------------ Scop: end -----------------------------------><br>


<------------------------------ Scop: dependences ---------------------------><br>Write after read dependences: <br>    {  }<br>Read after write dependences: <br>    {  }<br>Write after write dependences: <br>    {  }<br>


    loop is parallel<br>        loop is parallel<br>            loop is parallel<br><------------------------------ Scop: dependences end -----------------------><br>1 polly-detect - Number of regions that a valid part of Scop<br>


<------------------ __kernelgen_main_loop_17: compile completed -------------------></span><br><br>It works pretty well in many situations, but in this particular case it does not help. Those problematic "Fortran scalar values referred to by pointers" (FSVRPs) can only be substituted (replaced by their actual values) after proper memory analysis. In the current design, memory analysis operates on SCoPs, but Polly cannot detect a SCoP for the whole group of nested loops in the first place, because of those same FSVRPs. So it is a chicken-and-egg problem.<span class="HOEnZb"><font color="#888888"><br>
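<br>For reference, the replacedBy relations in the log can be read as follows: each element index 4096i0 + 64i1 + i2 is turned into a closed byte range starting at the substituted pointer value. Below is a minimal sketch of that mapping, assuming 4-byte elements (the storeSize printed in the log); the function names are hypothetical and this is not the KernelGen implementation.<br>

```python
# Sketch only: reproduces the arithmetic of the "replacedBy" relations in the log.

ELEM_SIZE = 4  # bytes per float, matching "storeSize: 4" in the log

# Pointer values substituted at runtime, copied from the log.
BASES = (47246749696, 47247802368, 47248855040)


def byte_range(base, i0, i1, i2):
    """Return the inclusive (first, last) byte addresses touched by one access."""
    index = 4096 * i0 + 64 * i1 + i2
    first = base + ELEM_SIZE * index
    return first, first + ELEM_SIZE - 1


def array_extent(base, n=64):
    """Inclusive byte range covered by the whole n x n x n iteration space."""
    return byte_range(base, 0, 0, 0)[0], byte_range(base, n - 1, n - 1, n - 1)[1]


def extents_overlap(a, b):
    """True if two inclusive byte ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Once the pointers are plain integers, checking that the three arrays do not overlap is a comparison of byte ranges, which is consistent with the empty dependence sets reported in the log.<br>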


<br>- D.</font></span><div class="HOEnZb"><div class="h5"><br><br><div class="gmail_quote">2013/1/2 Tobias Grosser <span dir="ltr"><<a href="mailto:tobias@grosser.es" target="_blank">tobias@grosser.es</a>></span><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div><div>On 01/01/2013 02:45 PM, Duncan Sands wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Dmitry,<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

In our compiler we use a modified version of LLVM Polly, which is very<br>

sensitive to<br>

proper code generation. Among a number of limitations, the loop region<br>

(enclosed by phi node on induction variable and branch) is required to<br>

be free<br>

of additional memory-dependent branches. In other words, there must be no<br>

conditional "br" instructions below phi nodes. The problem we are<br>

facing is that<br>

from *identical* GIMPLE for 3d loop used in different contexts<br>

DragonEgg may<br>

generate LLVM IR either conforming to the described limitation, or<br>

violating it.<br>

</blockquote>

<br>

the gimple isn't the same at all (see below).  The differences are directly<br>

reflected in the unoptimized LLVM IR, turning up as additional memory loads<br>

in the "bad" version.  In addition, the Fortran isn't really the same<br>

either:<br>

Fortran semantics allows the compiler to assume that the parameters of your<br>

new function "compute" (which are all passed in by reference, i.e. as<br>

pointers)<br>

do not alias each other or anything else in sight (i.e. they get the<br>

"restrict"<br>

qualifier in the gimple, noalias in the LLVM IR).  Thus by factorizing<br>

the loop<br>

into "compute" you are actually giving the compiler more information.<br>

<br>

Summary:<br>

   (1) as far as I can see the unoptimized LLVM IR is a direct<br>

reflection of<br>

the gimple: the differences for the loop part come directly from<br>

differences<br>

in the gimple;<br>

   (2) the optimizers do a better job when you have "compute" partly<br>

because you<br>

provided them with additional aliasing information; this better optimized<br>

version then gets inlined into MAIN__.<br>

   (3) this leaves the question of whether in the bad version it is<br>

logically<br>

possible for the optimizers to deduce the same aliasing information as is<br>

handed to them for free in the good version.  To work this out it would be<br>

nice to have a smaller testcase.<br>

</blockquote>

<br></div></div>

I would also be interested in a minimal test case. If e.g. only the alias check is missing, we could introduce run-time alias checks such that Polly would be able to optimize both versions. It is probably not as simple, but a reduced test case would make it easier to figure out the exact problems.<br>


<br>

Thanks<br>

Tobi<div><div><br>

<br>


</div></div></blockquote></div><br>

</div></div></blockquote></div><br>