[LLVMdev] [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?

Sat Feb 9 14:46:53 PST 2013

Hi,

So, from LLVM IR point of view, Polly is confused by loads of loops
dimensions right from the corresponding loops headers:

"162":                                            ; preds = %"168",
%"162.preheader"
  %391 = phi i32 [ %437, %"168" ], [ 1, %"162.preheader" ]
  %392 = load i32* %ny, align 4

"163":                                            ; preds = %"166",
%"163.preheader"
  %400 = phi i32 [ %435, %"166" ], [ 1, %"163.preheader" ]
  %401 = load i32* %nx, align 4

In aliasing_f.f90 (2D loop case) problematic nx load could be taken out of
the loop PHI-node block using the following sequence of LLVM optimizations:

                {
                        PassManager manager;
                        manager.add(new TargetData(m));
                        manager.add(createBasicAliasAnalysisPass());
                        manager.add(createInstructionCombiningPass());
                        manager.add(createCFGSimplificationPass());
                        manager.add(createScalarReplAggregatesPass());
                        manager.add(createSimplifyLibCallsPass());
                        manager.add(createInstructionCombiningPass());
                        manager.add(createCFGSimplificationPass());
                        manager.add(createLoopRotatePass());
                        manager.add(createLICMPass());
                        manager.run(*m);
                }

Unfortunately, this helps only for 2D loops. Even LLVM's -O3 is unable to
pull loads out of 3D loop (the case of original demo_f.f90 test). The only
solutions seems to be

dragonegg -O1 -fplugin-arg-dragonegg-enable-gcc-optzns demo_f.f90

I.e. only adding some of GCC's own optimization helps.

- D.

2013/1/2 Dmitry Mikushin <dmitry at kernelgen.org>

> Hi Duncan & Tobi,
>
> Thanks a lot for your interest, and for pointing out differences in GIMPLE
> I missed.
>
> Attached is simplified test case. Is it good?
>
> Tobi, regarding runtime alias analysis: in KernelGen we already do it
> along with runtime values substitution. For example:
>
> <------------------ __kernelgen_main_loop_17: compile started
> --------------------->
>     Integer args substituted:
>         offset = 32, ptrValue = 47248855040
>         offset = 40, ptrValue = 47246749696
>         offset = 48, ptrValue = 47247802368
>         offset = 16, value = 64
>         offset = 20, value = 64
>         offset = 24, value = 64
> MemoryAccess to pointer: float* inttoptr (i64 47246749696 to float*)
>     { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47246749696 to
> float*)[4096i0 + 64i1 + i2] }
>         allocSize: 4 storeSize: 4
>     replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >=
> 47246749696 + 16384i0 + 256i1 + 4i2 and o0 <= 47246749699 + 16384i0 + 256i1
> + 4i2 }
> MemoryAccess to pointer: float* inttoptr (i64 47247802368 to float*)
>     { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47247802368 to
> float*)[4096i0 + 64i1 + i2] }
>         allocSize: 4 storeSize: 4
>     replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >=
> 47247802368 + 16384i0 + 256i1 + 4i2 and o0 <= 47247802371 + 16384i0 + 256i1
> + 4i2 }
> MemoryAccess to pointer: float* inttoptr (i64 47248855040 to float*)
>     { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47248855040 to
> float*)[4096i0 + 64i1 + i2] }
>         allocSize: 4 storeSize: 4
>     replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >=
> 47248855040 + 16384i0 + 256i1 + 4i2 and o0 <= 47248855043 + 16384i0 + 256i1
> + 4i2 }
>
>     Number of good nested parallel loops: 3
>     Average size of loops: 64 64 64
>
> <------------------------------ Scop: end
> ----------------------------------->
>
> <------------------------------ Scop: start
> --------------------------------->
> <------------------- Cloog AST of Scop ------------------->
> for (c2=0;c2<=63;c2++) {
>   for (c4=0;c4<=63;c4++) {
>     for (c6=0;c6<=63;c6++) {
>       Stmt__12_cloned_(c2,c4,c6);
>     }
>   }
> }
> <--------------------------------------------------------->
>     Context:
>     {  :  }
>     Statements {
>         Stmt__12_cloned_
>             Domain :=
>                 { Stmt__12_cloned_[i0, i1, i2] : i0 >= 0 and i0 <= 63 and
> i1 >= 0 and i1 <= 63 and i2 >= 0 and i2 <= 63 };
>             Scattering :=
>                 { Stmt__12_cloned_[i0, i1, i2] -> scattering[0, i0, 0, i1,
> 0, i2, 0] };
>             ReadAccess :=
>                 { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >=
> 47246749696 + 16384i0 + 256i1 + 4i2 and o0 <= 47246749699 + 16384i0 + 256i1
> + 4i2 };
>             ReadAccess :=
>                 { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >=
> 47247802368 + 16384i0 + 256i1 + 4i2 and o0 <= 47247802371 + 16384i0 + 256i1
> + 4i2 };
>             WriteAccess :=
>                 { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >=
> 47248855040 + 16384i0 + 256i1 + 4i2 and o0 <= 47248855043 + 16384i0 + 256i1
> + 4i2 };
>     }
> <------------------------------ Scop: end
> ----------------------------------->
> <------------------------------ Scop: dependences
> --------------------------->
> Write after read dependences:
>     {  }
> Read after write dependences:
>     {  }
> Write after write dependences:
>     {  }
>     loop is parallel
>         loop is parallel
>             loop is parallel
> <------------------------------ Scop: dependences end
> ----------------------->
> 1 polly-detect - Number of regions that a valid part of Scop
> <------------------ __kernelgen_main_loop_17: compile completed
> ------------------->
>
> It works pretty well in many situations, but in this particular case it
> does not help. Those problematic "Fortran scalar values referred by
> pointers" (FSVRPs) can only substituted (replaced by actual value) after
> proper memory analysis. According to current design, memory analysis
> operates on SCoPs, but Polly is already unable to detect SCoP for the whole
> group of nested loops due to presence of those FSVRPs. So, chicken and egg
> problem.
>
> - D.
>
>
> 2013/1/2 Tobias Grosser <tobias at grosser.es>
>
>> On 01/01/2013 02:45 PM, Duncan Sands wrote:
>>
>>> Hi Dmitry,
>>>
>>>
>>>> In our compiler we use a modified version LLVM Polly, which is very
>>>> sensitive to
>>>> proper code generation. Among the number of limitations, the loop region
>>>> (enclosed by phi node on induction variable and branch) is required to
>>>> be free
>>>> of additional memory-dependent branches. In other words, there must be
>>>> no
>>>> conditional "br" instructions below phi nodes. The problem we are
>>>> facing is that
>>>> from *identical* GIMPLE for 3d loop used in different contexts
>>>> DragonEgg may
>>>> generate LLVM IR either conforming the described limitation, or
>>>> violating it.
>>>>
>>>
>>> the gimple isn't the same at all (see below).  The differences are
>>> directly
>>> reflected in the unoptimized LLVM IR, turning up as additional memory
>>> loads
>>> in the "bad" version.  In addition, the Fortran isn't really the same
>>> either:
>>> Fortran semantics allows the compiler to assume that the parameters of
>>> your
>>> new function "compute" (which are all passed in by reference, i.e. as
>>> pointers)
>>> do not alias each other or anything else in sight (i.e. they get the
>>> "restrict"
>>> qualifier in the gimple, noalias in the LLVM IR).  Thus by factorizing
>>> the loop
>>> into "compute" you are actually giving the compiler more information.
>>>
>>> Summary:
>>>    (1) as far as I can see the unoptimized LLVM IR is a direct
>>> reflection of
>>> the gimple: the differences for the loop part come directly from
>>> differences
>>> in the gimple;
>>>    (2) the optimizers do a better good when you have "compute" partly
>>> because you
>>> provided them with additional aliasing information; this better optimized
>>> version then gets inlined into MAIN__.
>>>    (3) this leaves the question of whether in the bad version it is
>>> logically
>>> possible for the optimizers to deduce the same aliasing information as is
>>> handed to them for free in the good version.  To work this out it would
>>> be
>>> nice to have a smaller testcase.
>>>
>>
>> I would also be interested in a minimal test case. If e.g. only the alias
>> check is missing, we could introduce run-time alias checks such that Polly
>> would be able to optimize both versions. It is probably not as simple, but
>> a reduced test case would make it easier to figure out the exact problems.
>>
>> Thanks
>> Tobi
>>
>>
>> ______________________________**_________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/**mailman/listinfo/llvmdev<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130209/36abf4d9/attachment.html>