[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Tobias Grosser
tobias at grosser.es
Mon Feb 4 12:57:23 PST 2013
On 02/04/2013 07:42 PM, Preston Briggs wrote:
>
>
> ---------- Forwarded message ----------
> From: Tobias Grosser <tobias at grosser.es>
> To: Dmitry Mikushin <dmitry at kernelgen.org>
> Cc: polly-dev <polly-dev at googlegroups.com>, LLVM Developers Mailing
> List <llvmdev at cs.uiuc.edu>
> Date: Mon, 04 Feb 2013 17:53:11 +0100
> Subject: Re: [LLVMdev] [Polly] Parallelizing outer loop containing
> inner reduction loop
> On 02/04/2013 04:55 PM, Dmitry Mikushin wrote:
>
> Hi Tobi,
>
> Thanks for looking into this!
>
> 2013/2/4 Tobias Grosser <tobias at grosser.es>
>
> In any case, you seem to have somehow convinced Polly to
> accept this code. Would you mind sharing what you did?
>
>
> Sure. Aliasing is simply ignored. Instead, we have substituted
> pointers and sizes for arrays and added a special pass that
> converts memory accesses from every SCoP statement into ISL
> general form.
>
>
> Interesting, why did you do this?
>
> Sorry, we are quite far from the standard Polly invocation
> process; maybe I should prepare some simplified plugin for
> testing purposes...
>
>
> Or at least state that this is not standard polly. ;-) Then I would
> not keep trying to reproduce the problem.
>
> Regarding your problem. As far as I understand, the problem is
> that the following code:
>
>     for (i)
>       A[i] = 0
>       for (j)
>         A[i] += ...
>       ... = A[i]
>
> is changed by gcc (and other optimizations) to:
>
>     for (i)
>       A[i] = 0
>       tmp = A[i]
>       for (j)
>         tmp += ...
>       A[i] = tmp
>       ... = A[i]
>
>
> Yes, exactly!
>
>
> This is a common optimization that unfortunately introduces a lot
> of dependences on tmp that block any direct parallelization. To
> parallelize the loop anyway, we would need to expand the memory of
> tmp, such that each parallel thread has a private copy of 'tmp'.
> Deciding where and how to expand memory to enable further
> transformations is difficult in general, such that I would
> normally run Polly before such optimizations are performed.
> Though, in your case it may still be possible to parallelize the
> loop. To do this, you would need to ignore all dependences that
> can be removed by creating thread-private copies of 'tmp'. If you
> are interested in having this implemented, either open a bug
> report or give it a try yourself. I am happy to assist.
>
>
> Hm, how would one create thread-private copies of tmp at that
> point, and how useful would it be?
>
>
> Many auto-vectorizing compilers do such a thing (sometimes called
> "scalar expansion") as a way to attack loops that are otherwise parallel.
>
> There are different approaches to creating thread-private copies.
> You could aim to get back to the code in '1.ll' by creating an
> array with as many elements as there are iterations in the loop.
> That should be very general and independent of the
> platform-specific parallel implementation. However, such a general
> implementation may introduce a lot of additional memory, which
> requires us to be careful with such transformations.
>
>
> A better solution is to create an array with as many elements as there
> are threads, or, in the case of vectorization, as many elements as the
> length of the vector. Much less expensive!
Very true. If possible we should go this way.
Tobi