[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Tobias Grosser
tobias at grosser.es
Mon Feb 4 12:57:23 PST 2013
On 02/04/2013 07:42 PM, Preston Briggs wrote:
>
>
> ---------- Forwarded message ----------
> From: Tobias Grosser <tobias at grosser.es>
> To: Dmitry Mikushin <dmitry at kernelgen.org>
> Cc: polly-dev <polly-dev at googlegroups.com>, LLVM Developers Mailing
> List <llvmdev at cs.uiuc.edu>
> Date: Mon, 04 Feb 2013 17:53:11 +0100
> Subject: Re: [LLVMdev] [Polly] Parallelizing outer loop containing
> inner reduction loop
> On 02/04/2013 04:55 PM, Dmitry Mikushin wrote:
>
> Hi Tobi,
>
> Thanks for looking into this!
>
> 2013/2/4 Tobias Grosser <tobias at grosser.es>
>
> In any case, you seem to have somehow convinced Polly to
> accept this code. Would you mind sharing what you did?
>
>
> Sure. Aliasing is simply ignored. Instead, we have substituted
> pointers and sizes for arrays and added a special pass that
> converts memory accesses from every SCoP statement into ISL
> general form.
>
>
> Interesting, why did you do this?
>
> Sorry, we are quite far from the standard Polly invocation
> process; maybe I should prepare some simplified plugin for
> testing purposes...
>
>
> Or at least state that this is not standard polly. ;-) Then I would
> not keep trying to reproduce the problem.
>
> Regarding your problem. As far as I understand, the problem is
> that the following code:
>
>     for (i)
>       A[i] = 0
>       for (j)
>         A[i] += ...
>       ... = A[i]
>
> is changed by gcc (and other optimizations) to:
>
>     for (i)
>       A[i] = 0
>       tmp = A[i]
>       for (j)
>         tmp += ...
>       A[i] = tmp
>       ... = A[i]
>
>
> Yes, exactly!
>
>
> This is a common optimization that unfortunately introduces a lot
> of dependences on tmp that block any direct parallelization. To
> parallelize the loop anyway, we would need to expand the memory of
> tmp, such that each parallel thread has a private copy of 'tmp'.
> Deciding where and how to expand memory to enable further
> transformations is difficult in general, such that I would
> normally run Polly before such optimizations are performed.
> Though, in your case it may still be possible to parallelize the
> loop. To do this, you would need to ignore all dependences that
> can be removed by creating thread-private copies of 'tmp'. If you
> are interested in having this implemented, either open a bug
> report or give it a try yourself. I am happy to assist.
>
>
> Hm, how would one create thread-private copies of tmp at that
> point, and how useful would it be?
>
>
> Many auto-vectorizing compilers do such a thing (sometimes called
> "scalar expansion") as a way to attack loops that are otherwise parallel.
>
> There are different approaches to creating thread-private copies.
> You could aim to get back to the code in '1.ll' by creating an
> array with as many elements as there are iterations in the loop.
> That should be very general and independent of the
> platform-specific parallel implementation. However, such a general
> implementation may introduce a lot of additional memory, which
> requires us to be careful with such transformations.
>
>
> A better solution is to create an array with as many elements as there
> are threads, or, in the case of vectorization, as many elements as the
> length of the vector. Much less expensive!
Very true. If possible we should go this way.
Tobi