[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop

Dmitry Mikushin dmitry at kernelgen.org
Sun Feb 3 09:55:11 PST 2013


Oops, sorry for the message title, making it more descriptive now...

2013/2/3 Dmitry Mikushin <dmitry at kernelgen.org>

> Dear all,
>
> Yesterday, in a customer's code, I came across a relatively simple case
> where Polly is unable to detect a parallelizable outer loop. Consider two
> versions of a Fortran routine:
>
> 1)
>
> subroutine filter(H, szh, X, szx, Y, szy)
>
>   implicit none
>   integer(kind=IKIND), intent(in) :: szh, szx, szy
>   real(kind=RKIND), intent(in) :: H(szh), X(szx)
>   real(kind=RKIND), intent(out) :: Y(szy)
>   integer(kind=IKIND) :: i, j, i1, i2
>   integer, parameter :: rkind = RKIND
>
>   do i = 1, szy
>     Y(i) = 0.0_rkind
>     do j = 1, szh
>       Y(i) = Y(i) + X(i + j - 1) * H(j)
>     enddo
>   enddo
>
> end subroutine filter
>
> 2)
>
> subroutine filter(H, szh, X, szx, Y, szy)
>
>   implicit none
>   integer(kind=IKIND), intent(in) :: szh, szx, szy
>   real(kind=RKIND), intent(in) :: H(szh), X(szx)
>   real(kind=RKIND), intent(out) :: Y(szy)
>   integer(kind=IKIND) :: i, j, i1, i2
>   integer, parameter :: rkind = RKIND
>
>   do i = 1, szx - szh
>     Y(i) = 0.0_rkind
>     i1 = i
>     i2 = i + szh - 1
>     Y(i) = Y(i) + sum(X(i1:i2) * H)
>   enddo
>
> end subroutine filter
>
> The only difference between the two versions is how the inner reduction
> loop is implemented: in the first it is written out explicitly, while the
> second uses the sum intrinsic, which DragonEgg lowers to plain code. In the
> first case Polly successfully detects the parallel outer loop; in the
> second it cannot.
>
> Now let's look at the IR of both versions as it arrives at the beginning
> of the Polly pass manager in KernelGen: 1.ll and 2.ll. As far as I can see,
> the only difference between the two IRs is that in 1.ll the memory location
> holding the accumulated element is updated after each fma in the inner
> loop, while in 2.ll the accumulator is updated only once, after the inner
> loop finishes, producing a PHI node (or reg2mem allocas, with Polly's
> preopt passes). Interestingly, if you optimize 1.ll further, it turns into
> 2.ll as well.
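>
> To illustrate (this is a hand-written schematic rather than the actual
> 1.ll/2.ll: the value and block names, the double element type and the
> simplified X indexing are all made up), the inner loops look roughly like
> this:
>
>   ; 1.ll style: Y(i) lives in memory and is re-loaded/re-stored on every iteration
>   inner.body:
>     %j = phi i64 [ 0, %inner.preheader ], [ %j.next, %inner.body ]
>     %x.ptr = getelementptr inbounds double* %X, i64 %j
>     %h.ptr = getelementptr inbounds double* %H, i64 %j
>     %x = load double* %x.ptr
>     %h = load double* %h.ptr
>     %mul = fmul double %x, %h
>     %acc.old = load double* %y.ptr             ; read Y(i) back from memory
>     %acc.new = fadd double %acc.old, %mul
>     store double %acc.new, double* %y.ptr      ; update Y(i) after each fma
>     %j.next = add nsw i64 %j, 1
>     %done = icmp eq i64 %j.next, %szh
>     br i1 %done, label %inner.exit, label %inner.body
>
>   ; 2.ll style: the accumulator is a non-IV PHI; Y(i) is stored once, after the loop
>   inner.body:
>     %j = phi i64 [ 0, %inner.preheader ], [ %j.next, %inner.body ]
>     %acc = phi double [ 0.0, %inner.preheader ], [ %acc.next, %inner.body ]
>     ...
>     %acc.next = fadd double %acc, %mul
>     ...
>   inner.exit:
>     store double %acc.next, double* %y.ptr     ; single store after the reduction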
>
> After some attempts to modify Polly's behavior, I concluded that it cannot
> cope with the presence of a non-IV PHI node. Polly tries to produce
> independent basic blocks, gathering dependent instructions into a single
> block and allocating scratch variables. The allocas make the situation even
> worse, because they are always placed in the first block of the function,
> completely killing the potential parallelism.
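>
> For reference, here is roughly what demoting that accumulator PHI to memory
> (reg2mem) looks like, again a hand-written sketch with invented names. The
> scratch slot ends up in the entry block, so every outer-loop iteration reads
> and writes the very same location:
>
>   entry:
>     %acc.slot = alloca double                  ; one slot for the whole function
>     ...
>   outer.body:
>     store double 0.0, double* %acc.slot        ; reset the shared slot, ...
>     ...
>   inner.body:
>     %acc.cur = load double* %acc.slot          ; ... then re-load/re-store it,
>     %acc.upd = fadd double %acc.cur, %mul      ; which looks like a dependence
>     store double %acc.upd, double* %acc.slot   ; carried across outer iterations
>     ...
>   inner.exit:
>     %acc.fin = load double* %acc.slot
>     store double %acc.fin, double* %y.ptr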
>
> Do you see any possible workarounds in this case? Maybe the allocas could
> be placed inside the inner loop, to give it a chance to be parallel?
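>
> I mean something along these lines (hand-written sketch, names invented).
> The exact placement would of course need care, since the accumulator slot
> probably has to live in the outer loop body rather than literally inside
> the innermost loop, but the point is that each outer iteration would then
> get its own private scratch slot instead of sharing the one in the entry
> block:
>
>   outer.body:
>     %acc.slot = alloca double                  ; private slot per outer iteration
>     store double 0.0, double* %acc.slot
>     ...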
>
> Thanks,
> - D.
>