[LLVMdev] [Polly]

Dmitry Mikushin dmitry at kernelgen.org
Sun Feb 3 09:52:48 PST 2013


Dear all,

Yesterday, in a customer's code, I came across a relatively simple case
where Polly is unable to detect a parallelizable outer loop. Consider two
versions of a Fortran routine:

1)

subroutine filter(H, szh, X, szx, Y, szy)

  implicit none
  integer(kind=IKIND), intent(in) :: szh, szx, szy
  real(kind=RKIND), intent(in) :: H(szh), X(szx)
  real(kind=RKIND), intent(out) :: Y(szy)
  integer(kind=IKIND) :: i, j, i1, i2
  integer, parameter :: rkind = RKIND

  do i = 1, szy
    Y(i) = 0.0_rkind
    do j = 1, szh
      Y(i) = Y(i) + X(i + j - 1) * H(j)
    enddo
  enddo

end subroutine filter

2)

subroutine filter(H, szh, X, szx, Y, szy)

  implicit none
  integer(kind=IKIND), intent(in) :: szh, szx, szy
  real(kind=RKIND), intent(in) :: H(szh), X(szx)
  real(kind=RKIND), intent(out) :: Y(szy)
  integer(kind=IKIND) :: i, j, i1, i2
  integer, parameter :: rkind = RKIND

  do i = 1, szx - szh
    Y(i) = 0.0_rkind
    i1 = i
    i2 = i + szh - 1
    Y(i) = Y(i) + sum(X(i1:i2) * H)
  enddo

end subroutine filter

The only difference between the two versions is how the inner reduction
loop is implemented. In the first case it is explicit; in the second case it
uses the sum intrinsic, which DragonEgg lowers to plain code. In the first
case Polly successfully detects the parallel outer loop; in the second case
it cannot.

Now let's look at how the IR appears upon arrival at the beginning of the
Polly pass manager in KernelGen: 1.ll and 2.ll (attached). As far as I can
see, the only difference between the two is that in 1.ll the memory location
of the accumulated element is updated after each multiply-add in the inner
loop, while in 2.ll the accumulator is updated only once, after the inner
loop finishes, producing a PHI node (or reg2mem allocas, with Polly's preopt
passes). Interestingly, if you optimize 1.ll further, it turns into 2.ll as
well.
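
Schematically (a hand-written sketch, not the exact contents of the attached
files; address computations and branches are elided), the two inner loop
bodies look like this:

; 1.ll: Y(i) is re-loaded and stored back on every inner iteration
inner.body:
  %j    = phi i64 [ 1, %outer.body ], [ %j.next, %inner.body ]
  %x    = load double* %x.addr
  %h    = load double* %h.addr
  %prod = fmul double %x, %h
  %acc  = load double* %y.addr
  %sum  = fadd double %acc, %prod
  store double %sum, double* %y.addr
  ; ... induction variable update and back-branch ...

; 2.ll: the accumulator is a loop-carried (non-IV) PHI, and Y(i) is
; stored only once, after the inner loop exits
inner.body:
  %j    = phi i64 [ 1, %outer.body ], [ %j.next, %inner.body ]
  %acc  = phi double [ 0.000000e+00, %outer.body ], [ %sum, %inner.body ]
  %x    = load double* %x.addr
  %h    = load double* %h.addr
  %prod = fmul double %x, %h
  %sum  = fadd double %acc, %prod
  ; ... induction variable update and back-branch ...

inner.exit:
  store double %sum, double* %y.addr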

After some attempts to modify Polly's behavior, I concluded that it cannot
cope with the presence of a non-IV PHI node. Polly attempts to produce
independent basic blocks, gathering dependent instructions into a single
block and allocating scratch variables. The allocas make the situation even
worse, because they are always placed in the first block of the function,
completely killing the potential parallelism.
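
If I understand the demotion correctly, the relevant part of 2.ll then looks
roughly like this (again a sketch, not the actual pass output):

entry:
  %acc.slot = alloca double        ; scratch slot for the demoted PHI,
  ...                              ; created in the entry block

outer.body:
  store double 0.000000e+00, double* %acc.slot
  ...

inner.body:
  %acc  = load double* %acc.slot
  %sum  = fadd double %acc, %prod
  store double %sum, double* %acc.slot
  ...

Because the single %acc.slot is shared by all iterations of the outer loop,
every outer iteration now reads and writes the same memory location.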

Do you see any possible workarounds in this case? Maybe the allocas could be
placed inside the loop nest instead of the entry block, to give the outer
loop a chance to be parallel?
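
Something along these lines is what I have in mind (purely hypothetical
placement):

outer.body:
  %acc.slot = alloca double        ; private scratch slot per outer iteration
  store double 0.000000e+00, double* %acc.slot
  ; ... inner loop as before, loading/storing %acc.slot ...

so that each outer iteration gets its own copy of the scratch variable
instead of sharing the one in the entry block.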

Thanks,
- D.
Attachments:
  1.ll: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130203/a7d73a3b/attachment.obj>
  2.ll: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130203/a7d73a3b/attachment-0001.obj>

