[llvm-bugs] [Bug 29065] New: [SCEV] scev expansion generates redundant code in memcheck during vectorization

Fri Aug 19 14:15:26 PDT 2016

https://llvm.org/bugs/show_bug.cgi?id=29065

            Bug ID: 29065
           Summary: [SCEV] scev expansion generates redundant code in
                    memcheck during vectorization
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: Scalar Optimizations
          Assignee: unassignedbugs at nondot.org
          Reporter: wmi at google.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Testcase 1.c:
-----------------------------------
#define COMBP(d, s, m)     ( ((d) & ~(m)) | ((s) & (m)) )

void foo(unsigned *dd, int dw, int dh, unsigned *ds, int dp, int sp, int sx,
int sy, int dx, int dy, unsigned lwmask) {
  int i, j;
  unsigned *ls, *ld;
  int nw = dw >> 5;
  int lwbits = dw & 31;
  unsigned *psd = ds + sp * sy + (sx >> 5);
  unsigned *pdd = dd + dp * dy + (dx >> 5);
  for (i = 0; i < dh; i++) {
    ls = psd + i * sp;
    ld = pdd + i * dp;
    for (j = 0; j < nw; j++) {
      *ld = (*ls & *ld);
      ld++;
      ls++;
    }
  }
}
------------------------------------

For the memcheck of the inner-loop vectorization, we expect code like this:
ls = psd + i * sp;
ld = pdd + i * dp;
if (ls < ld + nw && ld < ls + nw)
  goto conflict;
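
The expected check can be sketched in C as follows. This is only a sketch: the helper names ranges_overlap and row_conflicts are hypothetical and appear neither in the testcase nor in LLVM. The point is that each row pointer is built from the already-computed psd/pdd with one multiply and one add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the half-open element ranges
 * [a, a + n) and [b, b + n) share at least one element.
 * Addresses are compared as integers so that pointers into unrelated
 * objects can be compared without undefined behavior. */
static int ranges_overlap(const unsigned *a, const unsigned *b, size_t n) {
  uintptr_t a_lo = (uintptr_t)a;
  uintptr_t b_lo = (uintptr_t)b;
  uintptr_t bytes = (uintptr_t)(n * sizeof(unsigned));
  return a_lo < b_lo + bytes && b_lo < a_lo + bytes;
}

/* Hypothetical helper: the per-row check, reusing the hoisted bases
 * psd and pdd instead of recomputing them from ds and dd. */
static int row_conflicts(const unsigned *psd, const unsigned *pdd,
                         int sp, int dp, int i, int nw) {
  assert(nw >= 0);
  const unsigned *ls = psd + (ptrdiff_t)i * sp;
  const unsigned *ld = pdd + (ptrdiff_t)i * dp;
  return ranges_overlap(ls, ld, (size_t)nw);
}
```

Two rows conflict only when both comparisons hold, which matches the `and` of %bound0 and %bound1 in the generated memcheck.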

However, some of the computations that define psd and pdd are regenerated in
the preheader of the inner loop, and later passes cannot clean up these
redundant computations.

for.body:                                         ; preds = %for.inc21, %for.body.lr.ph
  %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.inc21 ]
  %10 = mul i64 %0, %indvars.iv
  %11 = add i64 %5, %10
  %scevgep = getelementptr i32, i32* %dd, i64 %11
  ; ==> The add involving %dd has already been computed outside of the loop
  ;     when defining pdd. Why not use pdd directly here?
  %scevgep51 = bitcast i32* %scevgep to i8*
  %12 = add i64 %7, %10
  %scevgep52 = getelementptr i32, i32* %dd, i64 %12
  ; ==> The add involving %dd has already been computed outside of the loop
  ;     when defining pdd. Why not use pdd directly here?
  %scevgep5253 = bitcast i32* %scevgep52 to i8*
  %13 = mul i64 %1, %indvars.iv
  %14 = add i64 %8, %13
  %scevgep54 = getelementptr i32, i32* %ds, i64 %14
  ; ==> The add involving %ds has already been computed outside of the loop
  ;     when defining psd. Why not use psd directly here?
  %scevgep5455 = bitcast i32* %scevgep54 to i8*
  %15 = add i64 %9, %13
  %scevgep56 = getelementptr i32, i32* %ds, i64 %15
  ; ==> The add involving %ds has already been computed outside of the loop
  ;     when defining psd. Why not use psd directly here?
  %scevgep5657 = bitcast i32* %scevgep56 to i8*
  br i1 %cmp1742, label %for.body18.preheader, label %for.inc21
 ...
vector.memcheck:                                  ; preds = %min.iters.checked
  %bound0 = icmp ule i8* %scevgep51, %scevgep5657
  %bound1 = icmp ule i8* %scevgep5455, %scevgep5253
  %found.conflict = and i1 %bound0, %bound1
  %memcheck.conflict = and i1 %found.conflict, true

Final assembly for the memcheck:
# BB#5:                                 # %vector.memcheck
                                        #   in Loop: Header=BB0_2 Depth=1
        movq    %r9, %r10
        movq    %r13, %r9
        movq    %rbp, %r13
        movq    -48(%rsp), %rdx         # 8-byte Reload
        leaq    (%rdx,%rbx), %rbp
        leaq    (%r13,%rbp,4), %r15
        movq    -56(%rsp), %rdx         # 8-byte Reload
        leaq    (%rdx,%rax), %rdx
        movq    -104(%rsp), %rbp        # 8-byte Reload
        leaq    (%rbp,%rdx,4), %rdx
        cmpq    %rdx, %r15
        movq    -104(%rsp), %rbp        # 8-byte Reload
        ja      .LBB0_8
# BB#6:                                 # %vector.memcheck
                                        #   in Loop: Header=BB0_2 Depth=1
        addq    -72(%rsp), %rbx         # 8-byte Folded Reload
        leaq    (%r13,%rbx,4), %rdx
        addq    -64(%rsp), %rax         # 8-byte Folded Reload
        leaq    (%rbp,%rax,4), %rax
        cmpq    %rdx, %rax
        ja      .LBB0_8

When the iteration count of the inner loop is not large enough, the preparation
code for vectorization/unrolling in the preheader of the inner loop matters.
Such preparation code, which computes the loop iteration count or generates the
runtime check, is often generated by SCEV expansion. Here the redundancy exists
because SCEV expansion doesn't reuse the existing values psd and pdd.
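
To make the redundancy concrete, here is a minimal C sketch contrasting the two shapes of the destination row address. The function names are hypothetical, and the mapping of variables to IR values (%5, %10, %11) is an assumption based on reading the IR above, not something the dump states.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of the expanded code: start again from the base pointer dd and
 * rebuild the whole offset inside the outer loop. */
static unsigned *row_recomputed(unsigned *dd, int dp, int dy, int dx, int i) {
  /* Assumed to correspond to %5: loop-invariant part of the offset. */
  ptrdiff_t invariant = (ptrdiff_t)dp * dy + (dx >> 5);
  /* Assumed to correspond to %10: dp scaled by the outer induction variable. */
  ptrdiff_t scaled = (ptrdiff_t)dp * i;
  /* Assumed to correspond to %11 and %scevgep. */
  return dd + (invariant + scaled);
}

/* Shape we would like: pdd already holds dd + dp * dy + (dx >> 5),
 * so only the scaled part has to be added. */
static unsigned *row_reused(unsigned *pdd, int dp, int i) {
  assert(pdd != NULL);
  return pdd + (ptrdiff_t)dp * i;
}
```

Both functions return the same address for in-range inputs; the second one simply starts from the value that was already computed when pdd was defined.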

* D12090 and D21313 relieved the reuse problem of SCEV expansion somewhat, so
why are they defeated here?

  A new SCEV expr is created by decomposing and recombining the original
SCEVs, so the original SCEVs do not appear inside the new SCEV expr. For
example, an old SCEVAddExpr S, which is {a + b}, has an existing value
v = a + b associated with it. When we generate the SCEV for the expr S + c, a
new SCEVAddExpr with three operands, {a + b + c}, is created, and S cannot be
found in it. When we expand the new SCEV, it generates only a + b + c instead
of v + c.
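
The effect can be illustrated with a toy model in C. This is not LLVM's data structure or API: an add expression is modeled as a bitmask of leaf operands (ignoring repeated operands and coefficients), and all names are hypothetical.

```c
#include <assert.h>

/* Toy leaf operands, one bit each. */
enum { SYM_A = 1u << 0, SYM_B = 1u << 1, SYM_C = 1u << 2 };

/* Count the leaf operands present in a toy add expression. */
static int count_operands(unsigned mask) {
  int n = 0;
  while (mask) {
    n += (int)(mask & 1u);
    mask >>= 1;
  }
  return n;
}

/* Building S + c flattens everything into one n-ary add, so the
 * sub-expression S = {a + b} is no longer visible as an operand. */
static unsigned flatten_add(unsigned lhs, unsigned rhs) {
  return lhs | rhs;
}

/* Naive expansion: one add instruction per operand beyond the first. */
static int naive_add_count(unsigned expr) {
  assert(expr != 0);
  return count_operands(expr) - 1;
}

/* Reuse-aware expansion: if an already-materialized value covers a
 * subset of the operands, start from it and add only the rest. */
static int reuse_add_count(unsigned expr, const unsigned *cached, int n) {
  int best = naive_add_count(expr);
  for (int i = 0; i < n; i++) {
    unsigned sub = cached[i];
    if (sub != 0 && (sub & expr) == sub) {
      int cost = count_operands(expr & ~sub);
      if (cost < best)
        best = cost;
    }
  }
  return best;
}
```

With v = a + b cached, the flattened {a + b + c} costs two adds under naive expansion and one add (v + c) when the cached value is matched against a subset of the operands.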

* Why can't such redundancies be cleaned up by later passes?

  SCEV transformation can make the expanded expr very different from the expr
to be reused. For example, SCEV expansion wants to turn things like
ptrtoint+arithmetic+inttoptr into GEP, so the expr may be aggressively
reassociated. After that, enabling extra-vectorizer-passes and even
NaryReassociate cannot clean up the redundancies.

There are two ways to solve the problem: adding more facilities to enhance
reuse during SCEV expansion, or enhancing the cleanup passes. I am wondering
which way is better to pursue.
