<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - [SCEV] scev expansion generates redundent code in memcheck during vectorization"

   href="https://llvm.org/bugs/show_bug.cgi?id=29065">29065</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[SCEV] scev expansion generates redundent code in memcheck during vectorization

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Scalar Optimizations

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>wmi@google.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Testcase 1.c:

-----------------------------------

#define COMBP(d, s, m)     ( ((d) & ~(m)) | ((s) & (m)) )

void foo(unsigned *dd, int dw, int dh, unsigned *ds, int dp, int sp, int sx,

int sy, int dx, int dy, unsigned lwmask) {

  int i, j;

  unsigned *ls, *ld;

  int nw = dw >> 5;

  int lwbits = dw & 31;

  unsigned *psd = ds + sp * sy + (sx >> 5);

  unsigned *pdd = dd + dp * dy + (dx >> 5);

  for (i = 0; i < dh; i++) {

    ls = psd + i * sp;

    ld = pdd + i * dp;

    for (j = 0; j < nw; j++) {

      *ld = (*ls & *ld);

      ld++;

      ls++;

    }

  }

}

------------------------------------

For the memcheck of innerloop vectorization, we expect the code like this:

ls = psd + i * sp;

ld = pdd + i * dp;

if (ls < ld + nw || ld < ls + nw)

  goto conflict. 

However, some computation defining psd and pdd are regenerated inside preheader

of innerloop, and these redundent computations cannot be cleaned up in later

pass.

for.body:                                         ; preds = %for.inc21,

%for.body.lr.ph

  %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.inc21

]

  %10 = mul i64 %0, %indvars.iv

  %11 = add i64 %5, %10

  %scevgep = getelementptr i32, i32* %dd, i64 %11  ==> The add involving %dd

has been computed outside of loop when defining pdd. why not directly use pdd

here.

  %scevgep51 = bitcast i32* %scevgep to i8*

  %12 = add i64 %7, %10

  %scevgep52 = getelementptr i32, i32* %dd, i64 %12  ==> The add involving %dd

has been computed outside of loop when defininng pdd. why not directly use pdd

here.

  %scevgep5253 = bitcast i32* %scevgep52 to i8*

  %13 = mul i64 %1, %indvars.iv

  %14 = add i64 %8, %13

  %scevgep54 = getelementptr i32, i32* %ds, i64 %14  ==> The add involving %ds

has been computed outside of loop when defining psd. why not directly use psd

here.

  %scevgep5455 = bitcast i32* %scevgep54 to i8*

  %15 = add i64 %9, %13

  %scevgep56 = getelementptr i32, i32* %ds, i64 %15  ==> The add involving %ds

has been computed outside of loop when defining psd. why not directly use psd

here.

  %scevgep5657 = bitcast i32* %scevgep56 to i8*

  br i1 %cmp1742, label %for.body18.preheader, label %for.inc21

 ...

vector.memcheck:                                  ; preds = %min.iters.checked

  %bound0 = icmp ule i8* %scevgep51, %scevgep5657

  %bound1 = icmp ule i8* %scevgep5455, %scevgep5253

  %found.conflict = and i1 %bound0, %bound1

  %memcheck.conflict = and i1 %found.conflict, true

Final assembly for the memcheck:

# BB#5:                                 # %vector.memcheck

                                        #   in Loop: Header=BB0_2 Depth=1

        movq    %r9, %r10

        movq    %r13, %r9

        movq    %rbp, %r13

        movq    -48(%rsp), %rdx         # 8-byte Reload

        leaq    (%rdx,%rbx), %rbp

        leaq    (%r13,%rbp,4), %r15

        movq    -56(%rsp), %rdx         # 8-byte Reload

        leaq    (%rdx,%rax), %rdx

        movq    -104(%rsp), %rbp        # 8-byte Reload

        leaq    (%rbp,%rdx,4), %rdx

        cmpq    %rdx, %r15

        movq    -104(%rsp), %rbp        # 8-byte Reload

        ja      .LBB0_8

# BB#6:                                 # %vector.memcheck

                                        #   in Loop: Header=BB0_2 Depth=1

        addq    -72(%rsp), %rbx         # 8-byte Folded Reload

        leaq    (%r13,%rbx,4), %rdx

        addq    -64(%rsp), %rax         # 8-byte Folded Reload

        leaq    (%rbp,%rax,4), %rax

        cmpq    %rdx, %rax

        ja      .LBB0_8

When the iteration number of innerloop is not large enough, the preparation

code for vectorization/unroll in the preheader of innerloop matters. Such

preparation code to compute loop iteration number or generate runtime check is

often generated by SCEV expansion. Here the redundency exists because SCEV

expansion doesn't reuse existing value psd and pdd.

* D12090 and D21313 relieved the reuse problem of SCEV expansion somewhat, but

why they are defeated here?

  New SCEV expr is created after decomposition and combination of original

SCEVs, without those SCEVs appearing inside of the new SCEV expr. An example,

an old SCEVAddExpr S which is {a + b} has an existing value v = a + b

associated with it. When we generate SCEV for expr S + c, a new SCEVAddExpr

with three operands will be created {a + b + c} and no S can be found in the

new SCEVAddExpr. When we expand the new SCEV, it will only generate a + b + c

instead of v + c.

* Why such redundencies cannot be cleaned up by later passes?

  SCEV transformation can make the expanded expr a lot different from the expr

to be reused, for example, SCEV expansion wants to turn things like

ptrtoint+arithmetic+inttoptr into GEP so expr may be aggressively reassociated.

After that, enabling extra-vectorizer-passes and even NaryReassociate cannot

clean up the redundencies.

To solve the problem, adding more facility to enhance the reuse during SCEV

expansion or enhancing cleanup passes, I am wondering which way is better to

persue.</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>