<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - memcpy in loop with known-zero source not optimized well"

   href="https://bugs.llvm.org/show_bug.cgi?id=32168">32168</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>memcpy in loop with known-zero source not optimized well

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Scalar Optimizations

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>arielb1@mail.tau.ac.il

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>IR that does a memcpy in a loop out of a memset(0) is optimized quite badly due

to a bad interaction between several passes.

For example, this IR:

    %Biquad = type { double, double, double, double, double, double, double,

double, double }

    declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8*

nocapture readonly, i64, i32, i1)

    declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i32,

i1)

    define void @initme([5 x %Biquad]*) {

    entry-block:

      %temp = alloca %Biquad, align 8

      %temp_i8 = bitcast %Biquad* %temp to i8*

      call void @llvm.memset.p0i8.i64(i8* %temp_i8, i8 0, i64 72, i32 8, i1

false)

      %p0 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64

0

      %pN = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64 0, i64

5

      br label %slice_loop_body

    slice_loop_body:

      %p = phi %Biquad* [ %p0, %entry-block ], [ %p_next, %slice_loop_body ]

      %p_i8 = bitcast %Biquad* %p to i8*

      call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8, i8* %temp_i8, i64 72, i32

8, i1 false)

      %p_next = getelementptr inbounds %Biquad, %Biquad* %p, i64 1

      %cond = icmp eq %Biquad* %p_next, %pN

      br i1 %cond, label %exit, label %slice_loop_body

    exit:

      ret void

    }

which (as can be demonstrated using the pass sequence "-loop-unroll

-simplifycfg -instcombine -memcpyopt -instcombine") is equivalent to a memset:

    define void @initme([5 x %Biquad]*) {

    entry-block:

      %1 = bitcast [5 x %Biquad]* %0 to i8*

      call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 360, i32 8, i1 false)

      ret void

    }

However, if you use the standard -O2 optimization sequence, you get this

terrible result:

    define void @initme([5 x %Biquad]*) local_unnamed_addr #1 {

    entry-block:

      %p_next.1 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64

0, i64 2

      %p_i8.2 = bitcast %Biquad* %p_next.1 to i8*

      %1 = bitcast [5 x %Biquad]* %0 to i8*

      call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 144, i32 8, i1 false)

      call void @llvm.memset.p0i8.i64(i8* %p_i8.2, i8 0, i64 72, i32 8, i1    

false)

      %p_next.2 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64

0, i64 3

      %p_i8.3 = bitcast %Biquad* %p_next.2 to i8*

      call void @llvm.memset.p0i8.i64(i8* %p_i8.3, i8 0, i64 72, i32 8, i1

false)

      %p_next.3 = getelementptr inbounds [5 x %Biquad], [5 x %Biquad]* %0, i64

0, i64 4

      %p_i8.4 = bitcast %Biquad* %p_next.3 to i8*

      call void @llvm.memset.p0i8.i64(i8* %p_i8.4, i8 0, i64 72, i32 8, i1

false)

      ret void

    }

If the array was larger (say, had 100 members), the code generated would have

been proportionally worse.

The reason this happens is:

1) There is no optimization that turns a memcpy initialized in a different

basic block to memset, and in particular LoopIdiomRecognize can't optimize this

out.

2) Loop unrolling happily unrolls this loop, and generates a "chain" of geps:

      %p_i8 = bitcast %Biquad* %p0 to i8*

      call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8, i8* %temp_i8, i64 72, i32

8, i1 false)

      %p_next = getelementptr inbounds %Biquad, %Biquad* %p0, i64 1

      %p_i8.1 = bitcast %Biquad* %p_next to i8*

      call void @llvm.memcpy.p0i8.p0i8.i64(i8* %p_i8.1, i8* %temp_i8, i64 72,

i32 8, i1 false)

      %p_next.1 = getelementptr inbounds %Biquad, %Biquad* %p_next, i64 1

      ...

3) MemCpyOpt does not handle GEP chains well

(<a href="https://github.com/llvm-mirror/llvm/blob/f33a6990794fc06d1e54c1cbecca0afa0a3d7d7a/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L429">https://github.com/llvm-mirror/llvm/blob/f33a6990794fc06d1e54c1cbecca0afa0a3d7d7a/lib/Transforms/Scalar/MemCpyOptimizer.cpp#L429</a>),

so it only merges the first 2 memcpys. This is normally fine, because

InstCombine (which collapses GEP chains) normally runs before MemCpyOpt, but

here unrolling introduces new chains.

The terrible code generated is causing random slowdowns - e.g.

<a href="https://github.com/rust-lang/rust/issues/40267">https://github.com/rust-lang/rust/issues/40267</a>.</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>