<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Loops with memcpy suffer from poor LSR treatment on SystemZ."

   href="https://bugs.llvm.org/show_bug.cgi?id=33225">33225</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Loops with memcpy suffer from poor LSR treatment on SystemZ.

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Loop Optimizer

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>paulsson@linux.vnet.ibm.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Created <span class=""><a href="attachment.cgi?id=18538" name="attach_18538" title="reduced testcase">attachment 18538</a> <a href="attachment.cgi?id=18538&action=edit" title="reduced testcase">[details]</a></span>

reduced testcase

This is the same issue as with vector load / store: only a 12 bit displacement

is supported with MVC, which implements the memcpy.
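
To make the constraint concrete, here is a minimal sketch (not LLVM code; the
helper names are hypothetical and exist only for illustration). It assumes the
MVC displacement is an unsigned 12-bit field and that LAY takes a signed
20-bit displacement, which is why LAY can form the addresses MVC cannot
encode itself.

```cpp
// Hypothetical helpers, not part of LLVM: they only model displacement ranges.

// MVC addresses each operand as base register + unsigned 12-bit displacement.
constexpr bool fitsUnsigned12(long long Offset) {
  return Offset >= 0 && Offset <= 4095;
}

// LAY accepts a signed 20-bit displacement, so it can materialize an address
// whose offset does not fit in the MVC displacement field.
constexpr bool fitsSigned20(long long Offset) {
  return Offset >= -524288 && Offset <= 524287;
}

// Offsets of the kind seen in the loops quoted in this report.
static_assert(!fitsUnsigned12(-4096), "a negative offset never folds into MVC");
static_assert(fitsUnsigned12(0), "a zero displacement folds");
static_assert(fitsSigned20(-4096), "LAY can compute the address instead");
```

Every offset that fails the first check costs one extra address computation
in the loop body, which is the cost LSR would ideally take into account.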

I tried to extend LSR to treat this the same way as with a Load or Store, but

that does not work, since it seems that a memcpy Fixup does not get any Offset,

but it is rather the Formula (of type Basic) which has an UnfoldedOffset.

I then tried

@@ -1290,6 +1290,11 @@ void Cost::RateFormula(const TargetTransformInfo &TTI,

     if ((isa<LoadInst>(Fixup.UserInst) || isa<StoreInst>(Fixup.UserInst)) &&

         !TTI.isFoldableMemAccessOffset(Fixup.UserInst, Offset))

       NumBaseAdds++;

+

+    if (Offset == 0 && F.UnfoldedOffset != 0 &&

+        isa<MemCpyInst>(Fixup.UserInst) &&

+        !TTI.isFoldableMemAccessOffset(Fixup.UserInst, F.UnfoldedOffset))

+      NumBaseAdds++;

, but this handled only a very few cases. It did add the right cost, but it

seemed that there was no other formula to be rated higher.

It turns out that even though this was a fairly small single-block loop

containing basically 8 memcpy calls, EstimateSearchSpaceComplexity() still

returns too high a value, so the formulas with the foldable offsets

unfortunately got pruned.
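
A simplified model of why a small loop can still exceed the limit (a sketch
only, not the LLVM implementation). It assumes the estimate is roughly the
product of the per-use formula counts, saturating at a fixed limit. The limit
value and the formula counts below are illustrative assumptions, not numbers
measured on the reduced testcase.

```cpp
// Simplified model, not LLVM code. The limit is an assumed value.
constexpr unsigned long long ComplexityLimit = 65535;

// Multiply the number of candidate formulas of each use, saturating at the
// limit so the product cannot grow without bound.
constexpr unsigned long long estimateSearchSpace(const unsigned *FormulaCounts,
                                                 int NumUses) {
  unsigned long long Power = 1;
  for (int I = 0; I < NumUses; ++I) {
    Power *= FormulaCounts[I];
    if (Power >= ComplexityLimit)
      return ComplexityLimit; // too complex: pruning heuristics would kick in
  }
  return Power;
}

// Made-up example: eight uses with four candidate formulas each.
constexpr unsigned EightUses[8] = {4, 4, 4, 4, 4, 4, 4, 4};
static_assert(estimateSearchSpace(EightUses, 8) == ComplexityLimit,
              "4^8 = 65536 already reaches the limit");
```

Under these assumptions the growth is exponential in the number of uses, so
even a single-block loop with a handful of memcpy calls can trigger pruning.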

The loop becomes:

.LBB0_2:                                # %for.body75

                                        # =>This Inner Loop Header: Depth=1

    lay    %r3, -4096(%r1)

    mvc    0(48,%r1), 0(%r3)

    lay    %r3, -288(%r2)

    mvc    0(48,%r3), 0

    lay    %r3, -240(%r2)

    mvc    0(48,%r3), 0

    lay    %r3, -192(%r2)

    mvc    0(48,%r3), 0

    lay    %r3, -144(%r2)

    mvc    0(48,%r3), 0(%r1)

    mvc    0(48), 0

    mvc    0(48), 0

    mvc    0(48,%r2), 0

    lay    %r1, 8192(%r1)

    la    %r2, 384(%r2)

    j    .LBB0_2

With another heuristic for narrowing the search space (-lsr-exp-narrow), it

seems the better formula is still there:

.LBB0_2:                                # %for.body75

                                        # =>This Inner Loop Header: Depth=1

    lay    %r3, -4096(%r2)

    mvc    0(48,%r1), 0(%r3)

    mvc    0(48,%r1), 0

    mvc    0(48,%r1), 0

    mvc    0(48,%r1), 0

    mvc    0(48,%r1), 0(%r2)

    mvc    0(48), 0

    mvc    0(48), 0

    mvc    0(48,%r1), 0

    lay    %r2, 8192(%r2)

    la    %r1, 384(%r1)

    j    .LBB0_2

This actually works even without my small patch above. Across SPEC, however,

-lsr-exp-narrow increases the total number of load-address instructions

(unfolded offsets), with or without my patch, so it does not seem good to

just switch to it generally.

Is there any possibility of squeezing in the offset heuristic somewhere in this

pruning process? Is there anything else here that I missed?

Run reduced test case with

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll -lsr-exp-narrow</pre>

        </div>

      </p>


    </body>

</html>