[llvm-bugs] [Bug 33225] New: Lops with memcpy suffer from poor LSR treatment on SystemZ.

Tue May 30 07:37:10 PDT 2017

https://bugs.llvm.org/show_bug.cgi?id=33225

            Bug ID: 33225
           Summary: Lops with memcpy suffer from poor LSR treatment on
                    SystemZ.
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: paulsson at linux.vnet.ibm.com
                CC: llvm-bugs at lists.llvm.org

Created attachment 18538
  --> https://bugs.llvm.org/attachment.cgi?id=18538&action=edit
reduced testcase

This is the same issue as with vector load / store: only a 12 bit displacement
is supported with MVC, which implements the memcpy.

I tried to extend LSR to treat this the same way as with a Load or Store, but
that does not work, since it seems that a memcpy Fixup does not get any Offset,
but it is rather the Formula (of type Basic) which has an UnfoldedOffset.

I tried then with

@@ -1290,6 +1290,11 @@ void Cost::RateFormula(const TargetTransformInfo &TTI,
     if ((isa<LoadInst>(Fixup.UserInst) || isa<StoreInst>(Fixup.UserInst)) &&
         !TTI.isFoldableMemAccessOffset(Fixup.UserInst, Offset))
       NumBaseAdds++;
+
+    if (Offset == 0 && F.UnfoldedOffset != 0 &&
+        isa<MemCpyInst>(Fixup.UserInst) &&
+        !TTI.isFoldableMemAccessOffset(Fixup.UserInst, F.UnfoldedOffset))
+      NumBaseAdds++;

, but this did only handle a very few cases. It did add the right cost, but it
seemed that there was no other formula to be rated higher. 

It turns out that even though this was a fairly small single-block loop
containing basically 8 memcpy calls, the EstimateSearchSpaceComplexity() still
return a too high value, so the formulas with the foldable offsets
unfortunately got pruned.

The loop becomes:

.LBB0_2:                                # %for.body75
                                        # =>This Inner Loop Header: Depth=1
    lay    %r3, -4096(%r1)
    mvc    0(48,%r1), 0(%r3)
    lay    %r3, -288(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -240(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -192(%r2)
    mvc    0(48,%r3), 0
    lay    %r3, -144(%r2)
    mvc    0(48,%r3), 0(%r1)
    mvc    0(48), 0
    mvc    0(48), 0
    mvc    0(48,%r2), 0
    lay    %r1, 8192(%r1)
    la    %r2, 384(%r2)
    j    .LBB0_2

with another heuristic for narrowing the search space (-lsr-exp-narrow), it
seems the better formula is still there:

.LBB0_2:                                # %for.body75
                                        # =>This Inner Loop Header: Depth=1
    lay    %r3, -4096(%r2)
    mvc    0(48,%r1), 0(%r3)
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0
    mvc    0(48,%r1), 0(%r2)
    mvc    0(48), 0
    mvc    0(48), 0
    mvc    0(48,%r1), 0
    lay    %r2, 8192(%r2)
    la    %r1, 384(%r1)
    j    .LBB0_2

This actually works even without my little patch per above. Comparing
-lsr-exp-narrow across SPEC, it increases the total number of load-adress
instructions (unfolded offsets), with or without my patch, so it does not seem
good to just switch generally.

Is there any possibility of squeezing in the offset heuristic somewhere in this
pruning process? Is there anything else here that I missed?

Run reduced test case with

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll

llc -O3 -mtriple=s390x-linux-gnu -mcpu=z13 ./tc_mvcs.ll -lsr-exp-narrow

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20170530/a0ffee6a/attachment-0001.html>