[PATCH] D120277: [SystemZ] Expand some memcpys/memsets into Load/Store sequences.

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Feb 21 12:33:54 PST 2022


jonpa created this revision.
jonpa added a reviewer: uweigand.
Herald added subscribers: ctetreau, steven.zhang, dmgreen, hiraditya.
jonpa requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

For sizes over 16 bytes, MVC is not always efficient, so up to a certain size limit it would be better to expand into a load/store sequence instead.

This is still experimental; two different approaches are included for now, to show what they look like.

Approach 1:

A new TLI hook, prefersVectorSplatForMemset(), avoids the mandatory scalar multiplication as the means of replicating the memset byte and instead directly generates a splat vector. I think this makes sense, and it would probably be even better to just do this whenever the target returns a vector type from getOptimalMemOpType().

Approach 2:

Detect replicated bytes in SystemZTargetLowering::combineSTORE().

If getMemsetStores() generates the multiplies (as it does now, unaltered), they need to be combined away in combineSTORE(). This is used like '-memset-splat=false -byterepl-fix'. This also seems to work, although it is more lines of code. It does seem like it would be useful on its own as well - these are the instruction-count changes I see when using this *without* expanding any memcpys/memsets:

  vsteh          :                 2557                 2875     +318
  vlrepb         :                  187                  475     +288
  llc            :                39057                38771     -286
  sth            :                25792                25515     -277
  mhi            :                 6009                 5741     -268   // multiply
  stg            :               371885               371803      -82
  vstef          :                 5779                 5859      +80
  st             :               122692               122620      -72
  lay            :                54734                54797      +63
  vsteg          :                 6106                 6159      +53
  lg             :               987456               987405      -51
  sthy           :                 1054                 1014      -40
  vlvgp          :                 8300                 8339      +39
  vrepb          :                   95                  134      +39
  vrepib         :                  283                  320      +37
  msgrkc         :                 6741                 6707      -34   // multiply
  iilf           :                 6397                 6364      -33
  msfi           :                 7106                 7082      -24
  vl             :               107362               107381      +19
  ...
  Spill|Reload   :               611703               611679      -24
  Copies         :              1002825              1002832       +7
  
  Example:
  
  -       llc     %r0, 0(%r4)
  -       msrkc   %r0, %r0, %r0
  -       st      %r0, 0(%r1)
  +       vlrepb  %v0, 0(%r4)
  +       vstef   %v0, 0(%r1), 0

Maybe this could even be done in the common DAGCombiner...

LegalAMVecTy and GEPOffsSplit are experimental options I have played with to see how best to fix the problem where a memcpy address is beyond the U12 (unsigned 12-bit) displacement range, so that we now instead get multiple VLs/VSTs which are all out of range. This is a problem that needs to be fixed before this patch can be used, I think. For instance, I see:

  stg     %r2, 8696(%r15                  stg     %r2, 8696(%r15
  lg      %r2, 8816(%r15                  lg      %r2, 8816(%r15
                            >             vl      %v0, 0(%r2), 3
  lay     %r1, 8712(%r15                  lay     %r1, 8712(%r15
  mvc     0(44,%r1), 0(%    |             vst     %v0, 0(%r1), 3
                            >             vl      %v0, 16(%r2), 
                            >             lay     %r1, 8728(%r15
                            >             vst     %v0, 0(%r1), 3
                            >             vl      %v0, 28(%r2)
                            >             lay     %r1, 8740(%r15
                            >             vst     %v0, 0(%r1)

The problem right now is that if the vector type is set to reject long displacements in isLegalAddressingMode(), LSR generates worse code for some loops :-/

The MVI_TYPEFIX code is probably not needed above 16 bytes, but in some cases there were immediates of type i64 that first got loaded into a register instead of being used directly with MVI.

The test cases for now include full ranges of interesting sizes of memcpy/memset.


https://reviews.llvm.org/D120277

Files:
  llvm/include/llvm/CodeGen/TargetLowering.h
  llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
  llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
  llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
  llvm/lib/Target/SystemZ/SystemZISelLowering.h
  llvm/test/CodeGen/SystemZ/memcpy-03.ll
  llvm/test/CodeGen/SystemZ/memset-08.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D120277.410362.patch
Type: text/x-patch
Size: 63394 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220221/21e0fe29/attachment.bin>
