[PATCH] D106874: [SystemZ] Implement memcpy with variable length with MVC

Tue Jul 27 06:59:36 PDT 2021

jonpa created this revision.
jonpa added a reviewer: uweigand.
Herald added a subscriber: hiraditya.
jonpa requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

Instead of making a libcall to memcpy, emit an MVC loop followed by an EXRL instruction, the same way as is already done for memset 0.
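
For readers less familiar with the pattern, here is a rough C model of what such an MVC loop plus EXRL sequence computes. It is only a sketch, not the code in the patch: the name mvc_memcpy_model is hypothetical, and it assumes the EXRL target is an MVC whose length field is zero, so that the low 8 bits of (length - 1) select a tail of 1 to 256 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the emitted sequence; not LLVM code. */
void *mvc_memcpy_model(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (len == 0)                    /* early return for a zero length */
        return dst;
    assert(dst != NULL && src != NULL);

    size_t len_m1 = len - 1;         /* MVC encodes "length minus one" */
    size_t blocks = len_m1 >> 8;     /* number of full 256-byte MVCs */

    while (blocks-- > 0) {           /* the counted loop */
        memcpy(d, s, 256);           /* one mvc 0(256,dst),0(src) */
        d += 256;
        s += 256;
    }

    /* Assumed EXRL behaviour: the low 8 bits of len-1 are ORed into the
       length field of the target MVC, which then copies that value + 1
       bytes. */
    memcpy(d, s, (len_m1 & 0xff) + 1);
    return dst;
}
```

With this split, a length of 256 runs no loop iterations and copies all 256 bytes in the tail, while a length of 257 runs one iteration and leaves a 1-byte tail.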

On preliminary measurements, this seemed to be a slight overall improvement.

I also tried some different prefetch settings on a (quick) SPEC run for both Write and Read (compared to master, which has only Write 768):

Overall results (by average over benchmarks):

  z14:
  2017_B_Memcpy_pfd_w_0_pfd_r_0                                             99.856 %
  2017_E_Memcpy_pfd_w_2048_pfd_r_0                                          99.920 %
  2017_C_Memcpy_pfd_w_768_pfd_r_768                                         99.978 %
  2017_D_Memcpy_pfd_w_2048_pfd_r_2048                                       99.986 %
  2017_F_Memcpy_pfd_w_524287_pfd_r_524287                                   100.426 %

  z15:
  2017_E_Memcpy_pfd_w_2048_pfd_r_0                                          99.941 %
  2017_B_Memcpy_pfd_w_0_pfd_r_0                                             99.941 %
  2017_D_Memcpy_pfd_w_2048_pfd_r_2048                                       100.043 %
  2017_C_Memcpy_pfd_w_768_pfd_r_768                                         100.053 %
  2017_F_Memcpy_pfd_w_524287_pfd_r_524287                                   100.313 %

I also tried to do a runtime check for a big size like:

  f17:                                    # @f17
          .cfi_startproc
  # %bb.0:
          aghi    %r4, -1
          cgibe   %r4, -1, 0(%r14)
  .LBB16_1:
          srlg    %r0, %r4, 8
          cgije   %r0, 0, .LBB16_4
  # %bb.2:
          lghi    %r1, 0
          cgfi    %r4, 2000000
          locghihe        %r1, 1
          sllg    %r1, %r1, 22
  .LBB16_3:                               # =>This Inner Loop Header: Depth=1
          pfd     2, 0(%r1,%r2)
          mvc     0(256,%r2), 0(%r3)
          la      %r2, 256(%r2)
          la      %r3, 256(%r3)
          brctg   %r0, .LBB16_3
  .LBB16_4:
          exrl    %r4, .Ltmp0
          br      %r14

The idea was to prefetch for the L2 cache (4M) if the size was bigger than 2M, as a check to see if this could give anything. However, it did not seem to improve any benchmark with W, R, or W+R prefetching per this pattern.
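
As a reading aid for the f17 listing, the lghi/cgfi/locghihe/sllg sequence that sets up %r1 can be modeled in C roughly as follows. This is a sketch only: pfd_offset_model is a hypothetical name, and it assumes the compare is a signed one against the already decremented length (length - 1) held in %r4.

```c
#include <stdint.h>

/* Hypothetical model of how %r1 (the pfd index register) is computed. */
uint64_t pfd_offset_model(int64_t len_minus_1)
{
    /* lghi %r1,0 ; cgfi %r4,2000000 ; locghihe %r1,1 */
    uint64_t flag = (len_minus_1 >= 2000000) ? 1 : 0;

    /* sllg %r1,%r1,22 : either 0 or 1 << 22 = 4194304 bytes (4M) */
    return flag << 22;
}
```

So for large copies each loop iteration prefetches 4M ahead of the current destination address, and for smaller ones the prefetch address is the current destination itself.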

Keeping the prefetching as it was with this patch was slightly better overall on z15, while going without it was slightly better on z14, so there do not seem to be any major gains to be had from changing the MVC prefetching.

https://reviews.llvm.org/D106874

Files:
  llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
  llvm/lib/Target/SystemZ/SystemZSelectionDAGInfo.cpp
  llvm/test/CodeGen/SystemZ/loop-03.ll
  llvm/test/CodeGen/SystemZ/memcpy-01.ll
  llvm/test/CodeGen/SystemZ/tail-call-mem-intrinsics.ll
