[PATCH] D112004: [SystemZ] Improve codegen for memset
Jonas Paulsson via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 18 08:40:06 PDT 2021
jonpa created this revision.
jonpa added a reviewer: uweigand.
Herald added a subscriber: hiraditya.
jonpa requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.
Memset with a constant length was previously implemented as a single store of the byte followed by a series of MVCs. This patch changes that so that one store of the byte is emitted per MVC, which avoids the data dependencies between the MVCs. An MVI/STC + MVC(len-1) is done for each block.
(A few somewhat unfortunate changes in memset-01.ll (extra STC/MVC) - hopefully not important. E.g. STC;MVC(256) -> STC;MVC(255);STC)
Memset with a variable length is now also handled without a libcall. Since the byte is first stored and MVC then copies from that address, a length of two must now be subtracted instead of one for the loop. An extra check is needed so that a length of one is handled in a special block with just a single MVI/STC (this matches what GCC does).
I put the MBB that handles a memset of length 1 at the end of the MF and also set a low probability on that edge, since otherwise the Branch Probability Basic Block Placement pass would move it back into the sequence. When that happens, the common path (length > 1) involves a taken branch over the length-1 block, which is suboptimal.
I see that GCC prefetches these loops 4 iterations ahead, while clang does 3...
Side note: unfortunately, memset loops currently seem to suffer somewhat from poor register coalescing in cases where the address is used after the memset. This is not specific to this patch; it also affects the memset-0 case (XC loop). I guess that should be handled if possible...
char fun(char *dst, char byte, int len)
{
  memset(dst, byte, len);
  return dst[14];
}
=>
fun: # @fun
# %bb.0: # %entry
aghi %r5, -2
cgije %r5, -2, .LBB0_6
# %bb.1: # %entry
cgije %r5, -1, .LBB0_7
# %bb.2: # %entry
srlg %r0, %r5, 8
lgr %r1, %r2 ##### <<<<<<<<<<<<<<<
cgije %r0, 0, .LBB0_5
# %bb.3:
lgr %r1, %r2 ##### <<<<<<<<<<<<<<< REDUNDANT
.LBB0_4: # %entry
# =>This Inner Loop Header: Depth=1
pfd 2, 768(%r1)
stc %r4, 0(%r1)
mvc 1(255,%r1), 0(%r1)
la %r1, 256(%r1)
brctg %r0, .LBB0_4
...
So far, I know that Early Machine LICM inserts a preheader for the loop since there is none (a preheader can only have one successor: the loop header). This causes the PHI node using the original address register in the loop to have a different predecessor than the one used after the loop (the EXRL block), so the PHI lowering pass inserts two COPYs into different MBBs. Not sure yet why these don't get coalesced in the memset cases, when they are handled for memcpy...
https://reviews.llvm.org/D112004
Files:
llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
llvm/lib/Target/SystemZ/SystemZISelLowering.h
llvm/lib/Target/SystemZ/SystemZInstrFormats.td
llvm/lib/Target/SystemZ/SystemZInstrInfo.td
llvm/lib/Target/SystemZ/SystemZOperators.td
llvm/lib/Target/SystemZ/SystemZSelectionDAGInfo.cpp
llvm/test/CodeGen/SystemZ/memset-01.ll
llvm/test/CodeGen/SystemZ/memset-02.ll
llvm/test/CodeGen/SystemZ/memset-04.ll
llvm/test/CodeGen/SystemZ/memset-07.ll
llvm/test/CodeGen/SystemZ/tail-call-mem-intrinsics.ll