[PATCH] D55365: [CodeGen] Allow mempcy/memset to generate small overlapping stores.

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Dec 7 21:58:01 PST 2018


pcordes added a comment.

Looks pretty good.  Many of the things I pointed out aren't *problems* with this patch so much as room for further improvement.

But I think we should definitely look at doing both loads first, before either store, like I commented on the AArch64 asm.

Maybe assign a small cost to the interleaved loadu/storeu, load/store ordering, so that register pressure can still result in that code-gen, but in cases of no register pressure we get load/loadu, storeu/store (both loads before either store).

If there are any uarches (maybe some in-order ARM/AArch64?) where alternating load/store is much worse than on high-end Intel/AMD, we'll want to tune that cost, or just force it to always free up a 2nd register for the tmp data.
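
To make the loadu/storeu terminology above concrete, the two orderings for an overlapping copy look roughly like this (AArch64, register numbers made up); the interleaved form can reuse one temporary, the paired form needs two:

```
    // interleaved: reuses one temporary register
    ldur  x10, [x0, #7]
    stur  x10, [x1, #7]
    ldr   x10, [x0]
    str   x10, [x1]

    // paired: both loads before either store, needs a 2nd temporary
    ldr   x10, [x0]
    ldur  x11, [x0, #7]
    stur  x11, [x1, #7]
    str   x10, [x1]
```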



================
Comment at: test/CodeGen/AArch64/arm64-memcpy-inline.ll:20
+; CHECK: ldur [[REG0:w[0-9]+]], [x[[BASEREG:[0-9]+]], #7]
+; CHECK: stur [[REG0]], [x[[BASEREG2:[0-9]+]], #7]
 ; CHECK: ldr [[REG2:x[0-9]+]],
----------------
It's normally best to do both loads before either store (like glibc's memcpy does).  That ordering also allows the same code to work for memmove.

But there are some microarchitectural advantages even without true overlap.

If memory disambiguation on any AArch64 CPUs works like it does on Intel's x86 chips, then when src and dst are offset by a multiple of 4k, the 2nd load will be detected as possibly having a dependency on the unaligned store.

(They appear to overlap based on the low 12 bits of the address, i.e. the offset within a page.  The HW looks for partial matches first and then verifies, because wider content-addressable memory in the store buffer would be expensive, and probably also because the check starts before the TLB translation of the page-number bits is available.)

https://software.intel.com/en-us/vtune-amplifier-help-4k-aliasing

https://software.intel.com/sites/default/files/managed/04/59/TuningGuide_IntelXeonProcessor_ScalableFamily_1stGen.pdf describes the penalty on Intel SKX as ~7 cycles of extra latency to replay the load, with potentially worse penalties when it involves a cache-line split for an unaligned load.  (So there's a throughput cost from replaying the load uop, as well as latency.)

In-order uarches may benefit even more from doing both loads then both stores, hiding more of the latency.
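
For reference, the ordering in the updated CHECK lines above is roughly the following (register numbers made up); the second (aligned) load is the one that can get falsely flagged:

```
    ldur  w8, [x0, #7]      // unaligned load of the tail
    stur  w8, [x1, #7]      // unaligned store
    ldr   x9, [x0]          // can be flagged as depending on the stur above
    str   x9, [x1]          //   when x0 and x1 differ by a multiple of 4k
```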

----

I think this is doing an aligned store for the first 8 bytes of the data being copied; that's good: the most likely consumer of a memcpy is an aligned load from the start of the buffer.  Doing that store last allows store-forwarding to succeed, because all of that data comes from a single store.

So we probably want to do the load of that data first, making a chain of memcpys efficient.

e.g. for a 15-byte copy, we might want:

```
    ldr  x10, [x0]
    ldur  x11, [x0, #7]
    stur  x11, [x1, #7]
    str  x10, [x1]
```

The equivalent of that on x86 is probably best for Intel and AMD's store-forwarding rules.
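
Something like this, assuming src in %rsi and dst in %rdi (register choice is just for illustration):

```
    movq  (%rsi), %rax       # aligned load of the first 8 bytes
    movq  7(%rsi), %rcx      # overlapping unaligned load of the last 8 bytes
    movq  %rcx, 7(%rdi)      # unaligned store first
    movq  %rax, (%rdi)       # aligned store last, so it can forward to an aligned reload
```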


================
Comment at: test/CodeGen/X86/memset-zero.ll:314
 ; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
-; X86-NEXT:    movb $0, 34(%eax)
-; X86-NEXT:    movw $0, 32(%eax)
+; X86-NEXT:    movl $0, 31(%eax)
 ; X86-NEXT:    movl $0, 28(%eax)
----------------
There's a code-size vs. uop-count tradeoff here.  Zeroing one register with a 2-byte `xor %edx,%edx` would save 4 bytes in each of the following `movl $imm32` store instructions.

Especially on CPUs without a uop cache, it may well be a win to have one extra cheap uop go through the pipeline to avoid decode bottlenecks that might limit how far ahead the CPU can "see" in the instruction stream.
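
For this memset that would look something like the following (byte counts in the comments are for these exact 32-bit encodings):

```
    xorl  %edx, %edx          # 2 bytes, and a dependency-breaking zeroing idiom
    movl  %edx, 31(%eax)      # 3 bytes, vs. 7 bytes for movl $0, 31(%eax)
    movl  %edx, 28(%eax)      # likewise for the remaining stores
```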



================
Comment at: test/CodeGen/X86/memset-zero.ll:327
 ; CORE2:       # %bb.0: # %entry
-; CORE2-NEXT:    movb $0, 34(%rdi)
-; CORE2-NEXT:    movw $0, 32(%rdi)
+; CORE2-NEXT:    movl $0, 31(%rdi)
 ; CORE2-NEXT:    movq $0, 24(%rdi)
----------------
In 64-bit mode, Intel CPUs won't micro-fuse an instruction that has an immediate and a rip-relative addressing mode.

So if this were a static object being memset instead of a pointer in a register, each `mov` instruction would decode to 2 separate fused-domain uops (store-address and store-data).

This would make it *definitely* worth it to zero a register and use `movl %ecx, 31+buf(%rip)`, `movq %rcx, 24+buf(%rip)`, etc.  Even on Core2, where register-read stalls can be a problem, this is unlikely to hurt because the register is written right before being read.

Of course you also have the option of doing a RIP-relative LEA (7 bytes) to save 3 bytes per instruction (reg+disp8 instead of RIP+rel32).  But for static data you know the alignment, so you can use `movaps` for all the aligned parts and hopefully end up with few total instructions.
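
e.g. a rough sketch, assuming a hypothetical static 35-byte `buf` that we give 16-byte alignment:

```
    xorps  %xmm0, %xmm0
    movaps %xmm0, buf(%rip)        # bytes 0..15
    movaps %xmm0, 16+buf(%rip)     # bytes 16..31
    xorl   %ecx, %ecx
    movl   %ecx, 31+buf(%rip)      # bytes 31..34, overlapping; register data avoids the imm + RIP-relative case
```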


================
Comment at: test/CodeGen/X86/unaligned-load.ll:39
 ; CORE2-NEXT:    movq %rsi, -{{[0-9]+}}(%rsp)
 ; CORE2-NEXT:    jmp LBB0_1
 ;
----------------
We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary.

We can use `movaps` for the aligned part at least, even if we use scalar stores for the unaligned start/end (potentially overlapping the vector store).

`movaps` is fast on any CPU that has it; only `movups` is slow on pre-Nehalem.

Storing the first few bytes of an object with a 4-byte `mov`-immediate might be bad for store-forwarding if it's an array or struct of 8-byte elements.  (But the x86-64 System V ABI requires that any array on the stack outside a struct has 16-byte alignment if it's at least 16 bytes in size.  So misaligned arrays that we access relative to RSP should normally only happen inside a struct.)
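
The shape we'd want here is roughly the following; the registers, offsets, and the 16-byte-aligned stack slot are all assumptions for the sketch, with the constant source data already in %xmm0/%xmm1/%rax:

```
    movaps  %xmm0, -48(%rsp)       # aligned 16-byte stores cover the aligned middle
    movaps  %xmm1, -32(%rsp)
    movq    %rax, -17(%rsp)        # overlapping scalar store handles the unaligned tail
```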


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55365/new/

https://reviews.llvm.org/D55365




