[PATCH] D55365: [CodeGen] Allow memcpy/memset to generate small overlapping stores.
Clement Courbet via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Dec 11 04:56:58 PST 2018
courbet marked 3 inline comments as done.
courbet added a comment.
Thanks for the comments!
================
Comment at: test/CodeGen/AArch64/arm64-memcpy-inline.ll:20
+; CHECK: ldur [[REG0:w[0-9]+]], [x[[BASEREG:[0-9]+]], #7]
+; CHECK: stur [[REG0]], [x[[BASEREG2:[0-9]+]], #7]
; CHECK: ldr [[REG2:x[0-9]+]],
----------------
pcordes wrote:
> It's normally best to do both loads before either store (like glibc's memcpy does). This allows the same code to work for memmove.
>
> But there are some microarchitectural advantages even without true overlap.
>
> If memory disambiguation on any AArch64 CPUs works like it does on Intel's x86 chips, then when src and dst are offset by a multiple of 4k, the 2nd load will be detected as possibly having a dependency on the unaligned store.
>
> (Because they overlap based on the low 12 bits of the address: offset within a page. The HW looks for partial matches first and then verifies because wider content-addressable memory in the store buffer would be expensive. Probably also because it starts checking before the TLB translation of the page-number bits is available.)
>
> https://software.intel.com/en-us/vtune-amplifier-help-4k-aliasing
>
> https://software.intel.com/sites/default/files/managed/04/59/TuningGuide_IntelXeonProcessor_ScalableFamily_1stGen.pdf describes the penalty on Intel SKX as ~7 cycles of extra latency to replay the load, with potentially worse penalties when it involves a cache-line split for an unaligned load. (So there's a throughput cost from replaying the load uop, as well as latency.)
>
> In-order uarches may benefit even more from doing both loads then both stores, hiding more of the latency.
>
> ----
>
> I think this is doing an aligned store for the first 8 bytes of the data being copied; that's good: the most likely consumer of a memcpy is an aligned load from the start of the buffer. Doing that store last allows store-forwarding to succeed, because all the data comes from one store.
>
> So probably we want to do the load of that data first, making a chain of memcpy efficient.
>
> e.g. for a 15-byte copy, we might want:
>
> ```
> ldr x10, [x0]
> ldur x11, [x0, #7]
> stur x11, [x1, #7]
> str x10, [x1]
> ```
>
> The equivalent of that on x86 is probably best for Intel and AMD's store-forwarding rules.
Thanks, I've filed PR39953.
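For reference, a rough x86-64 equivalent of that ordering (just a sketch on my part; the register choices and the %rdi=dst / %rsi=src convention are assumptions, not generated output):

```
movq    (%rsi), %rax        # load the first 8 bytes (aligned)
movq    7(%rsi), %rcx       # load the last 8 bytes (overlaps the first load by 1 byte)
movq    %rcx, 7(%rdi)       # store the overlapping tail first
movq    %rax, (%rdi)        # aligned store last, so a later aligned load can forward from it
```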
================
Comment at: test/CodeGen/X86/memset-zero.ll:327
; CORE2: # %bb.0: # %entry
-; CORE2-NEXT: movb $0, 34(%rdi)
-; CORE2-NEXT: movw $0, 32(%rdi)
+; CORE2-NEXT: movl $0, 31(%rdi)
; CORE2-NEXT: movq $0, 24(%rdi)
----------------
pcordes wrote:
> In 64-bit mode, Intel CPUs won't micro-fuse an instruction that has an immediate and a rip-relative addressing mode.
>
> So if this were a static object being memset instead of a pointer in a register, each `mov` instruction would decode to 2 separate fused-domain uops (store-address and store-data).
>
> This would make it *definitely* worth it to zero a register and use `movl %ecx, 31+buf(%rip)`, `movq %rcx, 24+buf(%rip)`, etc. Even on Core2 where register-read stalls can be a problem, this is unlikely to hurt because it's written right before being read.
>
> Of course you also have the option of doing a RIP-relative LEA (7 bytes) to save 3 bytes per instruction (reg+disp8 instead of RIP+rel32). But for static data you know the alignment, so you can use `movaps` for all the aligned parts and hopefully end up with few total instructions.
IIUC this would be addressed if we fix PR24448, right?
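For the record, a sketch of the register-based form described above (hypothetical; `buf` is just the placeholder name from the example for a static object being memset):

```
xorl    %ecx, %ecx              # zero a register once, then reuse it as the store data
movq    %rcx, 24+buf(%rip)      # register store micro-fuses with the RIP-relative address
movl    %ecx, 31+buf(%rip)      # the $0-immediate form would decode to 2 fused-domain uops each
```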
================
Comment at: test/CodeGen/X86/unaligned-load.ll:39
; CORE2-NEXT: movq %rsi, -{{[0-9]+}}(%rsp)
; CORE2-NEXT: jmp LBB0_1
;
----------------
pcordes wrote:
> We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary.
>
> We can use `movaps` for the aligned part at least, even if we use scalar stores for the unaligned start/end (potentially overlapping the vector store).
>
> `movaps` is fast on any CPU that has it; only `movups` is slow on pre-Nehalem.
>
> Storing the first few bytes of an object with a 4-byte `mov`-immediate might be bad for store-forwarding if it's an array or struct of 8-byte elements. (But the x86-64 System V ABI requires that any array on the stack outside a struct has 16-byte alignment if it's at least 16 bytes in size. So misaligned arrays that we access relative to RSP should normally only happen inside a struct.)
> We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary.
Good point. I've added a test in this file to show what happens when the data is aligned: we're also failing to select movabs. I've filed PR39952.
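To illustrate what I understand the suggestion to be (a rough sketch only; the 31-byte size and the offsets are made up for the example), zeroing a 16-byte-aligned stack slot could look like:

```
xorps   %xmm0, %xmm0        # zeroed vector register
movaps  %xmm0, (%rsp)       # aligned 16-byte store; movaps is fine even on Core2
movq    $0, 16(%rsp)        # scalar stores for the unaligned tail...
movq    $0, 23(%rsp)        # ...overlapping by 1 byte to cover bytes 16-30 of a 31-byte object
```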
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D55365/new/
https://reviews.llvm.org/D55365