[PATCH] D55365: [CodeGen] Allow memcpy/memset to generate small overlapping stores.
Clement Courbet via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Dec 11 04:56:58 PST 2018
courbet marked 3 inline comments as done.
courbet added a comment.
Thanks for the comments!
================
Comment at: test/CodeGen/AArch64/arm64-memcpy-inline.ll:20
+; CHECK: ldur [[REG0:w[0-9]+]], [x[[BASEREG:[0-9]+]], #7]
+; CHECK: stur [[REG0]], [x[[BASEREG2:[0-9]+]], #7]
; CHECK: ldr [[REG2:x[0-9]+]],
----------------
pcordes wrote:
> It's normally best to do both loads before either store (like glibc's memcpy does). This allows the same code to work for memmove.
>
> But there are some microarchitectural advantages even without true overlap.
>
> If memory disambiguation on any AArch64 CPUs works like it does on Intel's x86 chips, then when src and dst are offset by a multiple of 4k, the 2nd load will be detected as possibly having a dependency on the unaligned store.
>
> (Because they overlap based on the low 12 bits of the address: offset within a page. The HW looks for partial matches first and then verifies because wider content-addressable memory in the store buffer would be expensive. Probably also because it starts checking before the TLB translation of the page-number bits is available.)
>
> https://software.intel.com/en-us/vtune-amplifier-help-4k-aliasing
>
> https://software.intel.com/sites/default/files/managed/04/59/TuningGuide_IntelXeonProcessor_ScalableFamily_1stGen.pdf describes the penalty on Intel SKX as ~7 cycles of extra latency to replay the load, with potentially worse penalties when it involves a cache-line split for an unaligned load. (So there's a throughput cost from replaying the load uop, as well as latency.)
>
> In-order uarches may benefit even more from doing both loads then both stores, hiding more of the latency.
>
> ----
>
> I think this is doing an aligned store for the first 8 bytes of the data being copied; that's good: the most likely consumer of a memcpy is an aligned load from the start of the buffer. Doing that store last allows store-forwarding to succeed, because all the data comes from one store.
>
> So probably we want to do the load of that data first, making a chain of memcpy efficient.
>
> e.g. for a 15-byte copy, we might want:
>
> ```
> ldr x10, [x0]
> ldur x11, [x0, #7]
> stur x11, [x1, #7]
> str x10, [x1]
> ```
>
> The equivalent of that on x86 is probably best for Intel and AMD's store-forwarding rules.
Thanks, I've filed PR39953.
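For reference, a rough x86-64 equivalent of that ordering (just a sketch on my part; the register choices and the %rdi=dst / %rsi=src convention are assumptions, not generated output):

```
movq    (%rsi), %rax        # load the first 8 bytes (aligned)
movq    7(%rsi), %rcx       # load the last 8 bytes (overlaps the first load by 1 byte)
movq    %rcx, 7(%rdi)       # store the overlapping tail first
movq    %rax, (%rdi)        # aligned store last, so a later aligned load can forward from it
```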
================
Comment at: test/CodeGen/X86/memset-zero.ll:327
; CORE2: # %bb.0: # %entry
-; CORE2-NEXT: movb $0, 34(%rdi)
-; CORE2-NEXT: movw $0, 32(%rdi)
+; CORE2-NEXT: movl $0, 31(%rdi)
; CORE2-NEXT: movq $0, 24(%rdi)
----------------
pcordes wrote:
> In 64-bit mode, Intel CPUs won't micro-fuse an instruction that has an immediate and a rip-relative addressing mode.
>
> So if this were a static object being memset instead of a pointer in a register, each `mov` instruction would decode to 2 separate fused-domain uops (store-address and store-data).
>
> This would make it *definitely* worth it to zero a register and use `movl %ecx, 31+buf(%rip)`, `movq %rcx, 24+buf(%rip)`, etc. Even on Core2 where register-read stalls can be a problem, this is unlikely to hurt because it's written right before being read.
>
> Of course you also have the option of doing a RIP-relative LEA (7 bytes) to save 3 bytes per instruction (reg+disp8 instead of RIP+rel32). But for static data you know the alignment, so you can use `movaps` for all the aligned parts and hopefully end up with few total instructions.
IIUC this would be addressed if we fix PR24448, right?
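For the record, a sketch of the register-based form described above (hypothetical; `buf` is just the placeholder name from the example for a static object being memset):

```
xorl    %ecx, %ecx              # zero a register once, then reuse it as the store data
movq    %rcx, 24+buf(%rip)      # register store micro-fuses with the RIP-relative address
movl    %ecx, 31+buf(%rip)      # the $0-immediate form would decode to 2 fused-domain uops each
```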
================
Comment at: test/CodeGen/X86/unaligned-load.ll:39
; CORE2-NEXT: movq %rsi, -{{[0-9]+}}(%rsp)
; CORE2-NEXT: jmp LBB0_1
;
----------------
pcordes wrote:
> We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary.
>
> We can use `movaps` for the aligned part at least, even if we use scalar stores for the unaligned start/end (potentially overlapping the vector store).
>
> `movaps` is fast on any CPU that has it; only `movups` is slow on pre-Nehalem.
>
> Storing the first few bytes of an object with a 4-byte `mov`-immediate might be bad for store-forwarding if it's an array or struct of 8-byte elements. (But the x86-64 System V ABI requires that any array on the stack outside a struct has 16-byte alignment if it's at least 16 bytes in size. So misaligned arrays that we access relative to RSP should normally only happen inside a struct.)
> We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary.
Good point. I've added a test in this file to show what happens when the data is aligned: we're also failing to select movabs. I've filed PR39952.
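To illustrate what I understand the suggestion to be (a rough sketch only; the 31-byte size and the offsets are made up for the example), zeroing a 16-byte-aligned stack slot could look like:

```
xorps   %xmm0, %xmm0        # zeroed vector register
movaps  %xmm0, (%rsp)       # aligned 16-byte store; movaps is fine even on Core2
movq    $0, 16(%rsp)        # scalar stores for the unaligned tail...
movq    $0, 23(%rsp)        # ...overlapping by 1 byte to cover bytes 16-30 of a 31-byte object
```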
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D55365/new/
https://reviews.llvm.org/D55365