[PATCH] D32002: [X86] Improve large struct pass by value performance

Clement Courbet via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 18 02:53:30 PDT 2017


courbet marked 2 inline comments as done.
courbet added inline comments.


================
Comment at: lib/Target/X86/X86.td:281
+          "fast-string", "HasFastString", "true",
+          "REP MOVS/STOS are fast">;
+
----------------
RKSimon wrote:
> Do we want the feature to be as simple as 'fast/slow' or should we take the size of the copy into account as well?
There are two sides to this flag:
 - Using REPMOVSB instead of REPMOVSQ: when this flag is true, the code suggested in this patch is always more efficient, regardless of the size.
 - Deciding whether to use REPMOVS at all instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).

The main drawback of REPMOVS is that it has a large startup latency (~20-40 cycles), so we clearly do not want to use it for smaller copies. Essentially, once we reach a size large enough for this latency to be amortized, REPMOVS is faster. So if we want to parameterize something, it's this latency. Unfortunately, it seems that the latency is not constant for a given microarchitecture and depends on runtime parameters.

In the "AlwaysInline" case (for struct copies), the current code uses a chain of MOVs for small sizes and switches to REPMOVSQ as the size increases to avoid generating a large amount of code. This reduction in size clearly comes at a large cost in performance: On Haswell, using a chain of MOVs results in a throughput of around 16B/cycle (powers of two copy faster because they use less instructions). Switching to REPMOVS brings throughput back to ~6 B/cycle (each invocation costs ~35 cycles of latency then copies at about 32B /cycle, so copying 260 bytes takes  35 + 260/32 = 43 cycles). This figure slowly grows back as size increases (e.g. back to ~9B/cycle when size=448B). Note that we could also generate a loop, which would most likely have intermediate performance in terms of both code size and throughput  (although it's not clear to me how to do it here technically).

Anyway, the decision criterion for the AlwaysInline case is code size, not performance. This patch just improves throughput in all cases by using the right instruction for the given microarchitecture. In a follow-up, I'll address the non-AlwaysInline case (memcpy(a, b, <constexpr>)), where we've seen large improvements on larger sizes (we're still working on the measurements).
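
For reference, here is a back-of-the-envelope model that reproduces the figures above. The 35-cycle startup, the 32 B/cycle REPMOVS stream rate and the 16 B/cycle MOV-chain rate are my rough Haswell numbers, not anything taken from the patch; the code is only an illustration of the arithmetic, not LLVM code.

  // Rough Haswell-only cost model (assumption, not LLVM code): REPMOVS pays a
  // fixed ~35-cycle startup and then streams at ~32 B/cycle; an inlined chain
  // of MOV/VMOVUPS sustains roughly 16 B/cycle with no startup cost.
  #include <cstdio>
  #include <initializer_list>

  static double repMovsCycles(double Bytes) { return 35.0 + Bytes / 32.0; }
  static double movChainCycles(double Bytes) { return Bytes / 16.0; }

  int main() {
    for (double Bytes : {64.0, 260.0, 448.0, 1024.0})
      std::printf("%6.0f B: REPMOVS %4.1f B/cycle, MOV chain %4.1f B/cycle\n",
                  Bytes, Bytes / repMovsCycles(Bytes),
                  Bytes / movChainCycles(Bytes));
    return 0;
  }

Under this model the inlined chain always wins on raw throughput in this size range, which is why switching to REPMOVS in the AlwaysInline path is purely a code-size decision.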


================
Comment at: lib/Target/X86/X86.td:508
   FeatureBMI2,
+  FeatureFastString,
   FeatureFMA,
----------------
RKSimon wrote:
> Is this a Haswell feature in particular or the only target that has been tested?
I've tested it on Haswell and Skylake. The Skylake model below actually uses HSWFeatures too, so I have not added it there again.


================
Comment at: lib/Target/X86/X86SelectionDAGInfo.cpp:232
   MVT AVT;
-  if (Align & 1)
+  if (Subtarget.hasFastString())
+    // If the target has fast strings, then it's at least as fast to use
----------------
RKSimon wrote:
> OptSize?
Do you mean we should also use REPMOVS instead of chains of copies when optimizing for size? We could, but remember that this comes at a large performance cost (see the comment above); I'm not sure how much we want to compromise for OptSize in general.
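
For context, here is a standalone sketch of the element-width decision as I understand it from the visible diff context and the existing alignment-based selection in X86SelectionDAGInfo.cpp. The function and enum names (pickRepMovsElement, ElemVT) are made up for the illustration; only the structure of the checks is meant to mirror the patch.

  // Standalone sketch (not the actual LLVM code): with fast strings we always
  // pick a byte element so REP MOVSB is emitted; otherwise the element width
  // is derived from the known alignment, as in the pre-existing code.
  #include <cstdint>

  enum class ElemVT { i8, i16, i32, i64 };

  ElemVT pickRepMovsElement(bool HasFastString, bool Is64Bit, uint64_t Align) {
    if (HasFastString)
      return ElemVT::i8;                        // Fast-string hardware path.
    if (Align & 1)
      return ElemVT::i8;
    if (Align & 2)
      return ElemVT::i16;
    if (Align & 4)
      return ElemVT::i32;                       // DWORD aligned.
    return Is64Bit ? ElemVT::i64 : ElemVT::i32; // QWORD aligned on x86-64.
  }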


https://reviews.llvm.org/D32002




