[PATCH] D32002: [X86] Improve large struct pass by value performance

Andrea Di Biagio via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 18 06:50:09 PDT 2017


andreadb added a comment.

> There are two sides to this flag:
>
> 1. Using REPMOVSB instead of REPMOVSQ: When this flag is true, then the code suggested in the PR is always more efficient regardless of the size.
> 2. Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).

I think the code comment should be improved. In particular, in this context, "fast" means that there is no advantage in moving data using the largest operand size possible, since MOVSB is expected to provide the best throughput.
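
For what it's worth, this is roughly how I read the intended semantics, written as a small standalone C++ sketch (not the actual SelectionDAG lowering; the HasFastRepMovsb flag and the helper name are just placeholders for illustration): the feature should only affect the element size chosen for REP MOVS, not whether REP MOVS is used at all.

  #include <cstdint>

  // Hypothetical element-size picker for a REP MOVS expansion. With a
  // "fast MOVSB"-style feature there is no benefit in widening the operand,
  // so byte moves are always fine; otherwise prefer the widest size that
  // divides the copy.
  enum class RepElt : unsigned { Byte = 1, Word = 2, Dword = 4, Qword = 8 };

  static RepElt pickRepMovsElt(uint64_t Size, bool HasFastRepMovsb,
                               bool Is64Bit) {
    if (HasFastRepMovsb)
      return RepElt::Byte;   // MOVSB already gives the best throughput.
    if (Is64Bit && Size % 8 == 0)
      return RepElt::Qword;  // Largest operand size possible.
    if (Size % 4 == 0)
      return RepElt::Dword;
    if (Size % 2 == 0)
      return RepElt::Word;
    return RepElt::Byte;
  }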

As a side note: the comment "See "REP String Enhancement" in the Intel Software Development Manual." seems to suggest that this new feature is Intel-specific.

Out of curiosity: do you plan to add similar changes to the memset expansion too? My understanding (from Craig's comment) is that your target also provides a fast STOSB, so you should be able to add similar logic in EmitTargetCodeForMemset().
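
To make the memset suggestion concrete, here is the same idea sketched for REP STOS (again a standalone sketch with a made-up HasFastRepStosb flag, not the real EmitTargetCodeForMemset signature); note that for the wider forms the byte value also has to be splatted across the stored element:

  #include <cstdint>

  // Splat the memset byte across the element used by REP STOSW/D/Q.
  static uint64_t splatByte(uint8_t V, unsigned EltBytes) {
    uint64_t Pattern = 0;
    for (unsigned I = 0; I != EltBytes; ++I)
      Pattern |= static_cast<uint64_t>(V) << (8 * I);
    return Pattern;              // e.g. V=0xAB, EltBytes=4 -> 0xABABABAB
  }

  static unsigned pickRepStosEltBytes(uint64_t Size, bool HasFastRepStosb,
                                      bool Is64Bit) {
    if (HasFastRepStosb)
      return 1;                  // REP STOSB: no need to widen the store.
    if (Is64Bit && Size % 8 == 0)
      return 8;                  // REP STOSQ
    if (Size % 4 == 0)
      return 4;                  // REP STOSD
    return (Size % 2 == 0) ? 2 : 1;  // REP STOSW / REP STOSB
  }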

@RKSimon, 
We don't want that feature for Btver2: there we always want to use the largest operand size for MOVS (see the sketch after the quoted text below). According to the AMD Family 15h optimization guide:

  Always move data using the largest operand size possible. For example, in 32-bit applications, use
  REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather
  than REP STOSW, and REP STOSW rather than REP STOSB.
  In 64-bit mode, a quadword data size is available and offers better performance (for example,
  REP MOVSQ and REP STOSQ).
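
In other words, for Btver2 we want to keep the current "widest element plus a small tail" strategy. A simplified standalone model of that plan (not the real lowering code):

  #include <cstdint>

  // Model of the "largest operand size" strategy recommended above: do the
  // bulk of the copy with REP MOVSQ (REP MOVSD in 32-bit mode) and handle
  // the few remaining bytes with ordinary moves afterwards.
  struct RepMovsPlan {
    uint64_t EltBytes;   // 8 for REP MOVSQ, 4 for REP MOVSD
    uint64_t EltCount;   // value loaded into RCX/ECX
    uint64_t TailBytes;  // 0-7 bytes copied with plain MOVs
  };

  static RepMovsPlan planWideRepMovs(uint64_t Size, bool Is64Bit) {
    uint64_t Elt = Is64Bit ? 8 : 4;
    return {Elt, Size / Elt, Size % Elt};
  }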



> The main drawback of REPMOVS is that it has a large start latency (~20-40 cycles), so we clearly do not want to use it for smaller copies. Essentially once we reach a size that's large enough for this latency to be amortized, REPMOVS is faster. So if we want to parameterize something, it's this latency. Unfortunately it seems that the latency is not constant for a microarchitecture and depends on runtime parameters.

On Btver2 there is a very high initialization cost for REP MOVS (in my experiments the overhead is around 40 cycles). I agree with @courbet: unfortunately, runtime parameters, alignment constraints, and cache effects heavily affect the performance of unrolled memcpy kernels. On Btver2 I remember that, for some large (IIRC up to 4KB) over-aligned data structures, a loop of vmov was still outperforming REP MOVS. So it is very difficult to compute a generally "good" break-even point.

> In the "AlwaysInline" case (for struct copies), the current code uses a chain of MOVs for small sizes and switches to REPMOVSQ as the size increases to avoid generating a large amount of code. This reduction in size clearly comes at a large cost in performance: On Haswell, using a chain of MOVs results in a throughput of around 16B/cycle (powers of two copy faster because they use fewer instructions). Switching to REPMOVS brings throughput down to ~6B/cycle (each invocation costs ~35 cycles of latency, then copies at about 32B/cycle, so copying 260 bytes takes 35 + 260/32 = 43 cycles). This figure slowly grows back as size increases (e.g. back to ~9B/cycle when size=448B). Note that we could also generate a loop, which would most likely have intermediate performance in terms of both code size and throughput (although it's not clear to me how to do it here technically).
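
Just to make the break-even point implied by those numbers explicit (illustrative only; as said above, the real crossover moves around a lot with alignment, cache state and microarchitecture, which is exactly the problem):

  #include <cstdint>

  // Toy cost model using the Haswell figures quoted above: an unrolled MOV
  // chain at ~16 B/cycle versus REP MOVS with ~35 cycles of startup and
  // ~32 B/cycle steady state.
  constexpr double unrolledCycles(double Bytes) { return Bytes / 16.0; }
  constexpr double repMovsCycles(double Bytes) { return 35.0 + Bytes / 32.0; }

  // Break-even: Bytes/16 == 35 + Bytes/32  =>  Bytes == 35 * 32 == 1120.
  constexpr double breakEvenBytes() { return 35.0 * 32.0; }

  static_assert(unrolledCycles(breakEvenBytes()) ==
                    repMovsCycles(breakEvenBytes()),
                "crossover of the two toy cost curves");

With this model the MOV chain wins up to roughly 1KB, which matches the 260-byte data point above; of course the chain cannot grow without bound because of code size, which is where a loop becomes attractive.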

I wonder if we could generate those loops in CodeGenPrepare. It should be easy to identify constant-sized memcpy/memset calls there and use a target hook to check whether it is profitable to expand them. That check would depend on the presence of your new target feature flag and, obviously, on the memcpy/memset size.
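
Roughly what I have in mind, as a standalone sketch (the profitability hook, the thresholds, and the loop shape are all hypothetical; the real transform would of course build IR rather than C++):

  #include <cstdint>

  // Hypothetical target hook: expand a constant-size memcpy into a loop
  // when the copy is too large for a flat MOV chain but still small enough
  // that the REP MOVS startup latency hurts. Thresholds are made up.
  static bool isProfitableToExpandMemcpyAsLoop(uint64_t Size,
                                               bool HasFastRepMovsb) {
    if (HasFastRepMovsb)
      return false;            // REP MOVSB is already good enough here.
    return Size > 128 && Size < 4096;
  }

  // Shape of the code CodeGenPrepare would emit in the profitable case:
  // a 32-bytes-per-iteration loop (one vmovups pair once vectorized)
  // plus a byte tail.
  static void loopCopy(char *Dst, const char *Src, uint64_t Size) {
    uint64_t I = 0;
    for (; I + 32 <= Size; I += 32)
      for (unsigned J = 0; J != 32; ++J)
        Dst[I + J] = Src[I + J];
    for (; I < Size; ++I)
      Dst[I] = Src[I];
  }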

-Andrea


https://reviews.llvm.org/D32002




