[PATCH] D32002: [X86] Improve large struct pass by value performance

Wed Apr 19 00:49:43 PDT 2017

courbet added a comment.

In https://reviews.llvm.org/D32002#729148, @andreadb wrote:

> > There are two sides to this flag:
> > 
> >   Using REPMOVSB instead of REPMOVSQ: When this flag is true, then the code suggested in the PR is always more efficient regardless of the size.
> >   Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).
>
> I think the code comment should be improved. In particular, in this context, "fast" means that there is no advantage in moving data using the largest operand size possible, since MOVSB is expected to provide the best throughput.

I've added comments on the flag definition.

> As a side note:
>  Comment: "See "REP String Enhancement" in the Intel Software Development Manual." seems to suggest that this new feature is Intel specific.

I think so, I've only seen it mentioned in their manual, and I have not tried on AMD.

> Out of curiosity: do you plan to add similar changes to the memset expansion too? My understanding (from craig's comment) is that your target also provides a fast STOSB. So, you should be able to add similar logic in EmitTargetCopyForMemset().

Yes, but I'd like to keep it separate because there's a bit more to be done there.

> I wonder if we could generate those loops in CodeGenPrepare. It should be easy to identify constant sized memcpy/memset calls in CodeGenPrepare, and use a target hook to check if it is profitable to expand memory calls or not.

Thanks for the pointer, looks like what I was looking for.

================
Comment at: lib/Target/X86/X86.td:508
   FeatureBMI2,
+  FeatureFastString,
   FeatureFMA,
----------------
zvi wrote:
> courbet wrote:
> > RKSimon wrote:
> > > Is this a Haswell feature in particular or the only target that has been tested?
> > I've tested it on Haswell and Skylake. The Skylake model below actually uses HSWFeatures too, so I have not added it there again.
> The Optimization Guide section @craig.topper  quoted above states that this feature is available starting from Ivy Bridge.
Unfortunately I don't have an IvyBridge to measure it. Do we want to blindly trust the manual ? :)

https://reviews.llvm.org/D32002