[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.

Chandler Carruth via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jul 21 17:47:01 PDT 2017


chandlerc created this revision.
Herald added subscribers: fhahn, mcrosier, sanjoy.

To understand the motivation of this patch, it is important to consider
that LLVM is remarkably diligent and effective at converting user loops
into memset and memcpy intrinsics. These frequently show up inside of
deeply nested loops, etc. However, when LLVM emits these as calls to
the actual memset and memcpy library functions, the cost of issuing
the call can in many cases far outstrip the cost of actually doing the
operation and can negatively impact surrounding code. Our analysis of
some benchmarks which hit this shows the cost comes from a few places:

1. Calling these library functions requires setting up registers for the calling convention, and in practice this forces a surprising number of register reloads when the calls occur inside of loops.
2. When using PIC, the call is much more expensive due to the PLT-call pattern, requiring at best a double indirect jump on Linux and BSD systems.

For older x86 processors this was unavoidable. But modern processors
provide very fast instruction pattern support for implementing these
library functions in many (if not quite all) cases. Starting with
Ivybridge, there seems to be no point in using the library functions
with well aligned buffers (alignment of 16-bytes or better), and even
starting with Nehalem, they seem superior to PLT library function calls.

It is also possible to carefully fold size scaling into these
sequences, which helps avoid generating extra scaling code when we are
in fact emitting code for user loops that were written at 4-byte or
8-byte granularity.

Naturally, this is a pretty significant change. I'm still running
benchmarks on various architectures to confirm that this direction makes
sense, but more insight from Intel and other x86 hardware experts would
be really welcome here to make sure we're picking reasonable tradeoffs.
I'm starting here with a very aggressive version of the patch so I can
find where it *does* regress, and we can back off until it looks
reasonable.

Given that Sandybridge is now over 4 years old and that Ivybridge or
newer processors are increasingly common in the world, I think it may
be reasonable to be somewhat aggressive in this lowering even if the
performance on older processors isn't ideal.

One interesting question is whether rep+movs{w,d,q} and rep+stos{w,d,q}
are as well tuned as rep+movsb and rep+stosb, i.e. whether the descaled
versions are actually reasonable to use. Craig has indicated they may
not be, and I'm hoping to confirm one way or the other when
benchmarking.

The test case added also exposes some annoying problems with codegen of
these instructions that should also be addressed.

Last but not least, in many cases the pattern that matches the scaling
here will not fire, because LLVM currently has a bad bug that causes it
to scale using a more complex pattern of math much more often than
necessary. I'm going to work on addressing that in a separate patch, as
it appears to be a middle-end issue.


https://reviews.llvm.org/D35750

Files:
  lib/Target/X86/X86.td
  lib/Target/X86/X86SelectionDAGInfo.cpp
  lib/Target/X86/X86Subtarget.cpp
  lib/Target/X86/X86Subtarget.h
  test/CodeGen/X86/mem_lowering.ll
  test/CodeGen/X86/memcpy-struct-by-value.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D35750.107761.patch
Type: text/x-patch
Size: 80501 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20170722/396d3c2d/attachment.bin>
