[PATCH] D36059: [memops] Add a new pass to inject fast-path code for specific library function calls.

Wed Aug 2 03:52:53 PDT 2017

chandlerc marked an inline comment as done.
chandlerc added a comment.

I have size data now.

Across the test suite + SPEC the total size increase with this patch is under 1%. Looking at the benchmarks which exhibit the largest size growth, most grow by a few hundred bytes or less; they just happen to be *tiny* benchmarks.

The most interesting growth I see is:

473.astar - 5% growth, but this is still under 2k in absolute terms; the benchmark is quite small.
mafft/pairlocalalign - 3.3% (+14k)
447.dealII - 2.1% (+50k)

Everything else is small either in percent, absolute size, or both.

Across our internal benchmarks, I see no regressions with this patch, but I do see some benchmarks with 30% and 40% improvements (no, those numbers aren't mistakes). The pattern I am seeing is that when this matters, it *MATTERS*. But most of the time, the libcall is fast enough. This still seems very worthwhile to me, as the code patterns that end up impacted by this seem eminently reasonable.

So generally, I think this is a pretty clear net win, it is fairly isolated, and the code size cost seems very low. Any concerns with moving forward here?

In https://reviews.llvm.org/D36059#827088, @aemerson wrote:

> Instead of generating loop IR for the fast path, how about creating a versioned memcpy/memset with the constrained parameters guarded under the condition test? That way, in the back-end the exact preferred optimal code can be generated, allowing for unrolled loop bodies specific to individual targets.

IMO, there is no need for doing this in this place. If we're just leaving a marker here for the target to expand, we don't need to do anything. We already get a chance to custom expand the libcall in the target. Adding the versioning doesn't make that any simpler given that it still needs to introduce a loop. If, for a particular target, it is worth emitting a versioned, carefully target-crafted loop or instruction sequence, I would expect them to not use this pass but to custom lower the calls in the backend much like x86 does for constant-size calls.

At least for x86 on Linux, I have no cases where something more complex than this trivial loop is a win compared to calling the library function.
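For readers who want a picture of the shape being discussed, here is a rough source-level sketch of a guarded fast path: a trivial loop for small sizes, with the library call as the fallback. It is an illustration only and not the IR the pass emits. The function name `copy_with_fast_path`, the constant `kFastPathMaxBytes`, its value of 16, and the byte-wise loop are all assumptions made for this example.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical cutoff for the fast path. The condition the patch actually
// tests is not stated in this message; 16 is picked only for illustration.
constexpr std::size_t kFastPathMaxBytes = 16;

// Copies n bytes from src to dst (regions must not overlap) and returns dst.
// Small sizes are handled by a trivial inline loop; everything else goes to
// the library function.
void *copy_with_fast_path(void *dst, const void *src, std::size_t n) {
  if (n <= kFastPathMaxBytes) {
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    for (std::size_t i = 0; i < n; ++i)
      d[i] = s[i];
    return dst;
  }
  // Slow path: the ordinary libcall.
  return std::memcpy(dst, src, n);
}
```

The extra compare, branch, and loop at each call site are where the code size growth comes from, and skipping the call overhead for small sizes is where a speedup could come from.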

https://reviews.llvm.org/D36059