[PATCH] D69044: [X86] Allow up to 4 loads per inline memcmp()

David Zarzycki via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 18 08:47:51 PDT 2019


davezarzycki added a comment.

In D69044#1714314 <https://reviews.llvm.org/D69044#1714314>, @spatel wrote:

> Can we make the x86 change to combineVectorSizedSetCCEquality() independently and before the change to TargetLowering?
>
> Given that the previous attempt was reverted because of perf only, it would be good to show some perf data here in the proposal -- micro-benchmark or something more substantial. cc'ing @courbet in case there's already a test harness in place for that.
>
> I haven't looked at this in a while, so I wonder if we now have the infrastructure within memcmp expansion to create the partial vector code with 'ptest' shown here:
>  https://bugs.llvm.org/show_bug.cgi?id=33914


Hi @spatel,

Thanks for the feedback and yes we can separate the changes. A few thoughts:

1. The inlined memcmp is much smarter than the Glibc memcmp code these days, at least for pure equality comparisons. In particular, the compiler's overlapping load optimization is really nice (see D55263 <https://reviews.llvm.org/D55263>).
2. This change proposal intentionally sidesteps more complex memcmps where the return result is tristate (greater, lesser, or equal), not binary (equal versus not). The tristate memcmp is what regressed in PR33914.
3. It would be easy to contrive a microbenchmark that makes any libc memcmp look very bad and the inlined memcmp look very good. This would be fun, but not informative or actionable.
4. If I were to design a somewhat interesting and quasi-realistic micro-benchmark, I might create a carefully crafted test that hammers on a `llvm::StringSwitch` where the cases need more than two load pairs to be inlined.

All that being said, and if I may be totally honest, I'd like to observe two things:

1. The sizes of inlined memcmps tend to follow a log-normal distribution with a small mean/median/variance. In other words, the vast majority of inlined memcmps (especially on AVX or AVX512 CPUs) don't need more than two load pairs.
2. Conversely, the specialization enabled by a higher max load pair count matters most (if at all) on simpler CPUs with smaller vectors and fewer micro-architectural tricks to mask the cost of libc's dynamic dispatch.

Therefore, I would argue that the max load pair count should be derived, not fixed. For example, I think the following pseudo-code would yield reasonable results across the semi-recent history of Intel's product line: `2 * CACHELINE_SIZE / PREFERRED_VECTOR_SIZE`

I'd further argue that the compiler shouldn't assume that "max load pairs per block" being less than "max load pairs" is predictable by the branch predictor, but that's a separate discussion.

Your thoughts would be appreciated. Thanks!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69044/new/

https://reviews.llvm.org/D69044




