[PATCH] D69044: [X86] Allow up to 4 loads per inline memcmp()

Sanjay Patel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 18 10:58:05 PDT 2019


spatel added a comment.

In D69044#1714609 <https://reviews.llvm.org/D69044#1714609>, @davezarzycki wrote:

> In D69044#1714314 <https://reviews.llvm.org/D69044#1714314>, @spatel wrote:
>
> > > Can we make the x86 change to combineVectorSizedSetCCEquality() independently of, and before, the change to TargetLowering?
> >
> > > Given that the previous attempt was reverted because of perf, it would be good to show some perf data here in the proposal, whether from a micro-benchmark or something more substantial. cc'ing @courbet in case there's already a test harness in place for that.
> >
> > I haven't looked at this in a while, so I wonder if we now have the infrastructure within memcmp expansion to create the partial vector code with 'ptest' shown here:
> >  https://bugs.llvm.org/show_bug.cgi?id=33914
>
>
> Hi @spatel,
>
> Thanks for the feedback and yes we can separate the changes. A few thoughts:
>
> 1. The inlined memcmp is much smarter than the glibc memcmp code these days, at least for pure equality comparisons. In particular, the compiler's overlapping load optimization is really nice (see D55263 <https://reviews.llvm.org/D55263>).
> 2. This change proposal intentionally sidesteps the more complex memcmps where the return result is tristate (less, equal, or greater) rather than binary (equal versus not equal). The tristate memcmp is what regressed in PR33914. (A scalar sketch of the equality-only expansion appears after the step list at the end of this reply.)
> 3. It would be easy to contrive a microbenchmark that makes any libc memcmp look very bad and the inlined memcmp look very good. This would be fun, but not informative or actionable.
> 4. If I were to design a somewhat interesting and quasi-realistic micro-benchmark, I might create a carefully crafted test that hammers on an `llvm::StringSwitch` where the cases need more than two load pairs to be inlined (a hypothetical kernel along these lines is sketched just after this quote).
>
>   This all being said, and if I might be totally honest, I'd like to observe two things:
> 5. The sizes of inlined memcmps tend to follow a log-normal distribution with a small mean/median/variance. In other words, the vast majority of inlined memcmps (especially on AVX or AVX512 CPUs) don't need more than two load pairs.
> 6. The specialization that a higher max load pair count allows matters more (if at all) on simpler CPUs with smaller vectors and fewer micro-architectural tricks to mask the cost of libc's dynamic dispatch.
>
>   Therefore, I would argue that the max load pair count should be derived, not fixed. For example, I think the following pseudo-code would yield reasonable results across the semi-recent history of Intel's product line: `2 * CACHELINE_SIZE / PREFERRED_VECTOR_SIZE`
>
>   I'd further argue that the compiler shouldn't assume that the early-exit branches created when "max load pairs per block" is less than "max load pairs" will be well predicted by the branch predictor, but that's a separate discussion.
>
>   Your thoughts would be appreciated. Thanks!
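
To make point 4 above concrete: a hypothetical benchmark kernel along those lines might look like the sketch below (the function name and case strings are invented for illustration; only the case lengths matter).

```cpp
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSwitch.h"

// Hypothetical micro-benchmark kernel: each 33-byte case string cannot be
// covered by two 16-byte loads, so with SSE-sized vectors the equality
// test for each case needs at least three load pairs.
int classify(llvm::StringRef Name) {
  return llvm::StringSwitch<int>(Name)
      .Case("this_is_a_thirty_three_byte_case!", 1)
      .Case("another_thirty_three_byte_case!!!", 2)
      .Default(0);
}
```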


I agree that a derived setting would be better than hard-coding. Exactly what that formula should be, I don't know...
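For reference, here is what the quoted formula works out to on a few common configurations (a sketch: the 64-byte cache line and the byte units are my assumptions about how it would be instantiated). Note that it agrees with point 5 above: on AVX-512 the derived limit is already two load pairs.

```cpp
// Sketch of the quoted derivation, all sizes in bytes:
//   maxLoadPairs = 2 * CACHELINE_SIZE / PREFERRED_VECTOR_SIZE
// Assuming 64-byte cache lines:
//   SSE     (16-byte vectors): 2 * 64 / 16 = 8 load pairs
//   AVX2    (32-byte vectors): 2 * 64 / 32 = 4 load pairs
//   AVX-512 (64-byte vectors): 2 * 64 / 64 = 2 load pairs
unsigned derivedMaxLoadPairs(unsigned CachelineBytes, unsigned VectorBytes) {
  return 2 * CachelineBytes / VectorBytes;
}
```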
Lazy question (can't tell from the test diffs): are we ignoring -mprefer-vector-width in these expansions? If so, we're almost certainly going to create fallout.
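If we do need to honor it, the fix could be as simple as clamping the expansion's widest load. A minimal sketch, assuming a helper along these lines (the name and the 0-means-unset convention are hypothetical, not the actual TargetLowering API):

```cpp
#include <algorithm>

// Hypothetical clamp: never let memcmp expansion use loads wider than the
// function's preferred vector width, so -mprefer-vector-width=128 isn't
// silently widened into 256- or 512-bit loads.
unsigned clampExpansionLoadBits(unsigned LegalVectorBits,
                                unsigned PreferredVectorBits) {
  if (PreferredVectorBits == 0)  // assumed convention: 0 = no preference
    return LegalVectorBits;
  return std::min(LegalVectorBits, PreferredVectorBits);
}
```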
There's a lot to unravel for memcmp, so let's break this down into possible patches/steps:

1. Add a bunch of x86 tests (all of the new tests here can be committed with baseline codegen, so we'll just see diffs as we enable more expansions). Include some -prefer-vector-width override RUNs.
2. Enhance the x86 lowering.
3. Change/adjust the MaxLoads settings (probably fine to start with a simple tweak of the hardcoded value, as shown here).
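
As a reference point for steps 1 and 2, here is a scalar sketch of the equality-only expansion discussed above (my illustration, not the actual expansion code). It shows both the overlapping-load trick from D55263 and why the binary form is cheaper than the tristate form: the XOR results can simply be OR-reduced, with no byte-swapping or ordering logic.

```cpp
#include <cstdint>
#include <cstring>

// Sketch: an expanded form of `memcmp(a, b, 7) == 0`. Two overlapping
// 4-byte loads cover all 7 bytes; a tristate memcmp would additionally
// need byte swaps (on little-endian x86) to decide less/greater.
bool equal7(const char *a, const char *b) {
  uint32_t a0, b0, a1, b1;
  std::memcpy(&a0, a, 4);      // bytes [0, 4)
  std::memcpy(&b0, b, 4);
  std::memcpy(&a1, a + 3, 4);  // bytes [3, 7), overlapping byte 3
  std::memcpy(&b1, b + 3, 4);
  return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}
```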


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69044/new/

https://reviews.llvm.org/D69044




