[PATCH] D69044: [X86] Allow up to 4 loads per inline memcmp()

David Zarzycki via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Oct 30 06:23:10 PDT 2019


davezarzycki added a comment.

Some testing results.

I built llvm+clang twice, both with core2 as the target CPU: once without this change and once with it. I inspected the 4-load-pair clang's assembly to verify that at least some memcmps generated three or more XMM load pairs; that said, more than two XMM load pairs was uncommon. I then ran `perf stat` against clang while it compiled X86ISelLowering.cpp (which takes about 37 seconds on my Xeon 8168 with turbo disabled).
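
To make the shape of the experiment concrete, here is a hypothetical call site of the kind this change affects (the function name and the 64-byte size are illustrative, not taken from the actual build):

```
#include <cstring>

// Hypothetical call site: with a 4-load-pair budget, a 64-byte
// equality compare (4 x 16 bytes) becomes eligible for inline
// expansion as four XMM load pairs instead of a call into libc.
bool same64(const void *a, const void *b) {
  return std::memcmp(a, b, 64) == 0;
}
```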

In terms of "wall clock" performance, allowing up to four load pairs is lost in the noise. (At best, there might be a 0.082% difference.) The 2-load-pair clang executed 0.027% more instructions than the 4-load-pair clang, and almost 0.03% more branches. Both of these seem reasonable given the dynamic overhead of libc's memcmp().

Separately, I started writing a microbenchmark that used `llvm::StringSwitch`, but it didn't feel right. Two (potentially overlapping) XMM load pairs can cover any length up to 32 bytes (see the sketch below), and that's big enough for the majority of real-world scenarios.
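
As a minimal sketch of that overlap trick using SSE2 intrinsics (the function name and the 24-byte size are hypothetical), the second load pair simply re-reads part of the buffer, so two pairs handle any length in (16, 32]:

```
#include <emmintrin.h>

// Sketch: a 24-byte equality compare with two overlapping XMM load
// pairs. The second pair reads bytes 8..23, overlapping the first
// pair (bytes 0..15) by 8 bytes.
bool equal24(const char *a, const char *b) {
  __m128i a0 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
  __m128i b0 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
  __m128i a1 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + 8));
  __m128i b1 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + 8));
  // OR the XOR differences together; all-zero means the buffers match.
  __m128i diff = _mm_or_si128(_mm_xor_si128(a0, b0), _mm_xor_si128(a1, b1));
  return _mm_movemask_epi8(_mm_cmpeq_epi8(diff, _mm_setzero_si128())) == 0xFFFF;
}
```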

Overall, I've changed my mind about this proposal. I think the time and place for 4 (or more) load pairs was the pre-vector (and therefore pre-64-bit) era, where going from 2 scalar load pairs to 4 was a bigger win because the individual loads were so small.
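
For instance, here is a rough sketch of the scalar-era math (illustrative only; `memcpy` stands in for an unaligned load): on a 32-bit-only target, a 16-byte equality compare already consumes four 4-byte load pairs, so a 2-pair budget caps inline expansion at just 8 bytes.

```
#include <cstdint>
#include <cstring>

// Sketch: four 4-byte scalar load pairs cover a 16-byte equality
// compare on a pre-vector, 32-bit target. With only a 2-pair budget,
// anything past 8 bytes would fall back to a libc memcmp() call.
bool same16_scalar(const char *a, const char *b) {
  std::uint32_t acc = 0;
  for (int i = 0; i < 16; i += 4) {      // one load pair per iteration
    std::uint32_t wa, wb;
    std::memcpy(&wa, a + i, sizeof wa);  // memcpy models an unaligned load
    std::memcpy(&wb, b + i, sizeof wb);
    acc |= wa ^ wb;
  }
  return acc == 0;
}
```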

I suppose we could enable four load pairs on pre-SSE machines if people care. Otherwise, and unless there are objections, I'll close this proposal in a few days.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69044/new/

https://reviews.llvm.org/D69044
