[PATCH] D111029: [X86] Prefer 512-bit vectors on Ice/Rocket/TigerLake (PR48336)

Tue Oct 12 13:34:19 PDT 2021

pcordes added a comment.

Also forgot to mention: 64-byte vectors are more sensitive to alignment, even when data isn't hot in L1d cache.  For example, loops over data coming from DRAM or maybe L3 are about 15% to 20% slower with misaligned loads IIRC, vs. only a couple % for AVX2.  At least this was the case on Skylake-SP; IDK about client chips with AVX-512.

So the usual optimistic strategy of using unaligned loads but not spending any extra instructions to reach an alignment boundary might not be the best choice for some loops with 512-bit vectors.

Going scalar until an alignment boundary is pretty terrible, especially for "vertical" operations like `a[i] *= 3.0`, where it's ok to process the same element twice as long as any reads happen before any potentially overlapping stores.  Overlapping vectors can get to the boundary instead, e.g.

- load a first (possibly unaligned) vector
- round the pointer up to the next alignment boundary with `add reg, 64` / `and reg, -64`
- load the first aligned loop vector (peeled from the first iteration)
- store the first (unaligned) vector
- enter a loop that ends on a pointer-compare condition
- cleanup that starts with the final aligned vector loaded and processed but not stored yet
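
A rough sketch of the steps above in portable C, not a definitive implementation.  Everything named here is hypothetical: `vec` is a plain-struct stand-in for one 512-bit register, and `vload` / `vstore` / `vmul3` stand in for the vector load, store and multiply.  One simplification versus the list: instead of a software-pipelined cleanup, the final (possibly unaligned) vector is loaded up front, before any store can overlap it, and stored last.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VL = 16 };                       /* floats per 64-byte vector */
typedef struct { float lane[VL]; } vec; /* hypothetical stand-in for one 512-bit register */

static vec vload(const float *p) { vec v; memcpy(v.lane, p, sizeof v.lane); return v; }
static void vstore(float *p, vec v) { memcpy(p, v.lane, sizeof v.lane); }
static vec vmul3(vec v) { for (int i = 0; i < VL; i++) v.lane[i] *= 3.0f; return v; }

/* a[i] *= 3.0f using overlapping vectors instead of a scalar prologue/epilogue. */
void scale3_overlap(float *a, size_t n)
{
    if (n < VL) {                       /* shorter than one vector: plain scalar in this sketch */
        for (size_t i = 0; i < n; i++) a[i] *= 3.0f;
        return;
    }
    float *last = a + (n - VL);         /* start of the final, possibly unaligned, vector */
    vec first = vload(a);               /* load a first (unaligned) vector */
    vec tail  = vload(last);            /* load the final vector before any overlapping store */

    /* Round up to the next 64-byte boundary: add 64, then and -64. */
    float *p = (float *)(((uintptr_t)a + 64) & ~(uintptr_t)63);

    if (p <= last) {
        vec v = vload(p);               /* peeled first-iteration load, done before the overlapping store */
        vstore(a, vmul3(first));        /* store the first (unaligned) vector */
        vstore(p, vmul3(v));
        for (p += VL; p <= last; p += VL)   /* aligned loop, ends on a pointer compare */
            vstore(p, vmul3(vload(p)));
    } else {
        vstore(a, vmul3(first));        /* no full aligned vector fits; first + tail cover everything */
    }
    vstore(last, vmul3(tail));          /* final vector, may overlap the last aligned store */
}
```

Elements in the overlap regions are computed twice from the same original input, so both stores write the same value.  That only works because every load of an overlapped region happens before any store to it.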

If the array was already aligned, there's no overlap.  For short arrays, AVX-512 masking can be used to avoid reading or writing past the end, generating masks on the fly with shlx or shrx.
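
For the short-array case, the mask generation might look something like this sketch.  The function names are made up for illustration, and the masked operation is emulated with a scalar loop so it runs without AVX-512; with real intrinsics the mask would go to something like `_mm512_maskz_loadu_ps` and `_mm512_mask_storeu_ps`.  The variable-count shifts of an all-ones constant are the part a compiler could do with shrx / shlx.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask with the low n bits set (n in 0..16), one bit per float lane.
   A variable-count right shift of all-ones, i.e. the shrx pattern. */
uint16_t low_lanes_mask(unsigned n)
{
    return (uint16_t)(0xFFFFu >> (16u - n));
}

/* Mask with lanes n..15 set (n in 0..16): the shlx pattern, e.g. for
   skipping lanes that sit before the start of the array. */
uint16_t high_lanes_mask(unsigned n)
{
    return (uint16_t)(0xFFFFu << n);
}

/* a[i] *= 3.0f for n <= 16 elements.  Emulates a masked load/store pair:
   lanes whose mask bit is clear never touch memory. */
void scale3_short(float *a, size_t n)
{
    uint16_t k = low_lanes_mask((unsigned)n);
    for (unsigned lane = 0; lane < 16; lane++)
        if ((k >> lane) & 1u)
            a[lane] *= 3.0f;
}
```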

Anyway, this is obviously much better than going scalar until an alignment boundary, in loops where we can sort out aliasing sufficiently, and where there's only one pointer to worry about so relative misalignment isn't a factor.  In many non-reductions, there are at least two pointers, so it may not be possible to align both.

An efficient alignment strategy like this might help make vector width = 512 worth it for more code which doesn't take care to align its arrays.  Clearly that should be a separate feature-request / proposal if there isn't one open for that already; IDK how hard it would be to teach LLVM (or GCC) that an overlapping-vectors strategy can be good, or if it's just something that nobody's pointed out before.

Vector ISAs like ARM SVE and, I think, RISC-V's planned one have good HW support for generating masks from pointers and stuff like that, but it can be done manually, especially in AVX-512 with mask registers.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111029/new/

https://reviews.llvm.org/D111029