<body><table border="1" cellspacing="0" cellpadding="8">
title="NEW - Overlap/predicate vectorization loops/splits to reduce unaligned memory access"
<td>Overlap/predicate vectorization loops/splits to reduce unaligned memory access
<td>firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
<pre>Pulled out of <a href="https://reviews.llvm.org/D111029">https://reviews.llvm.org/D111029</a> where we were discussing full
512-bit vectorization on x86 CPUs that should benefit from it, in particular
when predicate instructions are available:
Also forgot to mention, 64-byte vectors are more sensitive to alignment, even
when data isn't hot in L1d cache. e.g. loops over data coming from DRAM or
maybe L3 are about 15% to 20% slower with misaligned loads IIRC, vs. only a
couple % for AVX2. At least this was the case on Skylake-SP; IDK about client
chips with AVX-512.
So the usual optimistic strategy of using unaligned loads but not spending any
extra instructions to reach an alignment boundary might not be the best choice
for some loops with 512-bit vectors.
Going scalar until an alignment boundary is pretty terrible, especially for
"vertical" operations like a[i] *= 3.0 where it's ok to process the same
element twice, as long as any reads are before any potentially overlapping
stores. In that case an overlapping unaligned first vector can replace the
scalar prologue, e.g.
- load a first vector
- round the pointer up to the next alignment boundary with add reg, 64 /
  and reg, -64
- load the first-iteration loop vector (peeled from first iteration)
- store the first (unaligned) vector
- enter a loop that ends on a pointer-compare condition.
- cleanup that starts with the final aligned vector loaded and processed but
  not stored yet
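
As a rough illustration of the steps above (not code from the review), here is
a portable C sketch in which a 16-float struct stands in for one 512-bit
register. The names vec_t, load_times3, store_vec and scale3_overlap are made
up for this example. It assumes the array is at least float-aligned and that
unsigned long can hold a pointer (true on LP64 targets). The scalar fallback
for fewer than 16 elements is only a stand-in for a masked path.

```c
/* Sketch only: 16 floats (64 bytes) stand in for one 512-bit register. */
enum { VEC = 16 };
typedef struct { float lane[VEC]; } vec_t;

/* Load one vector and apply the "vertical" operation to it. */
static vec_t load_times3(const float *p) {
    vec_t v;
    for (int i = 0; i != VEC; i++) v.lane[i] = p[i] * 3.0f;
    return v;
}

static void store_vec(float *p, vec_t v) {
    for (int i = 0; i != VEC; i++) p[i] = v.lane[i];
}

/* a[i] *= 3.0f for i in [0, n), using overlapping vectors at both ends. */
void scale3_overlap(float *a, unsigned long n) {
    if (VEC > n) {                      /* stand-in for the masked path */
        for (unsigned long i = 0; i != n; i++) a[i] *= 3.0f;
        return;
    }
    float *last = a + (n - VEC);        /* final, possibly unaligned, vector */
    vec_t first = load_times3(a);       /* load a first (unaligned) vector */

    if (2 * VEC > n) {                  /* first and last overlap directly */
        vec_t tail = load_times3(last); /* read before any store */
        store_vec(a, first);
        store_vec(last, tail);
        return;
    }

    /* Round up to the next 64-byte boundary: add 64 / and -64.
       Assumes unsigned long can hold a pointer (true on LP64). */
    unsigned long addr = (unsigned long)a;
    float *p = a + (((addr + 64) & ~63UL) - addr) / sizeof(float);

    vec_t cur = load_times3(p);         /* peeled first aligned load */
    store_vec(a, first);                /* now safe to store the first vector */

    while (last >= p + VEC) {           /* pointer-compare loop condition */
        store_vec(p, cur);
        p += VEC;
        cur = load_times3(p);
    }

    /* Cleanup: cur is loaded and processed but not stored yet. */
    vec_t tail = load_times3(last);     /* may overlap cur, so load first */
    store_vec(p, cur);
    store_vec(last, tail);
}
```

Every load that can overlap a store is issued before that store, so an element
covered by two vectors gets the same tripled value written twice rather than
being multiplied twice.
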
If the array already was aligned, there's no overlap. For short arrays, AVX-512
masking can be used to avoid reading or writing past the end, generating masks
on the fly with shlx or shrx.
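
A minimal sketch of the on-the-fly mask idea for a short array, in plain C.
The names tail_mask16 and scale3_masked are hypothetical. The loop is a scalar
emulation of what masked AVX-512 accesses such as _mm512_maskz_loadu_ps and
_mm512_mask_storeu_ps would do, not actual vector code.

```c
/* Low n of 16 lanes set, n in 0..16: all-ones shifted right, which is
   the computation a shrx on a register holding 0xFFFF would perform. */
unsigned tail_mask16(unsigned n) {
    return 0xFFFFu >> (16u - n);
}

/* Scalar emulation of a masked load / multiply / masked store:
   lanes whose mask bit is 0 are neither read nor written. */
void scale3_masked(float *a, unsigned n) {
    unsigned k = tail_mask16(n);
    for (unsigned i = 0; i != 16u; i++)
        if ((k >> i) & 1u) a[i] *= 3.0f;
}
```

The equivalent left-shift form is the complement of all-ones shifted left by
n, which is where shlx would come in.
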
Anyway, this is obviously much better than going scalar until an alignment
boundary, in loops where we can sort out aliasing sufficiently, and where
there's only one pointer to worry about so relative misalignment isn't a
factor. In many non-reductions, there are at least two pointers, so it may not
be possible to align both.
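
To make "relative misalignment" concrete: if two pointers advance by the same
number of bytes per iteration, both can reach a 64-byte boundary on the same
iteration only when their addresses are congruent modulo 64. A small sketch
of that check follows. The name same_misalignment is made up here, and the
cast assumes unsigned long can hold a pointer (true on LP64 targets).

```c
/* Nonzero when a and b sit at the same offset within a 64-byte block, so
   the iteration that brings one to a boundary also aligns the other.
   Assumes unsigned long can hold a pointer (true on LP64). */
int same_misalignment(const void *a, const void *b) {
    return (unsigned long)a % 64 == (unsigned long)b % 64;
}
```

When this returns 0, only one of the two streams can use aligned accesses and
the other stays unaligned whatever prologue is chosen.
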
An efficient alignment strategy like this might help make vector width = 512
worth it for more code which doesn't take care to align its arrays. Clearly
that should be a separate feature-request / proposal if there isn't one open
for that already; IDK how hard it would be to teach LLVM (or GCC) that an
overlapping vectors strategy can be good, or if it's just something that
nobody's pointed out before.
Vector ISAs like ARM SVE and I think RISC-V's planned one have good HW support
for generating masks from pointers and stuff like that, but it can be done
manually especially in AVX-512 with mask registers.</pre>