[llvm-bugs] [Bug 52348] New: Overlap/predicate vectorization loops/splits to reduce unaligned memory access

Fri Oct 29 05:41:16 PDT 2021

https://bugs.llvm.org/show_bug.cgi?id=52348

            Bug ID: 52348
           Summary: Overlap/predicate vectorization loops/splits to reduce
                    unaligned memory access
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: llvm-dev at redking.me.uk
                CC: a.bataev at hotmail.com, florian_hahn at apple.com,
                    lebedev.ri at gmail.com, llvm-bugs at lists.llvm.org,
                    pengfei.wang at intel.com, peter at cordes.ca,
                    spatel+llvm at rotateright.com

Pulled out of https://reviews.llvm.org/D111029 where we were discussing full
512-bit vectorization on x86 CPUs that should benefit from it, in particular
when predicate instructions are available:

-------

Also forgot to mention, 64-byte vectors are more sensitive to alignment, even
when data isn't hot in L1d cache. e.g. loops over data coming from DRAM or
maybe L3 are about 15% to 20% slower with misaligned loads IIRC, vs. only a
couple % for AVX2. At least this was the case on Skylake-SP; IDK about client
chips with AVX-512.

So the usual optimistic strategy of using unaligned loads but not spending any
extra instructions to reach an alignment boundary might not be the best choice
for some loops with 512-bit vectors.

Going scalar until an alignment boundary is pretty terrible, especially for
"vertical" operations like a[i] *= 3.0, where it's ok to process the same
element twice as long as all reads happen before any potentially overlapping
stores. e.g.

    load a first vector
    round the pointer up to the next alignment boundary with
      add reg, 64 / and reg, -64
    load the first-iteration loop vector (peeled from first iteration)
    store the first (unaligned) vector
    enter a loop that ends on a pointer-compare condition.
    cleanup that starts with the final aligned vector loaded and processed
      but not stored yet
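
As a rough illustration of the overlapping idea, here is a minimal sketch in portable C. It is not what LLVM emits and not the exact instruction order listed above: for simplicity it loads both the unaligned head and the unaligned tail up front and defers their stores until after the aligned loop, instead of peeling and software-pipelining. The `vec16f` type and the `vec_load` / `vec_store` / `vec_mul3` / `scale_by_3` names are hypothetical stand-ins for a 512-bit register and its operations, and the sketch assumes a single pointer with no aliasing concerns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16, VEC_BYTES = 64 };  /* one 512-bit vector of floats */

/* Hypothetical stand-in for a 512-bit register; not a real intrinsic type. */
typedef struct { float lane[LANES]; } vec16f;

static vec16f vec_load(const float *p) {
    vec16f v;
    memcpy(v.lane, p, sizeof v.lane);
    return v;
}

static void vec_store(float *p, vec16f v) {
    memcpy(p, v.lane, sizeof v.lane);
}

static vec16f vec_mul3(vec16f v) {
    for (int i = 0; i < LANES; i++) v.lane[i] *= 3.0f;
    return v;
}

/* a[i] *= 3.0f for i in [0, n), using overlapping head/tail vectors. */
void scale_by_3(float *a, size_t n) {
    assert((uintptr_t)a % sizeof(float) == 0);

    if (n < LANES) {  /* short array: a masked vector op would go here */
        for (size_t i = 0; i < n; i++) a[i] *= 3.0f;
        return;
    }

    float *end = a + n;

    /* Read the unaligned head and tail before any store can touch them. */
    vec16f head = vec_mul3(vec_load(a));
    vec16f tail = vec_mul3(vec_load(end - LANES));

    /* Round a up and end down to 64-byte boundaries. */
    float *p = (float *)(((uintptr_t)a + VEC_BYTES - 1)
                         & ~(uintptr_t)(VEC_BYTES - 1));
    float *p_end = (float *)((uintptr_t)end & ~(uintptr_t)(VEC_BYTES - 1));

    /* Aligned main loop, ending on a pointer compare. */
    for (; p < p_end; p += LANES)
        vec_store(p, vec_mul3(vec_load(p)));

    /* Overlapping stores: lanes shared with the aligned loop receive the
       same value again, because head and tail came from original data. */
    vec_store(a, head);
    vec_store(end - LANES, tail);
}
```

If the array is already 64-byte aligned and n is a multiple of 16, the head and tail stores simply rewrite the first and last aligned vectors with identical values.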

If the array already was aligned, there's no overlap. For short arrays, AVX-512
masking can be used to avoid reading or writing past the end, generating masks
on the fly with shlx or shrx.
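
To make the mask generation concrete, here is a small sketch in portable C of building a 16-lane mask from an element count with a single variable shift (the kind of operation shrx performs). The helper names `low_lanes_mask`, `scale_by_3_masked` and `scale_by_3_short` are hypothetical, and the masked operation is emulated with a scalar loop; with real AVX-512 the mask would instead be handed to something like `_mm512_maskz_loadu_ps` / `_mm512_mask_storeu_ps`.

```c
#include <stddef.h>
#include <stdint.h>

/* Low n bits set, valid for 0 <= n <= 16: all-ones shifted right. */
static uint16_t low_lanes_mask(unsigned n) {
    return (uint16_t)(0xFFFFu >> (16u - n));
}

/* Scalar emulation of a masked load / multiply / masked store:
   lanes whose mask bit is clear are neither read nor written. */
static void scale_by_3_masked(float *a, uint16_t mask) {
    for (unsigned lane = 0; lane < 16; lane++)
        if (mask & (1u << lane))
            a[lane] *= 3.0f;
}

/* Handle an array shorter than one vector (assumes n <= 16). */
void scale_by_3_short(float *a, size_t n) {
    scale_by_3_masked(a, low_lanes_mask((unsigned)n));
}
```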

Anyway, this is obviously much better than going scalar until an alignment
boundary, in loops where we can sort out aliasing sufficiently, and where
there's only one pointer to worry about so relative misalignment isn't a
factor. In many non-reductions, there are at least two pointers, so it may not
be possible to align both.
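
One way to picture relative misalignment, as a hedged sketch: two pointers that advance by the same number of bytes per iteration can both reach a 64-byte boundary at the same time only if they sit at the same offset within a 64-byte line. The helper name `same_alignment_phase` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if p and q have the same offset modulo 64 bytes, so aligning one
   of them (by skipping the same byte count on both) aligns the other. */
static bool same_alignment_phase(const void *p, const void *q) {
    return (((uintptr_t)p ^ (uintptr_t)q) & 63u) == 0;
}
```

When this returns false, a loop can still align one stream (typically a matter of choice between the loads and the stores) and leave the other misaligned.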

An efficient alignment strategy like this might help make vector width = 512
worth it for more code which doesn't take care to align its arrays. Clearly
that should be a separate feature-request / proposal if there isn't one open
for that already; IDK how hard it would be to teach LLVM (or GCC) that an
overlapping vectors strategy can be good, or if it's just something that
nobody's pointed out before.

Vector ISAs like ARM SVE and I think RISC-V's planned one have good HW support
for generating masks from pointers and stuff like that, but it can also be done
manually, especially with AVX-512 mask registers.
