<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link bz_status_NEW" title="NEW - Overlap/predicate vectorization loops/splits to reduce unaligned memory access" href="https://bugs.llvm.org/show_bug.cgi?id=52348">52348</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Overlap/predicate vectorization loops/splits to reduce unaligned memory access

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Loop Optimizer

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>llvm-dev@redking.me.uk

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>a.bataev@hotmail.com, florian_hahn@apple.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, pengfei.wang@intel.com, peter@cordes.ca, spatel+llvm@rotateright.com

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Pulled out of <a href="https://reviews.llvm.org/D111029">https://reviews.llvm.org/D111029</a> where we were discussing full

512-bit vectorization on x86 CPUs that should benefit from it, in particular

when predicate instructions are available:

-------

Also forgot to mention, 64-byte vectors are more sensitive to alignment, even

when data isn't hot in L1d cache. e.g. loops over data coming from DRAM or

maybe L3 are about 15% to 20% slower with misaligned loads IIRC, vs. only a

couple % for AVX2. At least this was the case on Skylake-SP; IDK about client

chips with AVX-512.

So the usual optimistic strategy of using unaligned loads but not spending any

extra instructions to reach an alignment boundary might not be the best choice

for some loops with 512-bit vectors.

Going scalar until an alignment boundary is pretty terrible, especially for

"vertical" operations like a[i] *= 3.0 or something, where it's ok to process

the same element twice as long as any reads are before any potentially

overlapping stores. e.g.

    load a first vector

    round the pointer up to the next alignment boundary with add reg, 64 / and reg, -64

    load the first-iteration loop vector (peeled from first iteration)

    store the first (unaligned) vector

    enter a loop that ends on a pointer-compare condition.

    cleanup that starts with the final aligned vector loaded and processed but not stored yet
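
A minimal scalar model of the steps above, in Python, just to check the
ordering of loads and stores. Assumptions that aren't from the original
discussion: VEC, scale_in_place and the list-based "memory" are made-up names,
a vector is modelled as 8 list elements with boundaries at multiples of 8, and
the final (possibly unaligned) tail vector is loaded up front so it can't see
an earlier overlapping store. Arrays shorter than one vector are rejected here
because they'd need masking.

```python
VEC = 8  # elements per vector (assumed: 8 doubles in one 64-byte register)


def scale_in_place(mem, start, n, factor=3.0):
    """Scale mem[start:start+n] using overlapping VEC-wide loads and stores."""
    if VEC > n:
        raise ValueError("fewer than VEC elements: needs masking instead")
    end = start + n

    def load(i):  # models a vector load followed by the multiply
        return [x * factor for x in mem[i:i + VEC]]

    def store(i, v):  # models a vector store
        mem[i:i + VEC] = v

    first = load(start)             # first vector, possibly unaligned
    tail = load(end - VEC)          # last vector, read before any store
    p = (start + VEC) // VEC * VEC  # like add reg, 64 / and reg, -64
    if p + VEC > end:               # no full aligned vector fits
        store(start, first)
        store(end - VEC, tail)
        return
    cur = load(p)                   # peeled load, before `first` is stored
    store(start, first)
    while end >= p + 2 * VEC:       # loop ends on a pointer compare
        nxt = load(p + VEC)
        store(p, cur)
        cur, p = nxt, p + VEC
    store(p, cur)                   # cleanup: final aligned vector
    store(end - VEC, tail)          # overlapping tail vector


mem = [float(i) for i in range(40)]
scale_in_place(mem, 3, 30)          # elements 3..32 are tripled
```

If start is already a multiple of VEC, p lands at start + VEC and the first
two vectors don't overlap, which matches the already-aligned case.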

If the array already was aligned, there's no overlap. For short arrays, AVX-512

masking can be used to avoid reading or writing past the end, generating masks

on the fly with shlx or shrx.
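
For the short-array case, a rough Python model of building the mask shrx-style
(all-ones shifted right by the number of unused lanes). lane_mask,
masked_scale, VEC and ALL_LANES are hypothetical names, and the per-lane loop
only stands in for a masked load/store; it is not how real AVX-512 code would
look.

```python
VEC = 8           # lanes per vector (assumed: 8 doubles in 64 bytes)
ALL_LANES = 0xFF  # one mask bit per lane


def lane_mask(n):
    """Mask with the low n bits set, made by shifting all-ones right."""
    if n not in range(VEC + 1):
        raise ValueError("n must be between 0 and VEC")
    return ALL_LANES >> (VEC - n)


def masked_scale(mem, start, n, factor=3.0):
    """Scale mem[start:start+n]; masked-off lanes are never touched."""
    mask = lane_mask(n)
    for lane in range(VEC):
        if (mask >> lane) % 2:  # lane is enabled in the mask
            mem[start + lane] *= factor


short = [1.0, 2.0, 3.0, 4.0, 5.0]
masked_scale(short, 2, 3)       # only the last three elements change
```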

Anyway, this is obviously much better than going scalar until an alignment

boundary, in loops where we can sort out aliasing sufficiently, and where

there's only one pointer to worry about so relative misalignment isn't a

factor. In many non-reductions, there are at least two pointers so it may not be

possible to align both.
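
To make the relative-misalignment point concrete, a small Python sketch with
made-up addresses and hypothetical helper names. It assumes two pointers that
advance in lockstep by the same number of bytes; they can then both sit on a
64-byte boundary in the same iteration only if they're congruent modulo 64.

```python
VEC_BYTES = 64  # assumed vector width in bytes (one 512-bit register)


def bytes_to_boundary(addr):
    """Bytes from addr up to the next VEC_BYTES boundary (0 if aligned)."""
    return -addr % VEC_BYTES


def can_align_both(src, dst):
    """True if src and dst reach an alignment boundary at the same offset."""
    return (src - dst) % VEC_BYTES == 0


print(bytes_to_boundary(0x1010))       # 48
print(can_align_both(0x1010, 0x2050))  # True
print(can_align_both(0x1010, 0x2048))  # False
```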

An efficient alignment strategy like this might help make vector width = 512

worth it for more code which doesn't take care to align its arrays. Clearly

that should be a separate feature-request / proposal if there isn't one open

for that already; IDK how hard it would be to teach LLVM (or GCC) that an

overlapping vectors strategy can be good, or if it's just something that

nobody's pointed out before.

Vector ISAs like ARM SVE and I think RISC-V's planned one have good HW support

for generating masks from pointers and stuff like that, but it can be done

manually, especially in AVX-512 with mask registers.</pre>

        </div>

      </p>


    </body>

</html>