<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - VLD2 shuffle patterns gets broken apart by instcombine"

   href="https://bugs.llvm.org/show_bug.cgi?id=47677">47677</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>VLD2 shuffle patterns gets broken apart by instcombine

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Scalar Optimizations

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>david.green@arm.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Given code for which the vectorizer chooses to create a VLD2 interleaving group:

<a href="https://godbolt.org/z/P6E88r">https://godbolt.org/z/P6E88r</a>
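
As a rough illustration only (the function below is a hypothetical stand-in, not
the source behind the link), a loop of this shape reads adjacent pairs with
stride 2, which is the access pattern a factor-2 interleaving group covers:

```c
/* Hypothetical example, not taken from the godbolt link.
   in holds 2*n floats laid out as (a0, b0, a1, b1, ...). */
void sum_sq_pairs(const float *restrict in, float *restrict out, int n) {
    for (int i = 0; i < n; i++) {
        float a = in[2 * i];      /* even elements: mask <0, 2, 4, 6> */
        float b = in[2 * i + 1];  /* odd elements:  mask <1, 3, 5, 7> */
        out[i] = a * a + b * b;   /* two multiplies feeding one add */
    }
}
```

The two multiplies and the add correspond to the two fmuls and the fadd in the
vectorized loop body.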

We generate a loop that does:

  ; VLD2 group

  %wide.vec = load <8 x float>, <8 x float>* %2, align 4, !tbaa !11

  %strided.vec = shufflevector <8 x float> %wide.vec, <8 x float> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

  %strided.vec19 = shufflevector <8 x float> %wide.vec, <8 x float> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

  ; Operations

  %3 = fmul fast <4 x float> %strided.vec, %strided.vec

  %4 = fmul fast <4 x float> %strided.vec19, %strided.vec19

  %5 = fadd fast <4 x float> %4, %3

  ; store

  store <4 x float> %5, <4 x float>* %7, align 4, !tbaa !11

Currently, instcombine's foldVectorBinop sinks the shuffles past the

fmul (as each side has an equal shuffle mask). This means we end up with

regular non-interleaving loads and potentially expensive shuffles between

the fmul and the fadd.
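
The fold itself preserves semantics because a lane-wise multiply commutes with
a shuffle when both operands use the same mask. A small scalar model of that
reasoning, as a sketch only (all names are hypothetical helpers, not LLVM APIs):

```c
/* Scalar model of the fold; every name here is hypothetical. */

/* Pick 4 lanes of an 8-lane vector according to mask. */
static void shuffle4(const float wide[8], const int mask[4], float out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = wide[mask[i]];
}

/* Shape produced by the vectorizer: shuffle first, then multiply 4 lanes. */
void mul_after_shuffle(const float wide[8], const int mask[4], float out[4]) {
    float s[4];
    shuffle4(wide, mask, s);
    for (int i = 0; i < 4; i++)
        out[i] = s[i] * s[i];
}

/* Shape after the fold: multiply all 8 lanes, then shuffle the product. */
void shuffle_after_mul(const float wide[8], const int mask[4], float out[4]) {
    float prod[8];
    for (int i = 0; i < 8; i++)
        prod[i] = wide[i] * wide[i];
    shuffle4(prod, mask, out);
}
```

Both functions compute the same lanes for any mask, so the problem is one of
cost rather than correctness: in the second shape the shuffles no longer read
the wide load directly.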

On NEON for aarch64 and arm this will create zip instructions. For MVE, where

the zip/uzp instructions are not present, we fall back to even more expensive

register moves.

Similar things can happen in other cases:

<a href="https://godbolt.org/z/WKnPE9">https://godbolt.org/z/WKnPE9</a></pre>

        </div>

      </p>


    </body>

</html>