<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - VLD2 shuffle patterns gets broken apart by instcombine"
   href="https://bugs.llvm.org/show_bug.cgi?id=47677">47677</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>VLD2 shuffle patterns gets broken apart by instcombine
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Windows NT
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Scalar Optimizations
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>david.green@arm.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Given code for which the vectorizer chooses to create a VLD2 interleaving group:
<a href="https://godbolt.org/z/P6E88r">https://godbolt.org/z/P6E88r</a>

We generate a loop that does:
  ; VLD2 group
  %wide.vec = load <8 x float>, <8 x float>* %2, align 4, !tbaa !11
  %strided.vec = shufflevector <8 x float> %wide.vec, <8 x float> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec19 = shufflevector <8 x float> %wide.vec, <8 x float> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  ; Operations
  %3 = fmul fast <4 x float> %strided.vec, %strided.vec
  %4 = fmul fast <4 x float> %strided.vec19, %strided.vec19
  %5 = fadd fast <4 x float> %4, %3
  ; store
  store <4 x float> %5, <4 x float>* %7, align 4, !tbaa !11


Currently, instcombine's foldVectorBinop sinks the shuffles past the fmul (as
both operands of each fmul use the same shuffle mask). This means we end up with
regular non-interleaving loads and potentially expensive shuffles between the
fmul and the fadd.
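
As a rough illustration of the transform described above, the sketch below
models the vectors as plain Python lists. The helper names (shufflevector,
fmul, fadd, before_instcombine, after_instcombine) and the sample values are
hypothetical and only mirror the IR in this report; this is not how
instcombine is implemented. It assumes the two identical wide multiplies end
up merged into one.

```python
# Hypothetical model: Python lists stand in for LLVM vector values.
EVEN = [0, 2, 4, 6]  # mask of %strided.vec
ODD = [1, 3, 5, 7]   # mask of %strided.vec19


def shufflevector(vec, mask):
    """Pick lanes of vec by index, like a single-source shufflevector."""
    return [vec[i] for i in mask]


def fmul(a, b):
    """Lane-wise multiply."""
    return [x * y for x, y in zip(a, b)]


def fadd(a, b):
    """Lane-wise add."""
    return [x + y for x, y in zip(a, b)]


def before_instcombine(wide):
    # Shuffles read the wide load directly: the VLD2 pattern is intact.
    even = shufflevector(wide, EVEN)
    odd = shufflevector(wide, ODD)
    return fadd(fmul(odd, odd), fmul(even, even))


def after_instcombine(wide):
    # Shuffles sunk past the fmul: the multiply is now 8 lanes wide and the
    # de-interleaving shuffles sit between the fmul and the fadd.
    squared = fmul(wide, wide)  # assumed merged into a single wide fmul
    return fadd(shufflevector(squared, ODD), shufflevector(squared, EVEN))


wide = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Both forms compute the same lanes, which is why the fold is legal.
assert before_instcombine(wide) == after_instcombine(wide)
```

Both forms give the same result, so the fold is valid; the concern here is
only that the second form no longer has the shuffles attached to the load.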

On NEON for AArch64 and Arm this will create zip instructions. For MVE, where
the zip/uzp instructions are not present, we fall back to even more expensive
register moves.

Similar things can happen in other cases:
<a href="https://godbolt.org/z/WKnPE9">https://godbolt.org/z/WKnPE9</a></pre>
        </div>
      </p>


    </body>
</html>