<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [aarch64] Inappropriate optimization: vtstq NEON intrinsic compiled as a sequence of instructions"

   href="https://bugs.llvm.org/show_bug.cgi?id=52394">52394</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[aarch64] Inappropriate optimization: vtstq NEON intrinsic compiled as a sequence of instructions

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>roman.zelenyi@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>In some cases clang compiles the vtstq intrinsic to an and/cmeq instruction

sequence instead of a single cmtst instruction.

For example:

#include &lt;arm_neon.h&gt;

uint32x4_t foo(uint32x4_t v1, uint32x4_t v2, uint32x4_t v3, uint32x4_t v4)

{

    return vbslq_u32(vtstq_u32(v1, v2), v3, v4);

}

compiles (with -O2, -Os, or even -Oz) to:

        and     v0.16b, v1.16b, v0.16b

        cmeq    v0.4s, v0.4s, #0

        bsl     v0.16b, v3.16b, v2.16b

        ret

The reason for this creativity is unclear: AFAIK, cmtst throughput/latency is

similar to that of cmeq.

In any case, my benchmarks indicate a significant performance degradation

caused by this (the benchmarked case is an unrolled loop consisting mostly of

vbslq and vtstq).

Both GCC and MSVC compile the code above as expected:

        cmtst   v0.4s, v0.4s, v1.4s

        bsl     v0.16b, v2.16b, v3.16b

        ret</pre>
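        <pre>Both sequences should produce the same result, which is easier to see in a
scalar model of a single 32-bit lane. The sketch below is only an illustration:
the helper names (lane_cmtst, lane_cmeq_zero, lane_bsl, select_cmtst,
select_and_cmeq) are hypothetical and are not part of arm_neon.h, and it
assumes unsigned int is 32 bits wide.

```c
/* Scalar model of one 32-bit lane. All names here are illustrative only.
   Assumption: unsigned int is 32 bits wide. */
typedef unsigned int lane_t;

/* cmtst: all ones if the operands share any set bit, else zero. */
static lane_t lane_cmtst(lane_t a, lane_t b)
{
    return (a & b) != 0 ? ~0u : 0u;
}

/* cmeq #0: all ones if the operand is zero, else zero. */
static lane_t lane_cmeq_zero(lane_t x)
{
    return x == 0 ? ~0u : 0u;
}

/* bsl: take bits from t where mask is set, from f where it is clear. */
static lane_t lane_bsl(lane_t mask, lane_t t, lane_t f)
{
    return (mask & t) | (~mask & f);
}

/* Form emitted by GCC and MSVC: cmtst, then bsl. */
static lane_t select_cmtst(lane_t v1, lane_t v2, lane_t v3, lane_t v4)
{
    return lane_bsl(lane_cmtst(v1, v2), v3, v4);
}

/* Form emitted by clang: and, cmeq #0, then bsl with swapped operands. */
static lane_t select_and_cmeq(lane_t v1, lane_t v2, lane_t v3, lane_t v4)
{
    return lane_bsl(lane_cmeq_zero(v1 & v2), v4, v3);
}
```

In this model cmeq #0 yields the inverse of the cmtst mask, and swapping the
last two bsl operands compensates for that, so the clang output is correct but
spends one extra instruction per vtstq.</pre>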

        </div>

      </p>


    </body>

</html>