<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - instcombine forms illegal <2 x i64> multiply with trunc/zext"

   href="https://bugs.llvm.org/show_bug.cgi?id=40032">40032</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>instcombine forms illegal <2 x i64> multiply with trunc/zext

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Scalar Optimizations

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>efriedma@codeaurora.org

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>husseydevin@gmail.com, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com

          </td>

        </tr></table>

      <p>

        <div>

        <pre>C testcase:

#include <arm_neon.h>

typedef uint64x2_t U64x2;

typedef uint32x2_t U32x2;

typedef uint32x4_t U32x4;

U64x2 f(U64x2 top, U64x2 bot) {

        U32x2 d2 = vmovn_u64(bot);

        U32x2 d5 = vmovn_u64(top);

        d5 = vmla_u32(d5, d2, d2);

        return vshll_n_u32(d5, 0);

}

On ARM, we currently end up scalarizing the multiply, instead of using the

vector operation written in the source.  (Even if we improve the expansion of

general <2 x i64> multiplies on ARM, it'll still be more expensive than the

original code. Other targets have similar issues.)
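
To illustrate the cost point, here is a minimal scalar sketch of one lane of a
64-bit multiply built only from 32x32->64 multiplies. It assumes the target has
no native 64-bit vector multiply, so each lane has to be expanded along these
lines; the helper names are hypothetical and this is not what any backend
literally emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one lane of a 64x64->64 multiply expressed with
   three 32x32->64 multiplies plus shifts and adds. */
static uint64_t mul64_from_32(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    /* The cross terms may wrap mod 2^64; only their low 32 bits survive
       the shift, and wrapping does not change those bits. */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;
    /* a_hi * b_hi would be shifted out entirely, so it is never computed. */
    return a_lo * b_lo + (cross << 32);
}

/* The multiply the source actually asked for: one 32-bit multiply per lane. */
static uint32_t mul32(uint32_t a, uint32_t b) {
    return a * b;
}

/* Hypothetical self-check comparing the expansion with a native multiply. */
static void check_mul64_from_32(void) {
    uint64_t a = 0x0123456789ABCDEFull, b = 0xFEDCBA9876543210ull;
    assert(mul64_from_32(a, b) == a * b);
    assert(mul32((uint32_t)a, (uint32_t)b) == (uint32_t)(a * b));
}
```

Even in the best case the wide form needs three multiplies per lane where the
narrow form needs one.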

I think instcombine is being too aggressive here in widening the multiply to

<2 x i64>; I'm not sure the backend can reasonably recover the original

<2 x i32> multiply in general.

IR testcase (for opt -instcombine):

define <2 x i64> @f(<2 x i64> %top, <2 x i64> %bot) {

entry:

  %vmovn.i = trunc <2 x i64> %bot to <2 x i32>

  %vmovn.i9 = trunc <2 x i64> %top to <2 x i32>

  %mul.i = mul <2 x i32> %vmovn.i, %vmovn.i

  %add.i = add <2 x i32> %mul.i, %vmovn.i9

  %0 = zext <2 x i32> %add.i to <2 x i64>

  ret <2 x i64> %0

}
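
For reference, a scalar sketch of why the two forms compute the same value per
lane. The narrow function mirrors the IR above. The wide function assumes the
transformed code is a 64-bit multiply and add followed by a mask of the low 32
bits; that shape is an assumption for illustration, not captured opt output,
and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow form, as in the IR testcase: trunc, 32-bit mul/add, zext. */
static uint64_t lane_narrow(uint64_t top, uint64_t bot) {
    uint32_t b = (uint32_t)bot;
    uint32_t t = (uint32_t)top;
    return (uint64_t)(uint32_t)(b * b + t);
}

/* Assumed widened form: 64-bit mul/add, then keep only the low 32 bits. */
static uint64_t lane_wide(uint64_t top, uint64_t bot) {
    return (bot * bot + top) & 0xFFFFFFFFull;
}

/* Hypothetical self-check: both forms agree on a few sample lanes. */
static void check_lane_forms(void) {
    const uint64_t samples[] = {0ull, 1ull, 0xFFFFFFFFull,
                                0x0123456789ABCDEFull, 0xFFFFFFFFFFFFFFFFull};
    const int n = (int)(sizeof samples / sizeof samples[0]);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            assert(lane_narrow(samples[i], samples[j]) ==
                   lane_wide(samples[i], samples[j]));
}
```

The results match because the low 32 bits of a product and sum depend only on
the low 32 bits of the inputs, so the rewrite is legal; the complaint here is
only about cost on targets without a cheap 64-bit vector multiply.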

Filing based on discussion in <a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Clang generates slow U64x2 multiply code for NEON"

   href="show_bug.cgi?id=39967">bug 39967</a>.</pre>

        </div>

      </p>


    </body>

</html>