<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link bz_status_NEW" title="NEW --- - PMULLD should be avoided if possible on Silvermont" href="https://llvm.org/bugs/show_bug.cgi?id=31202">31202</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>PMULLD should be avoided if possible on Silvermont

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>zvi.rackover@intel.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>For the following case:

define &lt;4 x i32&gt; @foo(&lt;4 x i8&gt; %A) {
  %z = zext &lt;4 x i8&gt; %A to &lt;4 x i32&gt;
  %m = mul nuw nsw &lt;4 x i32&gt; %z, &lt;i32 18778, i32 18778, i32 18778, i32 18778&gt;
  ret &lt;4 x i32&gt; %m
}

The following code is generated for Silvermont:

  pand    .LCPI1_0, %xmm0

  pmulld  .LCPI1_1, %xmm0

  retl

On Silvermont:

PMULLD has a throughput of one instruction per 11 cycles.

PMULHUW/PMULHW/PMULLW each have a throughput of one instruction per 2 cycles.

Note that both multiplicands fit in 16 bits: the zero-extended i8 elements are
at most 255, and the constant 18778 is below 2^15.

We would achieve a higher throughput with the following sequence:

  pshufb

  pmullw

  pmulhw

  punpcklwd
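
The arithmetic behind that sequence can be sketched as a scalar model of a
single 32-bit lane. This is only an illustration in Python, not the code LLVM
emits: the helper names (pmullw_lane, pmulhw_lane, mul_via_16bit, pmulld_lane)
are made up here, and the model assumes PUNPCKLWD places the PMULLW result in
the low word and the PMULHW result in the high word of each 32-bit lane.

```python
# Scalar model of one vector lane. All names are illustrative assumptions,
# not LLVM or Intel APIs.
MASK16 = 0xFFFF


def to_signed16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x


def pmullw_lane(a, b):
    """Low 16 bits of the product of two 16-bit words (one PMULLW lane)."""
    return (a * b) & MASK16


def pmulhw_lane(a, b):
    """High 16 bits of the signed 16x16 product (one PMULHW lane)."""
    return ((to_signed16(a) * to_signed16(b)) >> 16) & MASK16


def mul_via_16bit(byte_value, constant=18778):
    """Rebuild the 32-bit product from its two 16-bit halves.

    Assumes the interleave puts the low half in bits 0..15 and the
    high half in bits 16..31 of the lane.
    """
    lo = pmullw_lane(byte_value, constant)
    hi = pmulhw_lane(byte_value, constant)
    return (hi << 16) | lo


def pmulld_lane(a, b):
    """Low 32 bits of a 32x32 product (one PMULLD lane)."""
    return (a * b) & 0xFFFFFFFF


# Compare both approaches for every possible zero-extended i8 input.
mismatches = [v for v in range(256)
              if mul_via_16bit(v) != pmulld_lane(v, 18778)]
print("mismatching inputs:", mismatches)
```

Because the zero-extended inputs are at most 255 and 18778 is below 2^15, the
signed high half from PMULHW equals the unsigned one, so the model agrees with
a 32-bit multiply for every possible input byte.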

This issue was root-caused by Farhana Aleen while analyzing internal workloads
that would regress if interleaving were enabled for Silvermont in X86TTI
(which is why commit 284779 did not enable interleaving for some subtargets).

It turns out that, with interleaving, the vectorized IR prior to codegen is
decent for the chosen vectorization width. The issue reported here is one of
the major reasons for the slowdown, but fixing it alone only reduces the
regression; it does not eliminate it.</pre>

        </div>

      </p>


    </body>

</html>