<html>
    <head>
      <base href="https://llvm.org/bugs/" />
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW --- - PMULLD should be avoided if possible on Silvermont"
   href="https://llvm.org/bugs/show_bug.cgi?id=31202">31202</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>PMULLD should be avoided if possible on Silvermont
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Windows NT
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: X86
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>zvi.rackover@intel.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr>

        <tr>
          <th>Classification</th>
          <td>Unclassified
          </td>
        </tr></table>
      <p>
        <div>
        <pre>For the following case:
define <4 x i32> @foo(<4 x i8> %A) {
  %z = zext <4 x i8> %A to <4 x i32>
  %m = mul nuw nsw <4 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778>
  ret <4 x i32> %m
}

The following code is generated for Silvermont:
  pand    .LCPI1_0, %xmm0
  pmulld  .LCPI1_1, %xmm0
  retl

On Silvermont:
PMULLD has a throughput of 1/11 [instruction/cycles].
PMULHUW/PMULHW/PMULLW have a throughput of 1/2 [instruction/cycles].

Note that the multiplicands fit in 16-bits.

We would achieve a higher throughput with the following sequence:
  pshufb
  pmullw
  pmulhw
  punpcklwd

This issue was root caused by Farhana Aleen during analysis on internal
workloads which would regress if interleaving would be enabled for Silvermont
in X86TTI (so commit 284779 did not enable interleaving for some subtargets).
It turns out that with interleaving the vectorized IR prior to codegen is
decent for the chosen vectorization width. The issue reported here is one of
the major reasons for the slow-down (but fixing this issue alone only reduces
the regression).</pre>
        </div>
      </p>
      <hr>
      <span>You are receiving this mail because:</span>
      
      <ul>
          <li>You are on the CC list for the bug.</li>
      </ul>
    </body>
</html>