<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Convert mov and shr to shrx in loops constrained by retirement rate"
   href="https://bugs.llvm.org/show_bug.cgi?id=51288">51288</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Convert mov and shr to shrx in loops constrained by retirement rate
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>new-bugs
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>12.0
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>new bugs
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>todd@lipcon.org
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>This input file:

#include <stdint.h>
#include <utility>

struct Foo {
  uint64_t v;
  std::pair<uint32_t, uint32_t> Get() { return {v & 0xffffffff, v >> 32}; }
};

void Process(Foo* f, uint32_t* dst, int n) {
#pragma unroll
  for (int i = 0; i < n; i++) {
    auto [mask, idx] = f[i].Get();
    dst[idx] |= mask;
  }
}

generates assembly in which the core of the loop contains the following sequence:
        movq    24(%rdi,%rax,8), %r9
        movq    %r9, %rcx
        shrq    $32, %rcx
        orl     %r9d, (%rsi,%rcx,4)

When compiling with BMI2 support, it would be slightly faster to load the
constant 32 into a register and use shrx. Because shrx writes to a separate
destination register, a single shrx can replace both the mov that copies %r9
into %rcx and the shr that follows it.
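
For anyone who wants to compare codegen in isolation, here is a reduced sketch
of the same pattern. The name ProcessReduced and the flat uint64_t input are
assumptions made for illustration and are not part of the original reproducer.
I have not verified that it yields exactly the instruction sequence quoted
above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reduced form of Process(): each 64-bit word is split by hand
// instead of going through Foo::Get() and std::pair.
void ProcessReduced(const uint64_t* src, uint32_t* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) {
    const uint64_t v = src[i];
    // Both halves of v are live at the same time: the low half is the value
    // OR-ed into memory and the high half is the index. This is why the
    // two-operand shr needs a copy of v first.
    const uint32_t mask = static_cast<uint32_t>(v);       // low 32 bits
    const uint32_t idx = static_cast<uint32_t>(v >> 32);  // high 32 bits
    dst[idx] |= mask;
  }
}
```

Building this with and without -mbmi2 (for example at -O2) should make it
easy to see whether the mov/shr pair or a shrx is emitted for the shift.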

Generated version:
<a href="https://bit.ly/2WzH8Pj">https://bit.ly/2WzH8Pj</a>

Preferred version (saves roughly half a cycle per iteration of the unrolled-by-4 loop):
<a href="https://bit.ly/3jaXBBh">https://bit.ly/3jaXBBh</a></pre>
        </div>
      </p>


    </body>
</html>