<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - unnecessary 8-bit partial-register usage creates false dependencies."

   href="https://bugs.llvm.org/show_bug.cgi?id=34707">34707</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>unnecessary 8-bit partial-register usage creates false dependencies.

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>peter@cordes.ca

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>unsigned long bzhi_l(unsigned long x, unsigned c) {

    return x &amp; ((1UL &lt;&lt; c) - 1);

}

// <a href="https://godbolt.org/g/sBEyfd">https://godbolt.org/g/sBEyfd</a>

clang 6.0.0 (trunk 313965) -xc -O3 -march=haswell -m32    or znver1

        movb    8(%esp), %al

        bzhil   %eax, 4(%esp), %eax

        retl

This is technically correct (because BZHI only looks at the low 8 bits of

src2), but horrible.  There is *no* advantage to using an 8-bit load here

instead of a 32-bit load: the code size is the same, but the byte load creates

a false dependency on the old value of EAX.

(znver1 definitely doesn't rename partial registers.  Intel Haswell/Skylake

don't rename low8 registers separately from the full register, unlike

Sandybridge or Core2/Nehalem. 

<a href="https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to">https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to</a>).

On Haswell and Skylake, movb  8(%esp), %al  runs at only 1 per cycle, as a

micro-fused ALU+load uop that merges into the old value of EAX.  With an

occasional dep-breaking xor %eax,%eax, it instead bottlenecks on load

throughput (2 loads per clock).
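
For reference, a minimal C sketch of why the byte load above is still
technically correct.  It models 32-bit BZHI as it is commonly documented: only
bits [7:0] of the index operand are used, and an index of 32 or more leaves
the source unchanged.  The function names are hypothetical, and the model
assumes unsigned int is 32 bits (true for the x86 targets discussed here); it
says nothing about performance, only about the result.

```c
/* Hypothetical model of 32-bit BZHI; assumes unsigned int is 32 bits. */
unsigned bzhi32_model(unsigned src1, unsigned src2)
{
    unsigned n = src2 & 0xFFu;        /* only bits [7:0] of the index matter */
    if (n >= 32u)
        return src1;                  /* index saturates: nothing is cleared */
    return src1 & ((1u << n) - 1u);   /* keep the low n bits, zero the rest */
}

/* Models EAX after movb: the low byte comes from c, the upper 24 bits are
 * whatever was in the register before (passed in explicitly as stale_eax). */
unsigned index_after_byte_load(unsigned c, unsigned stale_eax)
{
    return (stale_eax & 0xFFFFFF00u) | (c & 0xFFu);
}
```

Under this model, feeding bzhi32_model() the output of index_after_byte_load()
gives the same result as feeding it c directly, whatever stale_eax holds.  So
the stale upper bits are harmless for correctness, and the only cost of the
8-bit load is the dependency on the old register value.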

Clang seems to be very eager to only move 8 bits instead of the full register. 

Clang 3.9 fixed this for reg-reg moves (e.g. unsigned shift(unsigned x,

unsigned c) {  return x&lt;&lt;c; }   without BMI2), but we're still getting 8-bit

loads.  On Intel CPUs, MOVZX loads are cheaper than narrow MOV loads because

they avoid the ALU uop to merge into the destination.  (It does take an extra

code byte).  AMD CPUs may use an ALU port for MOVZX, but Intel handles it

purely in the load ports.

But anyway, when loading from a 32-bit memory location, it makes no sense to load

only the low 8 bits, unless we have reason to expect it was written with

separate byte stores and we want to avoid a store-forwarding stall.</pre>

        </div>

      </p>


    </body>

</html>