<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Byte+shift loads are not autovectorized to movdqu on SSE4.1"
   href="https://bugs.llvm.org/show_bug.cgi?id=42550">42550</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Byte+shift loads are not autovectorized to movdqu on SSE4.1
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: X86
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>husseydevin@gmail.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Take the following code:

#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#ifdef BYTE_SHIFT
/* Portable little-endian read: assemble the word byte by byte. */
static uint32_t read32(uint8_t const *data, size_t offset)
{
    return (uint32_t) data[offset + 0]
        | ((uint32_t) data[offset + 1] << 8)
        | ((uint32_t) data[offset + 2] << 16)
        | ((uint32_t) data[offset + 3] << 24);
}
#else
/* memcpy-based unaligned read; compilers fold this into a plain load. */
static uint32_t read32(uint8_t const *data, size_t offset)
{
    uint32_t ret;
    memcpy(&ret, data + offset, sizeof(ret));
    return ret;
}
#endif

Both are perfectly valid ways to perform an unaligned 32-bit little-endian
load, and when SSE is disabled, they generate identical code when used.

However, when SSE4.1 is enabled and loops using these loads are
autovectorized, the memcpy variant compiles to a single movdqu, while the
byte+shift expression is expanded literally into a long sequence of pslld,
pinsrb, and pmovzxbd instructions.
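
A loop of roughly this shape is enough to reproduce it (a sketch for
illustration; the exact harness is in the Godbolt link below, and the name
sum32 is mine):

/* Vectorizable reduction: with clang -O3 -msse4.1 the loads in this
   loop are widened to vector loads, which is where the two read32
   variants diverge. */
static uint32_t sum32(uint8_t const *data, size_t nbytes)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 4 <= nbytes; i += 4)
        sum += read32(data, i);
    return sum;
}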

Demo: <a href="https://godbolt.org/z/jCAm2o">https://godbolt.org/z/jCAm2o</a>
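
For reference, the desired lowering can be written directly as an intrinsic
(a sketch, not part of the report; _mm_loadu_si128 is the standard SSE2
unaligned vector load and compiles to exactly this movdqu):

#include &lt;emmintrin.h&gt;

/* One unaligned 16-byte load covers four consecutive uint32_t lanes;
   this is what the memcpy variant's vectorized loop already uses. */
static __m128i load4x32(uint8_t const *data, size_t offset)
{
    return _mm_loadu_si128((__m128i const *)(data + offset));
}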

These types of loads should be converted to movdqu as well.</pre>
        </div>
      </p>


    </body>
</html>