<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - load merging for (data[0]<<0) | (data[1]<<8) | ... endian agnostic load goes berserk with AVX2 variable-shift"
   href="https://bugs.llvm.org/show_bug.cgi?id=35047">35047</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>load merging for (data[0]<<0) | (data[1]<<8) | ... endian agnostic load goes berserk with AVX2 variable-shift
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>new-bugs
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Keywords</th>
          <td>performance
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>new bugs
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>peter@cordes.ca
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>unsigned load_le32(unsigned char *data) {
    unsigned le32 = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
    return le32;
}

// <a href="https://godbolt.org/g/X8i1pr">https://godbolt.org/g/X8i1pr</a>

clang 6.0.0 (trunk 316311) -O3 -march=haswell -mno-avx

        movl    (%rdi), %eax
        retq

-O3 -march=haswell (with AVX2)

.LCPI0_0:
        .quad   16                      # 0x10
        .quad   24                      # 0x18
load_le32:                              # @load_le32
        movzbl  (%rdi), %eax
        movzbl  1(%rdi), %ecx
        shll    $8, %ecx
        vpmovzxbq       2(%rdi), %xmm0  # xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        orl     %eax, %ecx
        vpsllvq .LCPI0_0(%rip), %xmm0, %xmm0
        vmovd   %xmm0, %edx
        vpextrd $2, %xmm0, %eax
        orl     %edx, %eax
        orl     %ecx, %eax
        retq

So if vpsllvq is available, clang uses it for the two high bytes and doesn't
notice that it could have coalesced all four byte loads into one.
-fno-vectorize doesn't block this.  (And for a case where the shift counts
didn't line up this way, this would still be quite poor vectorization.
VPMOVZXBD would have worked, followed by 4 shifts and then a horizontal
reduction with OR, using the same pattern as a horizontal sum, e.g.
vpunpckhqdq / vpor / vmovq / rorx $32, %rax, %rdx / or %edx, %eax.)
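
As a rough illustration of the VPMOVZXBD / shift / horizontal-OR sequence
described above, here is a scalar C model of it.  This is only a sketch: the
function name load_le32_lanes and the lane/count arrays are hypothetical
stand-ins for a 4 x 32-bit vector register, not compiler output, and it assumes
a 32-bit unsigned.

```c
/* Scalar model of the suggested vector sequence (illustrative only). */
unsigned load_le32_lanes(const unsigned char *data) {
    static const unsigned count[4] = {0, 8, 16, 24};
    unsigned lane[4];

    /* Like VPMOVZXBD: zero-extend four bytes into four 32-bit lanes. */
    for (int i = 0; i < 4; i++)
        lane[i] = data[i];

    /* The 4 shifts: each lane is shifted left by its own count. */
    for (int i = 0; i < 4; i++)
        lane[i] <<= count[i];

    /* Horizontal OR, same shape as a horizontal sum:
       fold the high two lanes onto the low two, then lane 1 onto lane 0. */
    lane[0] |= lane[2];
    lane[1] |= lane[3];
    return lane[0] | lane[1];
}
```

For the bytes 0x78, 0x56, 0x34, 0x12 this should produce the same value as
load_le32, which is why a single 32-bit load is the better result whenever the
shift counts match the byte offsets.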

(And BTW, for Haswell and later,  movb 1(%rdi), %al  merges into RAX without
stalling at all.  It's a single micro-fused load+merge uop, so it's better than
a separate movzx load + OR instruction.  See  
<a href="https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to">https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to</a>)


clang 4.0.1 doesn't merge the loads.</pre>
        </div>
      </p>


    </body>
</html>