<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - load merging for (data[0]<<0) | (data[1]<<8) | ... endian agnostic load goes berserk with AVX2 variable-shift"
href="https://bugs.llvm.org/show_bug.cgi?id=35047">35047</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>load merging for (data[0]<<0) | (data[1]<<8) | ... endian agnostic load goes berserk with AVX2 variable-shift
</td>
</tr>
<tr>
<th>Product</th>
<td>new-bugs
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Keywords</th>
<td>performance
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>new bugs
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>peter@cordes.ca
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>unsigned load_le32(unsigned char *data) {
unsigned le32 = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) |
(data[3]<<24);
return le32;
}
// <a href="https://godbolt.org/g/X8i1pr">https://godbolt.org/g/X8i1pr</a>
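
To pin down the intended semantics (a small check of my own, with values I picked): the result is the little-endian interpretation of the 4 bytes regardless of host endianness.

#include <assert.h>

int main(void) {
    unsigned char buf[4] = {0x01, 0x02, 0x03, 0x04};
    assert(load_le32(buf) == 0x04030201u);  // LE value on any host
    return 0;
}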

clang 6.0.0 (trunk 316311) -O3 -march=haswell -mno-avx:

        movl    (%rdi), %eax
        retq

-O3 -march=haswell (with AVX2):

.LCPI0_0:
        .quad   16                      # 0x10
        .quad   24                      # 0x18
load_le32:                              # @load_le32
        movzbl  (%rdi), %eax
        movzbl  1(%rdi), %ecx
        shll    $8, %ecx
        vpmovzxbq 2(%rdi), %xmm0        # xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        orl     %eax, %ecx
        vpsllvq .LCPI0_0(%rip), %xmm0, %xmm0
        vmovd   %xmm0, %edx
        vpextrd $2, %xmm0, %eax
        orl     %edx, %eax
        orl     %ecx, %eax
        retq

So if vpsllvq is available, clang uses it and doesn't notice that it could have
coalesced the loads into one; -fno-vectorize doesn't block this. (And if the
shift counts hadn't lined up this way, it would be vectorized quite poorly:
VPMOVZXBD would have worked, then 4 shifts, and then a horizontal reduction
with OR, using the same pattern as a horizontal sum, e.g. vpunpckhqdq / vpor /
vmovq / rorx $32, %rax, %rdx / or %edx, %eax.)
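
A minimal intrinsics sketch of that shape (my own illustration, not code from
the report; the function name and the memcpy byte-gather are my choices, and
it needs -mavx2 for _mm_sllv_epi32):

#include <immintrin.h>
#include <string.h>

unsigned load_le32_vec(const unsigned char *data) {
    unsigned raw;
    memcpy(&raw, data, 4);                               // the 4 source bytes
    __m128i b  = _mm_cvtsi32_si128((int)raw);            // bytes in lane 0
    __m128i dw = _mm_cvtepu8_epi32(b);                   // PMOVZXBD: 4 bytes -> 4 dwords
    __m128i sh = _mm_sllv_epi32(dw, _mm_setr_epi32(0, 8, 16, 24)); // VPSLLVD
    sh = _mm_or_si128(sh, _mm_unpackhi_epi64(sh, sh));   // OR high qword into low
    sh = _mm_or_si128(sh, _mm_shuffle_epi32(sh, 0x55));  // OR lane 1 into lane 0
    return (unsigned)_mm_cvtsi128_si32(sh);
}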
(And BTW, for Haswell and later, movb 1(%rdi), %al merges into RAX without
stalling at all. It's a single micro-fused load+merge uop, so it's better than
a separate movzx load + OR instruction. See
<a href="https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to">https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to</a>)
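
A possible workaround in the source (a sketch of mine, not from the report):
express the unaligned load with memcpy, which clang and gcc fold into a single
32-bit load. Note this yields the native-endian value, so it matches load_le32
only on little-endian hosts such as x86:

#include <string.h>

unsigned load_native32(const unsigned char *data) {
    unsigned v;
    memcpy(&v, data, sizeof(v));  // compiles to one movl; avoids alignment/aliasing UB
    return v;                     // native byte order: add a byte-swap on big-endian
}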

For comparison, clang 4.0.1 doesn't merge the loads at all.</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>