<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - [X86] Unnecessary stack manipulation remaining from memcpy/extract_subvector"
   href="https://bugs.llvm.org/show_bug.cgi?id=42794">42794</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>[X86] Unnecessary stack manipulation remaining from memcpy/extract_subvector
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Windows NT
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: X86
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>llvm-dev@redking.me.uk
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>bisqwit@iki.fi, craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Split off from [<a class="bz_bug_link 
          bz_status_REOPENED "
   title="REOPENED - [X86][SSE] Missed vXi8 sum reduction optimization"
   href="show_bug.cgi?id=42674">Bug #42674</a>]

Current codegen for 16/32/64 byte sums: <a href="https://godbolt.org/z/76g4k2">https://godbolt.org/z/76g4k2</a>

For the 32-byte reduction below, we should be able to remove the memcpy's
stack spill/reload and use the original load directly.

The 64-byte reduction manages to do this (see Godbolt) but still leaves the
rsp stack manipulation behind.

    #include &lt;string.h&gt;
    unsigned char calculate_checksum(const void* ptr)
    {
        unsigned char bytes[32], result = 0;
        memcpy(bytes, ptr, 32); // Endianness does not matter.
        for(unsigned n=0; n<32; ++n) result += bytes[n];
        return result;
    }


        push    rbp
        mov     rbp, rsp
        and     rsp, -32
        sub     rsp, 64
        vmovups ymm0, ymmword ptr [rdi]
        vmovaps ymmword ptr [rsp], ymm0
        vmovdqa xmm0, xmmword ptr [rsp]
        vpaddb  xmm0, xmm0, xmmword ptr [rsp + 16]
        vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
        vpaddb  xmm0, xmm0, xmm1
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        mov     rsp, rbp
        pop     rbp
        vzeroupper
        ret</pre>
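        <pre>For comparison, a minimal sketch of the copy-free form that the 32-byte case
could in principle be reduced to: bytes[] is only ever read after the memcpy,
so summing straight from ptr gives the same result. The name
calculate_checksum_direct is hypothetical, and the codegen for this variant
has not been checked here.

```c
/* Hypothetical copy-free equivalent of calculate_checksum: it reads the
   32 bytes directly through ptr instead of going via a local buffer. */
unsigned char calculate_checksum_direct(const void* ptr)
{
    const unsigned char* bytes = (const unsigned char*)ptr;
    unsigned char result = 0;
    /* Same wrap-around (mod 256) byte sum as the memcpy version. */
    for (unsigned n = 0; n != 32; ++n) result += bytes[n];
    return result;
}
```

Ideally both forms would lower to the same loads from [rdi], with no stack
realignment or spill slot.</pre>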
        </div>
      </p>


    </body>
</html>