<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [SSE, AVX]: doesn't use packuswb with 2 different registers"

   href="https://bugs.llvm.org/show_bug.cgi?id=34773">34773</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[SSE, AVX]: doesn't use packuswb with 2 different registers

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>peter@cordes.ca

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t

*__restrict__ src, size_t bytes) {

  uint8_t *end_dst = dst + bytes;

  do{

     *dst++ = *src++ >> 8;

  } while(dst < end_dst);

}

<a href="https://godbolt.org/g/RoiFnW">https://godbolt.org/g/RoiFnW</a>

clang 6.0.0 (trunk 314277) -O3

.LBB0_4:                                # =>This Inner Loop Header: Depth=1

        movdqu  (%rsi,%rcx,2), %xmm0

        movdqu  16(%rsi,%rcx,2), %xmm1

        psrlw   $8, %xmm0

        psrlw   $8, %xmm1

        packuswb        %xmm0, %xmm0

        packuswb        %xmm1, %xmm1

        punpcklqdq      %xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0]

        movdqu  %xmm0, (%rdi,%rcx)

       (repeated again with +32 / +16 offsets)

        addq    $32, %rcx

        addq    $2, %rax

        jne     .LBB0_4

Those three shuffles can (and should) be a single packuswb %xmm1, %xmm0.

We can see from -fno-unroll-loops output that it thinks the base case is an

8-byte store.  And that it's only using the shift+pack as a stand-in for SSSE3

pshufb (which it uses when available.  e.g.  -mssse3 -mtune=skylake)

        # no shifts

        pshufb  %xmm0, %xmm1

        pshufb  %xmm0, %xmm2

        punpcklqdq      %xmm2, %xmm1    # xmm1 = xmm1[0],xmm2[0]

This sucks everywhere, but sucks the most on Haswell and later which only have

1 shuffle port.  Skylake even has 2 per clock shift throughput.  If not for the

front-end bottleneck, it can execute 2 shifts + 1 shuffle per clock.  (There's

no load+shift instruction in AVX2, only AVX512.  And BTW clang auto-vectorizes

well for AVX512, with vpsrlw  $8, (%rsi,%rcx,2), %zmm0  /  vpmovwb %zmm0,

(%rdi,%rcx) which is probably optimal on Skylake-avx512.

I think the optimal strategy without AVX512 is to replace one shift with an AND

+ unaligned load offset by -1.  Especially with AVX, that lets the load+ALU

fold into one instruction.  (see <a href="https://stackoverflow.com/a/46477080/224132">https://stackoverflow.com/a/46477080/224132</a>

for details.  With src 32B-aligned, this will never cache-line split, and

should be good on all CPUs back to Nehalem or K10, compiled with or without

-mavx)

       // uint8_t *dst, *src;

     __m128i v0 = _mm_loadu_si128((__m128i*)src);

     __m128i v1_offset = _mm_loadu_si128(1+(__m128i*)(src-1));

     v0 = _mm_srli_epi16(v0, 8);

     __m128i v1 = _mm_and_si128(v1_offset, _mm_set1_epi16(0x00FF));

     __m128i pack = _mm_packus_epi16(v0, v1);

     _mm_storeu_si128((__m128i*)dst, pack);

This works for AVX2 256b vectors with one extra  VPERMQ  at the end to fix-up

the in-lane vpackuswb behaviour (like gcc emits when auto-vectorizing).

clang's AVX2 auto-vectorization is *horrible* here: With  -fno-unroll-loops:

        vmovdqu (%rsi,%rax,2), %ymm1

        vpsrlw  $8, %ymm1, %ymm1

        vextracti128    $1, %ymm1, %xmm2

        vpshufb %xmm0, %xmm2, %xmm2

        vpshufb %xmm0, %xmm1, %xmm1

        vpunpcklqdq     %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]

        vmovdqu %xmm1, (%rdi,%rax)

With default loop unrolling, it's basically that but with vinserti128 to feed

256b stores.  Because bottlenecking even harder on the shuffle port seems like

a great idea (with -march=skylake)...

(If this isn't a symptom of the same bug as the SSE2 sillyness, I guess this

part should be reported separately.)</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>