void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t
*__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do {
     *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

clang 6.0.0 (trunk 314277) -O3

.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rcx,2), %xmm0
        movdqu  16(%rsi,%rcx,2), %xmm1
        psrlw   $8, %xmm0
        psrlw   $8, %xmm1
        packuswb        %xmm0, %xmm0
        packuswb        %xmm1, %xmm1
        punpcklqdq      %xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0]
        movdqu  %xmm0, (%rdi,%rcx)
       (repeated again with +32 / +16 offsets)
        addq    $32, %rcx
        addq    $2, %rax
        jne     .LBB0_4

Those three shuffles can (and should) be a single packuswb %xmm1, %xmm0.
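
For reference, here is roughly what that looks like with intrinsics: two shifts feeding a single packuswb per 16 output bytes.  This is only a sketch; the function names are made up for illustration, and it assumes the output size is a multiple of 16.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference with the same semantics as pack_high8_baseline. */
static void pack_high8_scalar(uint8_t *dst, const uint16_t *src, size_t bytes) {
    for (size_t i = 0; i < bytes; i++)
        dst[i] = (uint8_t)(src[i] >> 8);
}

/* Hypothetical name.  Assumes bytes (the output size) is a multiple of 16. */
static void pack_high8_sse2(uint8_t *dst, const uint16_t *src, size_t bytes) {
    for (size_t i = 0; i < bytes; i += 16) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(src + i + 8));
        v0 = _mm_srli_epi16(v0, 8);   /* psrlw $8: high byte moves to low byte */
        v1 = _mm_srli_epi16(v1, 8);
        /* One packuswb %xmm1, %xmm0: all words are <= 255, so no saturation. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(v0, v1));
    }
}

/* Returns 1 if the SSE2 version matches the scalar one on a test pattern. */
static int pack_high8_sse2_selfcheck(void) {
    uint16_t src[64];
    uint8_t want[64], got[64];
    for (int i = 0; i < 64; i++)
        src[i] = (uint16_t)(i * 1021u + 7u);
    pack_high8_scalar(want, src, 64);
    pack_high8_sse2(got, src, 64);
    return memcmp(want, got, 64) == 0;
}
```

That's 2 loads, 2 shifts, 1 shuffle and 1 store per 16 bytes of output, versus 3 shuffles in the clang output above.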

We can see from the -fno-unroll-loops output that clang treats an 8-byte store
as the base case, and that it's only using shift+pack as a stand-in for SSSE3
pshufb (which it uses when available, e.g. with -mssse3 -mtune=skylake):

        # no shifts
        pshufb  %xmm0, %xmm1
        pshufb  %xmm0, %xmm2
        punpcklqdq      %xmm2, %xmm1    # xmm1 = xmm1[0],xmm2[0]

This sucks everywhere, but sucks the most on Haswell and later, which only have
1 shuffle port.  Skylake even has 2-per-clock shift throughput: if not for the
front-end bottleneck, it could execute 2 shifts + 1 shuffle per clock.  (There's
no load+shift instruction in AVX2, only AVX512.  And BTW clang auto-vectorizes
well for AVX512, with vpsrlw  $8, (%rsi,%rcx,2), %zmm0  /  vpmovwb %zmm0,
(%rdi,%rcx), which is probably optimal on Skylake-avx512.)

I think the optimal strategy without AVX512 is to replace one shift with an AND
+ unaligned load offset by +1 byte, which puts each word's high byte in the low
half of its 16-bit lane.  Especially with AVX, that lets the load+ALU
fold into one instruction.  (See https://stackoverflow.com/a/46477080/224132
for details.)  With src 32B-aligned, this will never cache-line split, and
should be good on all CPUs back to Nehalem or K10, compiled with or without

       // uint8_t *dst, *src;
     // bytes 1..16: the high byte of words 0..7 is the low byte of each lane
     __m128i v0_offset = _mm_loadu_si128((__m128i*)(src+1));
     __m128i v1 = _mm_loadu_si128((__m128i*)(src+16));
     __m128i v0 = _mm_and_si128(v0_offset, _mm_set1_epi16(0x00FF));
     v1 = _mm_srli_epi16(v1, 8);
     __m128i pack = _mm_packus_epi16(v0, v1);
     _mm_storeu_si128((__m128i*)dst, pack);
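
As a complete loop, the AND + offset-load idea could look something like the sketch below.  The function names are hypothetical, and it assumes the output size is a multiple of 16.  In this sketch the masked load starts one byte past the start of each 32-byte block, so that the high byte of each word sits in the low half of its 16-bit lane and a plain AND isolates it; the second half of the block uses the ordinary shift.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name.  Assumes bytes (the output size) is a multiple of 16. */
static void pack_high8_offset(uint8_t *dst, const uint16_t *src16, size_t bytes) {
    const uint8_t *src = (const uint8_t *)src16;
    const __m128i mask = _mm_set1_epi16(0x00FF);
    for (size_t i = 0; i < bytes; i += 16, src += 32) {
        /* Bytes 1..16 of the block: word k's high byte is lane k's low byte. */
        __m128i lo = _mm_loadu_si128((const __m128i *)(src + 1));
        /* Bytes 16..31 of the block: words 8..15, handled with a shift. */
        __m128i hi = _mm_loadu_si128((const __m128i *)(src + 16));
        lo = _mm_and_si128(lo, mask);   /* can fold the load with AVX */
        hi = _mm_srli_epi16(hi, 8);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
}

/* Returns 1 if the result matches a plain scalar loop on a test pattern. */
static int pack_high8_offset_selfcheck(void) {
    uint16_t src[64];
    uint8_t want[64], got[64];
    for (int i = 0; i < 64; i++) {
        src[i] = (uint16_t)(i * 1021u + 7u);
        want[i] = (uint8_t)(src[i] >> 8);
    }
    pack_high8_offset(got, src, 64);
    return memcmp(want, got, 64) == 0;
}
```

Both loads stay inside the current 32-byte block (bytes 1..16 and 16..31), so nothing is read outside the source buffer.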

This works for AVX2 256b vectors with one extra  VPERMQ  at the end to fix-up
the in-lane vpackuswb behaviour (like gcc emits when auto-vectorizing).
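
A sketch of what the 256b version might look like, with the extra VPERMQ.  Assumptions: gcc or clang (it uses the target("avx2") function attribute and __builtin_cpu_supports so it builds without -mavx2), an output size that is a multiple of 32, and made-up function names.

```c
#include <immintrin.h>  /* AVX2 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name.  Assumes bytes (the output size) is a multiple of 32. */
__attribute__((target("avx2")))
static void pack_high8_avx2(uint8_t *dst, const uint16_t *src16, size_t bytes) {
    const uint8_t *src = (const uint8_t *)src16;
    const __m256i mask = _mm256_set1_epi16(0x00FF);
    for (size_t i = 0; i < bytes; i += 32, src += 64) {
        __m256i lo = _mm256_loadu_si256((const __m256i *)(src + 1));
        __m256i hi = _mm256_loadu_si256((const __m256i *)(src + 32));
        lo = _mm256_and_si256(lo, mask);     /* high bytes of words 0..15  */
        hi = _mm256_srli_epi16(hi, 8);       /* high bytes of words 16..31 */
        /* vpackuswb works per 128b lane: qwords come out as lo0 hi0 lo1 hi1. */
        __m256i packed = _mm256_packus_epi16(lo, hi);
        /* vpermq $0xD8 reorders the qwords to lo0 lo1 hi0 hi1. */
        packed = _mm256_permute4x64_epi64(packed, 0xD8);
        _mm256_storeu_si256((__m256i *)(dst + i), packed);
    }
}

/* Returns 1 if the AVX2 result matches a scalar loop.  Also returns 1
   without checking anything when the CPU has no AVX2. */
static int pack_high8_avx2_selfcheck(void) {
    uint16_t src[64];
    uint8_t want[64], got[64];
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("avx2"))
        return 1;
    for (int i = 0; i < 64; i++) {
        src[i] = (uint16_t)(i * 1021u + 7u);
        want[i] = (uint8_t)(src[i] >> 8);
    }
    pack_high8_avx2(got, src, 64);
    return memcmp(want, got, 64) == 0;
}
```

That's still only 2 shuffle-port uops (vpackuswb + vpermq) per 32 bytes of output.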

clang's AVX2 auto-vectorization is *horrible* here: With  -fno-unroll-loops:

        vmovdqu (%rsi,%rax,2), %ymm1
        vpsrlw  $8, %ymm1, %ymm1
        vextracti128    $1, %ymm1, %xmm2
        vpshufb %xmm0, %xmm2, %xmm2
        vpshufb %xmm0, %xmm1, %xmm1
        vpunpcklqdq     %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
        vmovdqu %xmm1, (%rdi,%rax)

With default loop unrolling, it's basically that but with vinserti128 to feed
256b stores.  Because bottlenecking even harder on the shuffle port seems like
a great idea (with -march=skylake)...

(If this isn't a symptom of the same bug as the SSE2 silliness, I guess this
part should be reported separately.)

