<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Use PSADBW for horizontal uint8_t byte sums (and accumulate multiple booleans before using it), instead of widening right away"

   href="https://bugs.llvm.org/show_bug.cgi?id=40686">40686</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Use PSADBW for horizontal uint8_t byte sums (and accumulate multiple booleans before using it), instead of widening right away

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Keywords</th>

          <td>performance

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>peter@cordes.ca

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Same code as PR40685, but this bug is about the other major optimizations that

are possible, not the silly extra shuffles:

// <a href="https://godbolt.org/z/1SEmTu">https://godbolt.org/z/1SEmTu</a>  (same Godbolt link as the other PR)

int count(const bool *visited, int len) {

    int counter = 0;

    for(int i=0;i<100;i++) {  // len unused or not doesn't matter

        if (visited[i]==0)

            counter++;

    }

    return counter;

}

(adapted from:

<a href="https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a">https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a</a>)

At best, once <a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [5,6,7,8,9 regression] auto-vectorization unpacks, repacks, and unpacks to 32-bit again for count += (bool_arr[i]==0) for boolean array, using 3x the shuffles needed"

   href="show_bug.cgi?id=40685">Bug 40685</a> is fixed, clang / LLVM is probably doing something like 

        pmovzxbd        xmm3, dword ptr [rdi + rax]

        pxor    xmm3, xmm1            # flip the bits

        paddd   xmm0, xmm3

This costs us a shuffle and 2 other ALU instructions per 4 bools, regardless of

unrolling.

## Other missed optimizations

For large arrays, we only need ~1.25 vector ALU instruction (and a pure load)

plus loop overhead per 16 / 32 / 64 bools (1 per vector with an extra ALU

amortized over a minor unroll by 4).  See below.

Instead of flipping inside the loop, we can do

   for()

      notcount += arr[i];

   return 100-notcount;

We can prove that this can't overflow notcount, because len is small enough and

bool->int can only be 0 or 1.

Byte counters won't overflow for a compile-time-constant array size of 100, so

we can simply sum outside the loop.

count:   # for hard-coded size=100, fully unrolled

    vmovdqu  ymm0, [rdi]

    vpaddb   ymm0, [rdi+32]         # asm syntax leaving out first src = dst

operand

    vpaddb   ymm0, [rdi+64]

    ; 3 * 32 = 96 elements fully unrolled

    movd      xmm1, [rdi]

    vpaddb    ymm0, ymm1         # add the last 4 elements

    # then hsum the 32x byte accumulators

    vpxor      xmm1, xmm1

    vpsadbw    ymm0, ymm1        # hsum unsigned bytes into 64-bit elements

    vextracti128 xmm1, ymm0, 1

    vpaddd     xmm0, xmm1

    vpunpckhqdq  xmm1, xmm0,xmm0   # saves an imm8 of code size vs. vpshufd

    vpaddd     xmm0, xmm1

    # and do 100 - that

    vmovd      edx, xmm0         # original source used int, so it's only a

32-bit result

    mov        eax, 100

    sub        eax, edx

    ret

Probably actually best to vextracti128 + paddb first, then psadbw xmm (if we

care about Excavator / Ryzen), but that would make overflow of byte elements

possible for sizes half as large.  If we're using that hsum of bytes as a

canned sequence,  probably easiest to have one that works for all sizes up to

255 vectors.  It's a minor difference, like 1 extra ALU uop.

32 * 255 = 8160 fits in 16 bits, so it really doesn't matter what SIMD element

size we use on the result of psadbw.  PADDQ is slower than PADDD/W/B on

Atom/Silvermont including Goldmont Plus (non-AVX CPUs only), so we might as

well avoid it so the same auto-vectorization pattern is good with just SSE2 /

SSE4 on those CPUs.

Signed int overflow is undefined behaviour, but this is counting something

different than the source; the C abstract machine never has a sum of the true

elements in an `int`.  So we'd better only do it in a way that can't overflow.

## For unknown array sizes, we can use psadbw inside the inner loop.  e.g.

(Without AVX, we'd either need an alignment guarantee or separate movdqu loads,

so for large len reaching an alignment boundary could become valuable to avoid

front-end bottlenecks.)

    vpxor    ymm1, ymm1

.loop

    vmovdqu  ymm0, [rdi]

    vpaddb   ymm0, [rdi+32]         # leaving out first src = dst

    vpaddb   ymm0, [rdi+64]

    sub      rdi, -128

    vpaddb   ymm0, [rdi + 96 - 128]  # still using a disp8 addressing mode

    vpsadbw  ymm0, ymm1             # hsum bytes to qword elements

    vpaddq   ymm2, ymm0             # accumulate into a qword vector

    cmp      rdi, rsi

    jb      .loop

    return len - hsum(ymm2)

Or if we can't / don't want to sink the boolean inversion out of the loop, we

can start with a vector of set1_epi8( 4 ) and do 

    vpsubb  ymm0, ymm3, [rdi]          ; 4 - v0.  ymm3 = set1(4)

    vpsubb  ymm0, ymm0, [rdi+32]       ; . - v1

    ...

Or other even-less optimal ways of flipping inside the loop before summing.

In the general case of counting compare results, vpcmpeqb / vpsubb is good.

---

For hot loops we can put off psadbw for up to 255 iterations (254 vpaddb)

without overflowing a byte sum, but branch misses for inner-loop counts over

~22 on Skylake probably make that what we should aim for.  Probably with 2

accumulators inside an inner loop to hide latency and maybe allow 2x load+add

per clock.  More than 2 accumulators probably isn't worth it for a short inner

loop; we're breaking that short inner-loop dep chain every outer-loop iteration

so out-of-order exec can overlap the end of one with the start of the next and

find ILP that way, if imperfect scheduling delays the inner loop.  (Of course

at that point you're probably not running from L1d anymore and we're looking at

memory delays.)

Anyway, see

<a href="https://stackoverflow.com/questions/54541129/how-to-count-character-occurrences-using-simd">https://stackoverflow.com/questions/54541129/how-to-count-character-occurrences-using-simd</a>

for an example of this.  (Where we get a 0 or -1 vector from pcmpeqb compare

result).

---------

---------

Even less optimal would be to PXOR + PSADBW for every vector of bytes, into a

vector of qword or dword accumulators.

Then you'd have fully-standard auto-vectorization with counter vectors having

the same element width as the C variable, and the same contents, so you

wouldn't have to prove anything about whether it overflowed or not.

But as a first step, that might be easier to teach llvm about?

Plus it works for summing any uint8_t, not taking advantage of bool (or compare

results) being able to be added (subtracted) without overflow for multiple

iterations.

-----

PSADBW is single-uop on pretty much everything from Merom onwards, even Atom

and P4; it's critical for video encoding motion search where it's used with

both inputs being variable, not constant zero.

e.g. on Haswell it's 1 uop for p0 only, 5 cycle latency.  On SKL it moved to P5

and is down to 3c latency, but competes with shuffles.

KNL has slow VPSADBW YMM, but even its VPSADBW XMM is not bad.  (It doesn't

have AVX512BW, so it doesn't have VPSADBW ZMM.)  KNL is slow for a lot of

integer AVX2 YMM instructions that it doesn't have a 512-bit version of.

---

Horizontal sums of signed bytes (int8_t) without overflow can be done by

range-shifting to unsigned with XOR to flip the high bits, then PSADBW, then

subtract 16 * 0x80 or whatever at the very end to undo the bias.

This is definitely worth it vs. signed unpacking with pmovsx or worse, punpck +

arithmetic shifts to manually do the sign extension.</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>