[cfe-users] Puzzling vector optimisation inconsistency
Chris Webb via cfe-users
cfe-users at lists.llvm.org
Sun Oct 11 14:03:52 PDT 2020
I've been investigating performance inconsistencies of some vector code
when compiled with clang.
Trying to boil it down to a minimal example, I'm puzzled by the following:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef uint32_t uint32x4_t __attribute__((vector_size(16)));

static uint8_t buffer[1 << 28];

int main(void) {
    CLASS uint32x4_t state = { 1, 2, 3, 4 };

    for (size_t i = 0; i < sizeof(buffer); i++)
        buffer[i] = i;

    double start = (double) clock() / CLOCKS_PER_SEC;
    for (size_t j = 0; j < sizeof(buffer) - 15; j += 16) {
        /* XOR in a chunk of buffer ignoring endianness: */
        for (uint8_t i = 0; i < 16; i++)
            ((uint8_t *) &state)[i] ^= buffer[j + i];

        /* Do some random vector work on top of each chunk: */
        state = state * state | (uint32x4_t) { 5, 5, 5, 5 };
    }
    double finish = (double) clock() / CLOCKS_PER_SEC;

    printf("%0.1f MB/s: ", sizeof(buffer) / (finish - start) / (1 << 20));
    printf("%08x,%08x,%08x,%08x\n", state[0], state[1], state[2], state[3]);
    return EXIT_SUCCESS;
}
This program runs at a quarter of the speed when compiled with -DCLASS= (so
the state vector is automatic) compared to when it is compiled with
-DCLASS=static (so the state vector is static):
$ clang -Wall -DCLASS= -O3 test.c -o test && ./test
819.7 MB/s: 71453c4d,b15df5e5,64535a1d,68709dd5
$ clang -Wall -DCLASS=static -O3 test.c -o test && ./test
3519.1 MB/s: 71453c4d,b15df5e5,64535a1d,68709dd5
It is also fast if the state vector is moved out to global/file scope. The
behaviour is the same between two different x86-64 clangs on two different
OSes:
$ clang --version
Alpine clang version 10.0.1
Target: x86_64-alpine-linux-musl
Thread model: posix
InstalledDir: /usr/bin
$ clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
I know the inner 16-iteration loop is silly and could be written as a single
vector XOR against a cast chunk of buffer, but this is heavily boiled down.
(The real code isn't really amenable to being transformed like that. It also
does a lot more vector work, so the performance effect is more subtle.)
Despite the byte-wise inner loop, the compiler does a superb job when the
state vector is declared static or global, storing it in a vector register
without writing to memory at all, and optimising the XOR to a single vector
operation.
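The difference is easy to see in the generated assembly, e.g. by comparing
the output of the two builds with -S:

$ clang -Wall -DCLASS= -O3 -S -o - test.c
$ clang -Wall -DCLASS=static -O3 -S -o - test.c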
Is there any way I can coax it into compiling the auto state as efficiently
as the static one? Is there something I've underspecified here, so I 'get
lucky' in the one case?
Many thanks in advance for any help or pointers anyone can offer.
Best wishes,
Chris.
PS One thing I wonder is whether there's a cleaner way to access bytes of
the vector as lvalues that will optimise more consistently, but I haven't
found an alternative that doesn't perform worse. For example, one experiment
I tried was to replace the uint32x4_t state with a union:
    static union {
        uint32x4_t u32;
        uint8x16_t u8;
    } state;
and use state.u8[i ^ 3] ^= ... to update the bytes instead of ((uint8_t *)
&state)[i ^ 3] ^= ...
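Concretely, in terms of the boiled-down example above, the experiment amounts
to roughly the following declarations and inner-loop body, with the rest of
main() unchanged (uint8x16_t typedef'd the same way as uint32x4_t; I've
dropped the i ^ 3 byte swap since it only matters in the real code):

    typedef uint8_t uint8x16_t __attribute__((vector_size(16)));

    static union {
        uint32x4_t u32;
        uint8x16_t u8;
    } state = { { 1, 2, 3, 4 } };

and, inside the chunk loop:

        /* XOR in a chunk of buffer via the byte member: */
        for (uint8_t i = 0; i < 16; i++)
            state.u8[i] ^= buffer[j + i];

        /* The vector work now goes through the u32 member: */
        state.u32 = state.u32 * state.u32 | (uint32x4_t) { 5, 5, 5, 5 };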
But this makes it always slow (like the auto case) instead of always fast
(like the static case).