[cfe-users] Clang vectorization
Michael Hamburg
mike at shiftleft.org
Tue Mar 17 16:46:02 PDT 2015
Hello cfe-users,
I’m trying to get clang (or GCC for that matter) to vectorize a very simple loop, and I’m wondering what I’m doing wrong. I’d rather write the loop as a loop instead of using intrinsics or the clang vector extensions, because I want the code to be portable. Pragmas and magic attributes are also undesirable, but they’re better than intrinsics.
This file is representative of what I’m trying to do. I’m compiling with -O3 -std=c99 -mavx2, but the same issues should apply for other vector settings.
"""
#include <stdint.h>
typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;
void add(a_vector *a, const a_vector *b) {
    for (int i=0; i<8; i++) a->limb[i] += b->limb[i];
}
void mac(a_vector *a, const a_vector *b) {
    const a_vector c = {{0,1,2,3,4,5,6,7}};
    for (int i=0; i<8; i++) a->limb[i] += b->limb[i] + 3*c.limb[i];
}
"""
Can someone suggest flags, pragmas, attributes etc which would cause these functions to produce good code? I’m seeing lots of problems. I’m testing for now on clang-3.6 release.
For starters, the compiler is unable to determine that there is no loop dependency, and therefore unrolls the loop instead of vectorizing. When passed #pragma clang loop unroll(disable) vectorize(enable), it is still not able to determine that there is no dependency, and so branches to a scalar version if a is close to b. Furthermore, it ignores the alignment hint and uses vmovdqu for everything, though maybe that doesn’t actually cost any performance. In fact, there cannot be a loop dependency both because of the alignment and because the arrays are in structs.
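For concreteness, here is a minimal sketch of where that pragma goes. The name add_pragma is hypothetical, and the #if guard is only there so that other compilers do not warn about an unknown pragma. This shows placement only; as described above, clang 3.6 still emits the runtime overlap check with it.

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Hypothetical name: add() from this post with the clang loop pragma
 * attached directly in front of the loop it applies to. */
void add_pragma(a_vector *a, const a_vector *b) {
#if defined(__clang__)
#pragma clang loop unroll(disable) vectorize(enable)
#endif
    /* Iteration i reads and writes only limb[i], so even when a == b
     * no iteration depends on the result of another one. */
    for (int i = 0; i < 8; i++) a->limb[i] += b->limb[i];
}
```

Calling add_pragma(&x, &x) simply doubles every limb, which is the a == b case the real code needs to keep working.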
Clang produces the correct code if a is declared __restrict__, but in the real code it is possible that a=b so I’d rather not say __restrict__ if I don’t have to (especially since the code may be inlined, possibly causing alias analysis to break). GCC has #pragma GCC ivdep, which causes it to vectorize properly, but does Clang have any equivalent to #pragma ivdep? Also, __restrict__ still doesn’t give me vmovdqa.
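The two variants mentioned in this paragraph would look roughly like the sketch below. The names add_restrict and add_ivdep are hypothetical. The GCC pragma is guarded so that clang, which also defines __GNUC__, does not see it.

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Variant 1: __restrict__ promises the compiler that a and b never alias.
 * Calling this with a == b would break that promise, which is why it is
 * undesirable for the real code. */
void add_restrict(a_vector *__restrict__ a, const a_vector *__restrict__ b) {
    for (int i = 0; i < 8; i++) a->limb[i] += b->limb[i];
}

/* Variant 2: GCC-only hint that the loop has no loop-carried dependencies.
 * It leaves the function signature alone, so a == b remains legal. */
void add_ivdep(a_vector *a, const a_vector *b) {
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (int i = 0; i < 8; i++) a->limb[i] += b->limb[i];
}
```

Both compute the same result as add; the difference is only in what the compiler is told about aliasing.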
For mac, with __restrict__ (again undesirable) I get decent 2-way vectorized sse3 code, which isn’t bad I guess, but I’d rather the compiler automatically produced 4-way avx2 code. If I add #pragma clang loop unroll(disable) vectorize(enable), I get
"""
vmovdqa mac.c(%rip), %ymm0
vpbroadcastq .LCPI2_0(%rip), %ymm1
vpmuludq %ymm1, %ymm0, %ymm2
vpxor %ymm3, %ymm3, %ymm3
vpmuludq %ymm3, %ymm0, %ymm4
vpsllq $32, %ymm4, %ymm4
vpaddq %ymm4, %ymm2, %ymm2
vpsrlq $32, %ymm0, %ymm0
vpmuludq %ymm1, %ymm0, %ymm0
vpsllq $32, %ymm0, %ymm0
vpaddq %ymm0, %ymm2, %ymm0
vpaddq (%rsi), %ymm0, %ymm0
vpaddq (%rdi), %ymm0, %ymm0
vmovdqu %ymm0, (%rdi)
vmovdqa mac.c+32(%rip), %ymm0
vpmuludq %ymm1, %ymm0, %ymm2
vpmuludq %ymm3, %ymm0, %ymm3
vpsllq $32, %ymm3, %ymm3
vpaddq %ymm3, %ymm2, %ymm2
vpsrlq $32, %ymm0, %ymm0
vpmuludq %ymm1, %ymm0, %ymm0
vpsllq $32, %ymm0, %ymm0
vpaddq %ymm0, %ymm2, %ymm0
vpaddq 32(%rsi), %ymm0, %ymm0
vpaddq 32(%rdi), %ymm0, %ymm0
vmovdqu %ymm0, 32(%rdi)
vzeroupper
retq
"""
In other words, clang has failed to propagate constants, and is trying to do 64-bit multiplies (lowered to vpsllq and vpmuludq) at runtime.
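To spell out what that instruction sequence computes: vpmuludq only multiplies the low 32 bits of each 64-bit lane, so a full 64-bit multiply is assembled from three such products plus shifts and adds. Below is a scalar sketch of the same arithmetic. The names mul64_via_32 and three_c are hypothetical, and three_c is simply what 3*c.limb[i] would be if it were folded at compile time.

```c
#include <stdint.h>

/* Low 64 bits of x*y built from 32x32->64-bit products, mirroring the
 * vpmuludq / vpsrlq / vpsllq / vpaddq pattern in the listing above. */
uint64_t mul64_via_32(uint64_t x, uint64_t y) {
    uint64_t x_lo = (uint32_t)x, x_hi = x >> 32;
    uint64_t y_lo = (uint32_t)y, y_hi = y >> 32;
    uint64_t lo_lo  = x_lo * y_lo;          /* vpmuludq on the low halves   */
    uint64_t cross1 = (x_lo * y_hi) << 32;  /* vpmuludq, then vpsllq $32    */
    uint64_t cross2 = (x_hi * y_lo) << 32;  /* vpsrlq $32, vpmuludq, vpsllq */
    return lo_lo + cross1 + cross2;         /* two vpaddq                   */
}

/* 3*c.limb[i] for c = {0,1,2,3,4,5,6,7}, folded by hand. With constant
 * propagation, none of the multiplies would be needed at runtime. */
const uint64_t three_c[8] = {0, 3, 6, 9, 12, 15, 18, 21};
```

In the listing, one factor is the broadcast constant 3, whose high half is the zeroed %ymm3, so one of the cross products is always zero.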
Can anyone help me get decent, portable code out of this? GCC performs well on add with #pragma GCC ivdep, but it also does silly things with mac.
Is there a way to do this which doesn’t depend on intrinsics or extensions? If I absolutely have to write this with intrinsics or extensions, is there a nice way to do it which doesn’t change the struct definition and doesn’t break strict aliasing?
Thanks a lot,
— Mike