[cfe-users] Clang vectorization

Tue Mar 17 16:46:02 PDT 2015

Hello cfe-users,

I’m trying to get clang (or GCC for that matter) to vectorize a very simple loop, and I’m wondering what I’m doing wrong.  I’d rather write the loop as a loop instead of using intrinsics or the clang vector extensions, because I want the code to be portable.  Pragmas and magic attributes are also undesirable, but they’re better than intrinsics.

This file is representative of what I’m trying to do.  I’m compiling with -O3 -std=c99 -mavx2, but the same issues should apply for other vector settings.
"""
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

void add(a_vector *a, const a_vector *b) {
    for (int i=0; i<8; i++) a->limb[i] += b->limb[i];
}

void mac(a_vector *a, const a_vector *b) {
    const a_vector c = {{0,1,2,3,4,5,6,7}};
    for (int i=0; i<8; i++) a->limb[i] += b->limb[i] + 3*c.limb[i];
}
"""

Can someone suggest flags, pragmas, attributes, etc. that would cause these functions to produce good code?  I’m seeing lots of problems.  I’m testing for now on the clang-3.6 release.

For starters, the compiler is unable to determine that there is no loop-carried dependency, and therefore unrolls the loop instead of vectorizing.  When passed #pragma clang loop unroll(disable) vectorize(enable), it still cannot prove that there is no dependency, so it emits a runtime overlap check and branches to a scalar version if a is close to b.  Furthermore, it ignores the alignment hint and uses vmovdqu for everything, though maybe that doesn’t actually cost any performance.  In fact, there cannot be a loop dependency, both because of the alignment and because the arrays are in structs.
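
For reference, a minimal sketch of where the loop pragma goes.  The function name add_with_hint is hypothetical (not part of the original file), and the __clang__ guard is an assumption added so that other compilers simply skip the hint.  The pragma only requests vectorization; it does not by itself promise that a and b are independent.

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Hypothetical name: same body as add, plus the Clang loop hint. */
void add_with_hint(a_vector *a, const a_vector *b) {
    /* The pragma must come immediately before the loop it applies to. */
#if defined(__clang__)
#pragma clang loop vectorize(enable) unroll(disable)
#endif
    for (int i = 0; i < 8; i++) a->limb[i] += b->limb[i];
}
```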

Clang produces the code I want if a is declared __restrict__, but in the real code it is possible that a=b, so I’d rather not say __restrict__ if I don’t have to (especially since the code may be inlined, possibly causing alias analysis to break).  GCC has #pragma GCC ivdep, which causes it to vectorize properly.  Does Clang have any equivalent to #pragma GCC ivdep?  Also, __restrict__ still doesn’t give me vmovdqa.
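
One portable way to sidestep the aliasing question without __restrict__ is sketched below, under the assumption that an extra local copy is acceptable.  All results are written to a local struct, which cannot overlap *a or *b, and then stored back in one assignment, so the function stays well defined when a=b.  The name add_via_copy is hypothetical, and whether a given compiler turns this into aligned vector code is not guaranteed; check the generated assembly.

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Hypothetical name, not from the original file. */
void add_via_copy(a_vector *a, const a_vector *b) {
    a_vector tmp;  /* local object: cannot alias *a or *b */
    for (int i = 0; i < 8; i++) tmp.limb[i] = a->limb[i] + b->limb[i];
    *a = tmp;      /* one struct store, after every load has been done */
}
```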

For mac, with __restrict__ (again undesirable) I get decent 2-way vectorized SSE3 code, which isn’t bad I guess, but I’d rather the compiler automatically produced 4-way AVX2 code.  If I add #pragma clang loop unroll(disable) vectorize(enable), I get:
"""
	vmovdqa	mac.c(%rip), %ymm0
	vpbroadcastq	.LCPI2_0(%rip), %ymm1
	vpmuludq	%ymm1, %ymm0, %ymm2
	vpxor	%ymm3, %ymm3, %ymm3
	vpmuludq	%ymm3, %ymm0, %ymm4
	vpsllq	$32, %ymm4, %ymm4
	vpaddq	%ymm4, %ymm2, %ymm2
	vpsrlq	$32, %ymm0, %ymm0
	vpmuludq	%ymm1, %ymm0, %ymm0
	vpsllq	$32, %ymm0, %ymm0
	vpaddq	%ymm0, %ymm2, %ymm0
	vpaddq	(%rsi), %ymm0, %ymm0
	vpaddq	(%rdi), %ymm0, %ymm0
	vmovdqu	%ymm0, (%rdi)
	vmovdqa	mac.c+32(%rip), %ymm0
	vpmuludq	%ymm1, %ymm0, %ymm2
	vpmuludq	%ymm3, %ymm0, %ymm3
	vpsllq	$32, %ymm3, %ymm3
	vpaddq	%ymm3, %ymm2, %ymm2
	vpsrlq	$32, %ymm0, %ymm0
	vpmuludq	%ymm1, %ymm0, %ymm0
	vpsllq	$32, %ymm0, %ymm0
	vpaddq	%ymm0, %ymm2, %ymm0
	vpaddq	32(%rsi), %ymm0, %ymm0
	vpaddq	32(%rdi), %ymm0, %ymm0
	vmovdqu	%ymm0, 32(%rdi)
	vzeroupper
	retq
"""
In other words, clang has failed to propagate constants, and is trying to do 64-bit multiplies (lowered to vpsllq and vpmuludq) at runtime.
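
As a workaround sketch, the constant products can be folded by hand so that no multiply is left for the vectorizer to lower.  The table three_c (a hypothetical name) holds 3*c.limb[i] for the c used in mac above, and mac_folded is likewise a hypothetical name.  This assumes the constants really are known at compile time in the real code.

```c
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Hypothetical table: 3*c.limb[i] for c = {0,1,2,3,4,5,6,7}. */
static const a_vector three_c = {{0, 3, 6, 9, 12, 15, 18, 21}};

/* Hypothetical name: computes the same result as mac, with adds only. */
void mac_folded(a_vector *a, const a_vector *b) {
    for (int i = 0; i < 8; i++) a->limb[i] += b->limb[i] + three_c.limb[i];
}
```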

Can anyone help me get decent, portable code out of this?  GCC performs well on add with #pragma GCC ivdep, but it also does silly things with mac.

Is there a way to do this which doesn’t depend on intrinsics or extensions?  If I absolutely have to write this with intrinsics or extensions, is there a nice way to do it which doesn’t change the struct definition and doesn’t break strict aliasing?
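
If extensions do turn out to be necessary, one possible shape is sketched below, using the vector_size attribute that both GCC and Clang accept.  The struct definition is unchanged, and the limbs are moved in and out of vector values with memcpy rather than pointer casts, which keeps the code within the strict aliasing rules.  The names v4u64 and add_vec_ext are hypothetical.  Compilers usually turn fixed-size memcpy calls like these into plain vector loads and stores, but that is an expectation, not a guarantee.

```c
#include <stdint.h>
#include <string.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* GCC/Clang vector extension: four 64-bit lanes, 32 bytes in total. */
typedef uint64_t v4u64 __attribute__((vector_size(32)));

/* Hypothetical name.  Processes the eight limbs as two 4-lane vectors. */
void add_vec_ext(a_vector *a, const a_vector *b) {
    for (int i = 0; i < 8; i += 4) {
        v4u64 x, y;
        memcpy(&x, &a->limb[i], sizeof x);  /* load without a pointer cast */
        memcpy(&y, &b->limb[i], sizeof y);
        x += y;                             /* lane-wise 64-bit add */
        memcpy(&a->limb[i], &x, sizeof x);  /* store the result back */
    }
}
```

Because each chunk is fully loaded before it is stored, this also behaves correctly when a=b.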

Thanks a lot,
— Mike