[LLVMdev] NEON intrinsics preventing redundant load optimization?

Simon Taylor simontaylor1 at ntlworld.com
Sun Dec 7 11:15:51 PST 2014


Hi all,

I’m not sure if this is the right list, so apologies if not.

While doing some profiling I noticed that some of my hand-tuned matrix multiply code using NEON intrinsics was much slower when called through a C++ template wrapper than when the intrinsics function was called directly. It turned out clang/LLVM was unable to eliminate a temporary, even though the case seemed quite straightforward. Unfortunately, a load directly after a NEON store carries a significant penalty on many ARM cores: the wrapped version, which stores to a temporary and then loads and stores again to the final location, was almost 4x slower than the direct version without the temporary.

I'm using the clang shipped with the latest Xcode + iOS SDK: Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
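For reference, something along these lines should reproduce it from the command line (flags approximate; -O3 and an armv7 target match the listings below):

xcrun --sdk iphoneos clang++ -arch armv7 -O3 -S test.cpp -o test.s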

Here's a simplified test case:

struct vec4
{
	float data[4];
};

vec4 operator* (vec4& a, vec4& b)
{
	vec4 result;
	for(int i = 0; i < 4; ++i)
		result.data[i] = a.data[i] * b.data[i];

	return result;
}

void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
{
	result = a * b;
}

With -O3 the loop gets vectorized and the code generated looks optimal:

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
	vld1.32	{d16, d17}, [r1]
	vld1.32	{d18, d19}, [r0]
	vmul.f32	q8, q9, q8
	vst1.32	{d16, d17}, [r2]
	bx	lr

However, if I replace operator* with a NEON intrinsic implementation (I know the vectorizer produced optimal code in this simple case anyway, but that wasn't true in my real situation), then the temporary "result" is kept in the code generated for the test function, and it triggers the load-after-NEON-store penalty described above:

#include <arm_neon.h>

vec4 operator* (vec4& a, vec4& b)
{
	vec4 result;

	float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
	vst1q_f32(result.data, result_data);

	return result;
}

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
	sub	sp, #16
	vld1.32	{d16, d17}, [r1]
	vld1.32	{d18, d19}, [r0]
	mov	r0, sp
	vmul.f32	q8, q9, q8
	vst1.32	{d16, d17}, [r0]
	vld1.32	{d16, d17}, [r0]
	vst1.32	{d16, d17}, [r2]
	add	sp, #16
	bx	lr
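For reference, the "direct version" I mentioned at the top looks roughly like this (function name invented for the example); the product is stored straight into the caller's memory, so there is no temporary and no load following the store, and it compiles to essentially the optimal sequence shown for the scalar loop:

void Vec4MultiplyDirect(const vec4& a, const vec4& b, vec4& result)
{
	// Multiply straight into the destination; no stack temporary is involved.
	vst1q_f32(result.data, vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data)));
}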

Is there something about the use of intrinsics that prevents the compiler from optimizing out the redundant store to (and load from) the stack? Is there any hope of this improving in the future, or anything I can do now to improve the generated code?
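One workaround I can see (sketched below; the names are just for illustration, and it assumes the rest of the code can live with float32x4_t's alignment requirements and with arm_neon.h leaking into the headers) is to keep the vector type in the struct itself, so the result never has to round-trip through a float array on the stack:

#include <arm_neon.h>

struct vec4q
{
	float32x4_t data;	// value stays in a NEON register type
};

vec4q operator* (const vec4q& a, const vec4q& b)
{
	vec4q result;
	result.data = vmulq_f32(a.data, b.data);	// no store/reload of a float array
	return result;
}

void TestVec4qMultiply(const vec4q& a, const vec4q& b, vec4q& result)
{
	result = a * b;
}

That keeps the value in registers once everything is inlined, but I'd much rather keep the plain float[4] layout if the optimizer can be taught to remove the temporary.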

Thanks,

Simon


