[LLVMdev] NEON intrinsics preventing redundant load optimization?

Sun Dec 7 15:57:51 PST 2014

----- Original Message -----
> From: "Simon Taylor" <simontaylor1 at ntlworld.com>
> To: llvmdev at cs.uiuc.edu
> Sent: Sunday, December 7, 2014 1:15:51 PM
> Subject: [LLVMdev] NEON intrinsics preventing redundant load optimization?
> 
> Hi all,
> 
> I’m not sure if this is the right list, so apologies if not.

This is not a bad place ;)

> 
> Doing some profiling I noticed some of my hand-tuned matrix multiply
> code with NEON intrinsics was much slower through a C++ template
> wrapper vs calling the intrinsics function directly. It turned out
> clang/LLVM was unable to eliminate a temporary even though the case
> seemed quite straightforward. Unfortunately any loads directly after
> NEON stores seem to be bad news on many arm cores (the wrapped
> version that stores to a temporary, then loads and stores back to
> the final location was almost 4x slower than the direct version
> without the temporary).
> 
> I'm using the clang in the latest XCode + iOS SDK: Apple LLVM version
> 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
> 
> Here's a simplified test case:
> 
> struct vec4
> {
> 	float data[4];
> };
> 
> vec4 operator* (vec4& a, vec4& b)
> {
> 	vec4 result;
> 	for(int i = 0; i < 4; ++i)
> 		result.data[i] = a.data[i] * b.data[i];
> 
> 	return result;
> }
> 
> void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
> {
> 	result = a * b;
> }
> 
> With -O3 the loop gets vectorized and the code generated looks
> optimal:
> 
> __Z16TestVec4MultiplyR4vec4S0_S0_:
> @ BB#0:
> 	vld1.32	{d16, d17}, [r1]
> 	vld1.32	{d18, d19}, [r0]
> 	vmul.f32	q8, q9, q8
> 	vst1.32	{d16, d17}, [r2]
> 	bx	lr
> 
> However if I replace the operator* with a NEON intrinsic
> implementation (I know the vectorizer figured out optimal code in
> this case anyway, but that wasn't true for my real situation) then
> the temporary "result" seems to be kept in the generated code for
> the test function, and triggers the bad penalty of a load after a
> NEON store.
> 
> vec4 operator* (vec4& a, vec4& b)
> {
> 	vec4 result;
> 	
> 	float32x4_t result_data = vmulq_f32(vld1q_f32(a.data),
> 	vld1q_f32(b.data));
> 	vst1q_f32(result.data, result_data);
> 
> 	return result;
> }
> 
> __Z16TestVec4MultiplyR4vec4S0_S0_:
> @ BB#0:
> 	sub	sp, #16
> 	vld1.32	{d16, d17}, [r1]
> 	vld1.32	{d18, d19}, [r0]
> 	mov	r0, sp
> 	vmul.f32	q8, q9, q8
> 	vst1.32	{d16, d17}, [r0]
> 	vld1.32	{d16, d17}, [r0]
> 	vst1.32	{d16, d17}, [r2]
> 	add	sp, #16
> 	bx	lr
> 
> Is there something about the use of intrinsics that prevents the
> compiler optimizing out the redundant store on the stack?

I recommend filing a bug report so that someone can look at this in detail. You can do this at llvm.org/bugs -- select "libraries" as the product, and then "Scalar Optimizations" as the component (that's probably right, and we can always change it if it turns out the problem lies elsewhere).

In the mean time, I recommend trying to pass by value, instead of by reference, in your multiplication operator. It is hard to say without looking at the code in detail, but it is easier for the compiler to analyze:
  vec4 operator* (vec4 a, vec4 b)
than to analyze:
  vec4 operator* (vec4& a, vec4& b)

 -Hal

> Is there
> any hope for this improving in the future, or anything I can do now
> to improve the generated code?
> 
> Thanks,
> 
> Simon
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory