[LLVMdev] NEON intrinsics preventing redundant load optimization?
simontaylor1 at ntlworld.com
Wed Dec 10 03:13:50 PST 2014
On 10 Dec 2014, at 09:19, Simon Taylor <simontaylor1 at ntlworld.com> wrote:
> On 9 Dec 2014, at 02:20, Jim Grosbach <grosbach at apple.com> wrote:
>> FWIW, with top of tree clang, I get the same (good) code for both of the implementations of operator* in the original email. That appears to be a fairly recent improvement, though I haven’t bisected it down or anything.
> Thanks for the note. I’m building a recent checkout now to see if I see the same behaviour.
Things have definitely improved with the top of tree, and as Jim reported the examples I started with now generate the expected code. The NEON load/store intrinsics still appear in the IR, though, rather than the "load <4 x float>" form produced by the auto-vectorized C.
There’s an additional example in the bug report [http://llvm.org/bugs/show_bug.cgi?id=21778] that tests a chained multiply (i.e. res = a * b * c). In this case the current top of tree clang still leaves one redundant temporary.
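For context, the chained-multiply case has roughly this shape (a minimal scalar sketch; the vec4 type and the loop body are stand-ins for the NEON version, which would use vmulq_f32 inside operator* as in the original email):

```cpp
#include <cassert>

// Scalar stand-in for the vec4 type from the original email; the NEON
// build would perform the multiply with vmulq_f32 instead of the loop.
struct vec4 { float data[4]; };

vec4 operator*(const vec4& a, const vec4& b)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.data[i] = a.data[i] * b.data[i];
    return r;
}

// The chained case from PR21778: each intermediate vec4 is a candidate
// for a redundant store/load round-trip if the optimizer can't see
// through the NEON load/store intrinsics.
vec4 chain(const vec4& a, const vec4& b, const vec4& c)
{
    return a * b * c;
}
```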
> It’s great news if this is fixed in the current tip, but in the short term (for app store builds using the official toolchain) are there any LLVM-specific extensions to initialise a float32x4_t that will get lowered to the "load <4 x float>* %1” form? Or is that more a question for the clang folks?
I’ve managed to replace the load/store intrinsics with pointer dereferences (along with a typedef to get the alignment correct). This generates exactly the same IR and asm as the auto-vectorized C version (both using -O3), and works with the toolchain in the latest Xcode. Are there any concerns around doing this?
typedef float32x4_t __attribute__((aligned(4))) f32x4_align4_t;

vec4 operator* (const vec4& a, const vec4& b)
{
    vec4 result;
    float32x4_t a_data = *((f32x4_align4_t*)a.data);
    float32x4_t b_data = *((f32x4_align4_t*)b.data);
    float32x4_t result_data = vmulq_f32(a_data, b_data);
    *((f32x4_align4_t*)result.data) = result_data;
    return result;
}
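The same trick can be sketched off-target with the Clang/GCC vector extension standing in for float32x4_t (so it compiles without arm_neon.h; f32x4, f32x4_align4 and mul here are illustrative names, not part of the original code):

```cpp
#include <cassert>

// Portable stand-in for float32x4_t using the Clang/GCC vector extension.
typedef float f32x4 __attribute__((vector_size(16)));

// Same idea as the f32x4_align4_t typedef: tell the compiler the pointee
// may be only 4-byte aligned, so a plain dereference of a float[4] that
// lacks 16-byte alignment is legal.
typedef f32x4 __attribute__((aligned(4))) f32x4_align4;

struct vec4 { float data[4]; };

vec4 mul(const vec4& a, const vec4& b)
{
    vec4 result;
    f32x4 a_data = *(const f32x4_align4*)a.data;
    f32x4 b_data = *(const f32x4_align4*)b.data;
    f32x4 r = a_data * b_data;   // element-wise, like vmulq_f32
    *(f32x4_align4*)result.data = r;
    return result;
}
```

Because the loads and stores are ordinary dereferences rather than intrinsics, the optimizer is free to forward values through them and drop redundant temporaries.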