[PATCH][AArch64] RE: patches with initial implementation of Neon scalar instructions
Ana Pazos
apazos at codeaurora.org
Wed Sep 11 14:07:21 PDT 2013
Hi Tim,
This intrinsic is legacy ARMv7:
int64x1_t vadd_s64(int64x1_t a, int64x1_t b)
I can generate "add d0, d1, d0" from this one using IR operations. This is
not a problem.
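For example (a minimal sketch of my own, not code from the patch): with the
operands kept in the int64x1_t vector type, a plain IR add stays on the D
registers.

    #include <arm_neon.h>

    /* Sketch only: both operands are <1 x i64> values, so an ordinary IR add
       is expected to select "add d0, d1, d0". */
    int64x1_t add_v1i64(int64x1_t a, int64x1_t b) {
      return vadd_s64(a, b);
    }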
The ARMv8 intrinsic is:
int64_t vaddd_s64(int64_t a, int64_t b)
I was just pointing out that if I define the ARMv8 intrinsic in terms of the
legacy ARMv7 intrinsic, i.e. as
  (int64_t) vadd_s64((int64x1_t)a, (int64x1_t)b)
it results in "add x0, x1, x0".
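For illustration, here is roughly what that wrapper looks like when written
with the ACLE conversion intrinsics instead of raw casts (sketch only; the
helper name is made up):

    #include <arm_neon.h>

    /* Sketch only: round-trip through int64x1_t via vcreate_s64 and
       vget_lane_s64. After optimization the round-trip folds away and the
       backend is free to select "add x0, x1, x0". */
    static inline int64_t vaddd_s64_via_legacy(int64_t a, int64_t b) {
      int64x1_t va = vcreate_s64((uint64_t)a);
      int64x1_t vb = vcreate_s64((uint64_t)b);
      return vget_lane_s64(vadd_s64(va, vb), 0);
    }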
The same happens if I translate the ARMv8 builtin for vaddd_s64 into
CreateAdd code in CGBuiltin.
So, coming from a builtin for an ARMv8 ACLE scalar intrinsic, it does not look
like I can use IR operations to guarantee that scalar Neon instructions are
selected. So I defined an LLVM intrinsic instead.
Now we need to confirm what the expected implementation of a Neon intrinsic
is: should it produce only Neon code, or the best code possible?
The spreadsheet I have with the AArch64 intrinsic definitions shows that a
Neon instruction is expected:
  Intrinsic name   Instruction Generated   Operands   Notes   Example prototypes
  ---------------  ----------------------  ---------  ------  ------------------
  Scalar add
    vaddd_s64      ADD                     Dd,Da,Db
    vaddd_u64      ADD                     Dd,Da,Db
This is one of the side effects of using the v1ix and v1fx types in the backend.
Thanks,
Ana.
-----Original Message-----
From: Tim Northover [mailto:t.p.northover at gmail.com]
Sent: Wednesday, September 11, 2013 7:35 AM
To: Ana Pazos
Cc: llvm-commits; cfe-commits at cs.uiuc.edu; rajav at codeaurora.org;
mcrosier at codeaurora.org
Subject: Re: patches with initial implementation of Neon scalar instructions
Hi Ana,
> Because the Clang builtins use scalar types, if in CGBuiltin I
> transform the code below into an IR operation using CreateAdd, the
> code is optimized and results in the machine instruction 'add X0, X1,
> X0', which is not what we want.
I wonder if that's actually true. Realistically, the function you wrote *is*
better implemented with "add x0, x1, x0", instead of three fmovs and an
"add d0, d1, d0".
If you put a call to your vaddd_s64 back into a "vector" context, where it
*does* make sense to use the "add d0, d1, d0" version, then I think LLVM will
get it right again:
    int64x1_t my_own_little_function(int64x1_t a, int64x1_t b) {
      return vaddd_s64((int64_t)a, (int64_t)b);
    }
After inlining I'd expect the optimised IR to still contain an "add <1 x
i64>" here and the assembly to use the "add d0, d1, d0" form (in this case
faster than 3 fmovs and an "add x0, x1, x0").
Obviously LLVM isn't perfect at spotting these contexts yet, but I don't
think we should be hobbling it by insisting on a SISD add just because
that's what the intrinsic notionally maps to.
Cheers.
Tim.