[PATCH][AArch64] RE: patches with initial implementation of Neon scalar instructions
Ana Pazos
apazos at codeaurora.org
Wed Sep 11 14:07:21 PDT 2013
Hi Tim,
This intrinsic is legacy ARMv7:
int64x1_t vadd_s64(int64x1_t a, int64x1_t b)
I can generate "add d0, d1, d0" from this one using IR operations. This is
not a problem.
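For example (a minimal sketch of my own, not code from the patch): with the
operands kept in the int64x1_t vector type, a plain IR add stays on the D
registers.

    #include <arm_neon.h>

    /* Sketch only: both operands are <1 x i64> values, so an ordinary IR add
       is expected to select "add d0, d1, d0". */
    int64x1_t add_v1i64(int64x1_t a, int64x1_t b) {
      return vadd_s64(a, b);
    }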
The ARMv8 intrinsic is:
int64_t vaddd_s64(int64_t a, int64_t b)
I was just pointing out that if I define the ARMv8 intrinsic in terms of the
legacy ARMv7 intrinsic, i.e. as
  (int64_t) vadd_s64((int64x1_t)a, (int64x1_t)b)
it results in "add x0, x1, x0".
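For illustration, here is roughly what that wrapper looks like when written
with the ACLE conversion intrinsics instead of raw casts (sketch only; the
helper name is made up):

    #include <arm_neon.h>

    /* Sketch only: round-trip through int64x1_t via vcreate_s64 and
       vget_lane_s64. After optimization the round-trip folds away and the
       backend is free to select "add x0, x1, x0". */
    static inline int64_t vaddd_s64_via_legacy(int64_t a, int64_t b) {
      int64x1_t va = vcreate_s64((uint64_t)a);
      int64x1_t vb = vcreate_s64((uint64_t)b);
      return vget_lane_s64(vadd_s64(va, vb), 0);
    }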
The same happens if I translate the ARMv8 builtin for vaddd_s64 into
CreateAdd code in CGBuiltin.
So, coming from a builtin for an ARMv8 ACLE scalar intrinsic, it does not look
like I can use IR operations to guarantee that scalar Neon instructions are
selected. So I defined an LLVM intrinsic instead.
Now we need to confirm what the expected implementation of a Neon intrinsic
is: should it produce only Neon code, or the best code possible?
The spreadsheet I have with the AArch64 intrinsic definitions shows that a
Neon instruction is expected:
  Intrinsic name   Instruction Generated   Operands   Notes   Example prototypes
  ---------------  ----------------------  ---------  ------  ------------------
  Scalar add
    vaddd_s64      ADD                     Dd,Da,Db
    vaddd_u64      ADD                     Dd,Da,Db
This is one of the side effects of using the v1ix and v1fx types in the backend.
Thanks,
Ana.
-----Original Message-----
From: Tim Northover [mailto:t.p.northover at gmail.com]
Sent: Wednesday, September 11, 2013 7:35 AM
To: Ana Pazos
Cc: llvm-commits; cfe-commits at cs.uiuc.edu; rajav at codeaurora.org;
mcrosier at codeaurora.org
Subject: Re: patches with initial implementation of Neon scalar instructions
Hi Ana,
> Because the Clang builtins use scalar types, if in CGBuiltin I
> transform the code below into an IR operation using CreateAdd, the
> code is optimized and results in the machine instruction 'add X0, X1,
> X0', which is not what we want.
I wonder if that's actually true. Realistically, the function you wrote *is*
better implemented with "add x0, x1, x0", instead of three fmovs and an
"add d0, d1, d0".
If you put a call to your vaddd_s64 back into a "vector" context, where it
*does* make sense to use the "add d0, d1, d0" version, then I think LLVM will
get it right again:
    int64x1_t my_own_little_function(int64x1_t a, int64x1_t b) {
      return vaddd_s64((int64_t)a, (int64_t)b);
    }
After inlining I'd expect the optimised IR to still contain an "add <1 x
i64>" here and the assembly to use the "add d0, d1, d0" form (in this case
faster than 3 fmovs and an "add x0, x1, x0").
Obviously LLVM isn't perfect at spotting these contexts yet, but I don't
think we should be hobbling it by insisting on a SISD add just because
that's what the intrinsic notionally maps to.
Cheers.
Tim.