[PATCH][AArch64] updated patches with initial implementation of Neon scalar instructions

Tue Sep 17 18:16:47 PDT 2013

Hi folks,

I have rebased my patches now that dependent pending patches are merged.

I have also made these additional changes:

1) Adopted the v1ix and v1if solution.
   I will revisit it when the "global instruction selection" is in place.

   Tim, can you talk more about this upcoming LLVM change?
   a) Will it still be SelectionDAG based?
   b) How will having whole-function knowledge help me distinguish when to
      create integer and scalar Neon operations without adding the v1ix and
      v1if types?

2) Introduced a new operator, OP_SCALAR_ALIAS, to allow creating AArch64
scalar intrinsics that are aliases of legacy ARM intrinsics.

   Example:
__ai int64_t vaddd_s64(int64_t __a, int64_t __b) {
  return (int64_t)vadd_s64((int64x1_t)__a, (int64x1_t)__b); }

Note that even with this change, the AArch64 intrinsic vaddd_s64 will NOT
generate "add d0, d1, d0" but the optimized code "add x0, x1, x0", because of
the casts to int64_t.

I experimented with compiling aarch64-neon-intrinsics.c with -O0 instead of
-O3, but the instruction combining pass still makes this optimization.

So we are really dependent on the compiler optimizations here.

But note that directly calling the legacy ARM intrinsic vadd_s64 produces
"add d0, d1, d0", since the inputs are of type v1i64 and I have the proper
instruction selection pattern defined.
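
To illustrate why the casts let the optimizer reduce the alias to a plain
integer add, here is a minimal host-portable sketch. It does not use
arm_neon.h: sketch_int64x1_t, sketch_vadd_s64, sketch_vaddd_s64, to_v1 and
from_v1 are hypothetical stand-ins invented for this example, with memcpy
playing the role of the (int64x1_t) casts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for int64x1_t so the sketch builds on any host. */
typedef struct { int64_t lane[1]; } sketch_int64x1_t;

/* Hypothetical stand-in for the legacy vector intrinsic vadd_s64. */
static sketch_int64x1_t sketch_vadd_s64(sketch_int64x1_t a, sketch_int64x1_t b) {
    sketch_int64x1_t r;
    /* Add as unsigned so the sketch has no signed-overflow undefined behavior. */
    r.lane[0] = (int64_t)((uint64_t)a.lane[0] + (uint64_t)b.lane[0]);
    return r;
}

/* Same-size reinterpretations, standing in for the (int64x1_t)/(int64_t) casts. */
static sketch_int64x1_t to_v1(int64_t x) {
    sketch_int64x1_t v;
    memcpy(&v, &x, sizeof v);
    return v;
}

static int64_t from_v1(sketch_int64x1_t v) {
    int64_t x;
    memcpy(&x, &v, sizeof x);
    return x;
}

/* Scalar alias in the same shape as the vaddd_s64 example above. */
static int64_t sketch_vaddd_s64(int64_t a, int64_t b) {
    return from_v1(sketch_vadd_s64(to_v1(a), to_v1(b)));
}
```

Since the inputs and the result are plain 64-bit integers and the
reinterpretations are no-ops, nothing in the alias forces the value through a
vector type once the casts are folded away.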

3) Got rid of the int_aarch64_sisd_add(u,s)64 and int_aarch64_sisd_sub(u,s)64
intrinsics, as a side effect of implementing (2).

With these intrinsics removed, we cannot guarantee that vaddd_(s,u)64 and
vsubd_(s,u)64 will produce "add/sub d0, d1, d0".
I am allowing these intrinsics to generate integer code, which is the best
implementation for them, as Tim pointed out.
I updated the tests accordingly.

4) Used FMOV instead of UMOV to move values between the Neon and integer
units when possible.

For 32-bit and 64-bit types I tried to use FMOV instructions. For 8-bit and
16-bit types I use UMOV instructions.
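
The size rule above can be sketched as a small lookup. This is only an
illustration of the choice, not code from the patch: sketch_pick_move is a
hypothetical helper, and the "when possible" cases where FMOV cannot be used
for a 32-bit or 64-bit type are not modeled.

```c
#include <stddef.h>

/* Hypothetical helper: returns the mnemonic chosen for a given element size
 * in bits, or NULL for sizes this message does not discuss. */
static const char *sketch_pick_move(unsigned bits) {
    switch (bits) {
    case 32:
    case 64:
        return "fmov";   /* 32-bit and 64-bit types: prefer FMOV */
    case 8:
    case 16:
        return "umov";   /* 8-bit and 16-bit types: use UMOV */
    default:
        return NULL;     /* not covered here */
    }
}
```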

Let me know if you have any more comments on these patches.

Thanks,
Ana.

-----Original Message-----
From: Tim Northover [mailto:t.p.northover at gmail.com] 
Sent: Friday, September 13, 2013 2:02 AM
To: Kevin Qin
Cc: Ana Pazos; rajav at codeaurora.org; llvm-commits; cfe-commits at cs.uiuc.edu
Subject: Re: [PATCH][AArch64]RE: patches with initial implementation of Neon
scalar instructions

Hi Kevin,

> From my perspective, DAG should only hold operations with value type, 
> but not a certain register class. Which register class to be used is 
> decided by compiler after some cost calculation. If we bind v1i32 and 
> v1i64 to FPR, then it's hard for compiler to make this optimization.

In an ideal world, I completely agree. Unfortunately the SelectionDAG
infrastructure just doesn't make these choices intelligently. It looks at
each node in isolation and chooses an instruction based on the types
involved. If there were two "(add i64:$Rn, i64:$Rm)" patterns then only one
of them would ever match.
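
As a toy model of that point (not how LLVM's generated matcher is actually
implemented), consider a selector that keys patterns only on opcode and value
type and takes the first hit. The table, struct sketch_pattern and
sketch_select below are hypothetical names made up for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* A pattern is identified only by its opcode and value type. */
struct sketch_pattern {
    const char *opcode;
    const char *type;
    const char *inst;
};

static const struct sketch_pattern sketch_table[] = {
    { "add", "i64",   "add x, x, x" },
    { "add", "i64",   "add d, d, d" },  /* same key as above: never reached */
    { "add", "v1i64", "add d, d, d" },  /* distinct type: reachable */
};

/* Look at one node in isolation and return the first matching pattern. */
static const char *sketch_select(const char *opcode, const char *type) {
    size_t n = sizeof sketch_table / sizeof sketch_table[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(sketch_table[i].opcode, opcode) == 0 &&
            strcmp(sketch_table[i].type, type) == 0)
            return sketch_table[i].inst;
    }
    return NULL;  /* no pattern for this opcode/type pair */
}
```

In this model the second i64 entry is dead, while giving the Neon form its own
v1i64 type makes both instructions selectable.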

I view this v1iN nonsense as an unfortunate but necessary temporary measure,
until we get our global instruction selection.

I think the only way you could get LLVM to produce both an "add x, x, x" and
an "add d, d, d" from sensible IR without it would be a separate
(MachineInstr) pass which goes through afterwards and patches things up.

The number of actually duplicated instructions is small enough that this
might be practical, but it would have its own ugliness even if it worked
flawlessly (why v1i8, v1i16 but i32 and i64? There's a good reason, but it's
not pretty).

I'm not implacably opposed to the approach, but I think you'd find
implementing it quite a bit of work. Basically, the main thing I want to
avoid is an int_aarch64_sisd_add intrinsic. That seems like it's the worst
of all possible worlds.

Cheers.

Tim.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: latest-clang-scalar-arith-scalar-reduce-pairwise-version2
Type: application/octet-stream
Size: 24613 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20130917/eddef407/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: latest-llvm-scalar-arith-scalar-reduce-pairwise-version2
Type: application/octet-stream
Size: 97535 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20130917/eddef407/attachment-0001.obj>