[PATCH][AArch64] updated patches with initial implementation of Neon scalar instructions

Wed Sep 18 10:28:30 PDT 2013

Thanks Tim and Kevin,

I will bring back the solution with intrinsics and look at the failing test.

Kevin, I produced the patch on top of yesterday's tip and all tests passed
then. I will try again with today's tip.

Ana.

-----Original Message-----
From: Tim Northover [mailto:t.p.northover at gmail.com] 
Sent: Wednesday, September 18, 2013 7:13 AM
To: Ana Pazos
Cc: rajav at codeaurora.org; llvm-commits; cfe-commits at cs.uiuc.edu; Kevin Qin;
mcrosier at codeaurora.org
Subject: Re: [PATCH][AArch64] updated patches with initial implementation of
Neon scalar instructions

Hi Ana,

>   Tim, can you talk more about this upcoming LLVM change?

The most detailed information is in Jakob's RFC thread a little while
back:
http://llvm.1065342.n5.nabble.com/global-isel-Proposal-for-a-global-instruction-selector-td60331.html

>         a) Will it still be SelectionDAG based?

No. One of the main goals is to get rid of SelectionDAG because of various
limitations and its complexity. No code's been written yet so it's all very
nebulous, but it may well still use most of the patterns in .td files (as
FastISel does, or in some improved fashion).

>         b) How having whole function knowledge will help me 
> distinguish when to create Integer and scalar Neon operations without 
> adding the v1x and v1f types?

The idea is that LLVM will have two (add i64:$Rn, i64:$Rm) patterns, but
distinguish them by the register bank they operate on.

It'll then look at the entire function and decide which register bank any
given operation would be best in (based on register pressure, available
instructions, surrounding instructions, etc.). This would let it pick the
GPR64 or FPR64 "add" as appropriate.
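
As a purely illustrative toy (nothing like this exists in LLVM yet, and the
real selector may work quite differently), the bank choice for one operation
could be sketched in C as a majority vote over the banks of its neighbours,
so that fewer cross-bank copies are needed. The names choose_bank, BANK_GPR
and BANK_FPR are hypothetical, and register pressure is ignored entirely:

```c
#include <stddef.h>

/* Hypothetical register banks, for illustration only. */
enum bank { BANK_GPR, BANK_FPR };

/*
 * Toy heuristic (an assumption, not LLVM's algorithm): given the banks
 * already assigned to an operation's neighbours (the producers of its
 * operands and the users of its result), pick the bank that the majority
 * of them live in. Ties and the empty case fall back to GPR.
 */
static enum bank choose_bank(const enum bank *neighbours, size_t n) {
  size_t fpr = 0;
  for (size_t i = 0; i < n; ++i) {
    if (neighbours[i] == BANK_FPR)
      ++fpr;
  }
  size_t gpr = n - fpr;
  return fpr > gpr ? BANK_FPR : BANK_GPR;
}
```

With neighbours {FPR, FPR, GPR} this picks the FPR64 "add"; with only GPR
neighbours it picks the GPR64 one.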

>    Example:
> __ai int64_t vaddd_s64(int64_t __a, int64_t __b) {
>   return (int64_t)vadd_s64((int64x1_t)__a, (int64x1_t)__b); }
>
> Note that even with this change, the AArch64 intrinsic vaddd_s64 will 
> NOT generate "add d0, d1, d0" but the optimized code "add x0, x1, x0" 
> because of the casts to int64_t.

I see what you mean. @vaddd_s64 gets optimised to a simple "add i64"
and LLVM doesn't decide to undo that after it's been inlined into a caller.
I was sure I had tested that worked, but apparently not properly.

The final IR is:

define <1 x i64> @my_own_little_function(<1 x i64> %a, <1 x i64> %b) #0 {
  %0 = extractelement <1 x i64> %a, i32 0
  %1 = extractelement <1 x i64> %b, i32 0
  %2 = add i64 %1, %0
  %3 = insertelement <1 x i64> undef, i64 %2, i32 0
  ret <1 x i64> %3
}

which is about as vectory as you can get except for that "add" in the middle
there.
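
For reference, the cast-based definition can be reproduced off-target with a
small C sketch. It assumes the GCC/Clang vector extension as a stand-in for
arm_neon.h, and the my_-prefixed names are hypothetical rather than the real
header definitions. The point is only that the casts are plain bitcasts, so
the whole thing is semantically a 64-bit integer add and the optimiser is
free to treat it as one:

```c
#include <stdint.h>

/* Stand-in for int64x1_t (assumption: GCC/Clang vector_size extension). */
typedef int64_t my_int64x1_t __attribute__((vector_size(8)));

/* One-element vector add, mirroring the shape of vadd_s64. */
static inline my_int64x1_t my_vadd_s64(my_int64x1_t a, my_int64x1_t b) {
  return a + b;
}

/*
 * Scalar wrapper written like the quoted vaddd_s64: cast the scalars to
 * one-element vectors, add, and cast back. The casts only reinterpret the
 * same 64 bits, so nothing ties the operation to a vector register.
 */
static inline int64_t my_vaddd_s64(int64_t a, int64_t b) {
  return (int64_t)my_vadd_s64((my_int64x1_t)a, (my_int64x1_t)b);
}
```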

I think I was wrong about the intrinsics here, and your first solution was
the best available. How easy would it be to add them back in?

> 4) Used FMOV instead of UMOV to move registers from Neon/integer units 
> when possible

That sounds sensible.

Tim.