[PATCH] [AArch64] Add v8.1a "Rounding Doubling Multiply Add/Subtract" extension

James Molloy james.molloy at arm.com
Fri Mar 27 07:48:04 PDT 2015


Hi Vladimir,

I've checked out your patch and fiddled around with it. It is possible, but ugly, to match your unmatchable pattern.

First, we need to properly legalize the intrinsic. It has type i16 (and takes i16 arguments). i16 is not a legal type, so it needs to be promoted, but the generic code can't promote a target intrinsic for us, so we need to do it ourselves. There are two ways to do this: either create a new AArch64ISD node for this operation, or operate on ISD::INTRINSIC_WO_CHAIN nodes themselves. For simplicity I've done the latter.

First, we need to tell the target-agnostic gubbins that we want to custom lower intrinsic nodes:

  // Somewhere near AArch64ISelLowering.cpp:120
  setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::i16, Custom);

Now we need to implement the custom lowering. Because the i16 result type is illegal, the type legalizer hands the node to ReplaceNodeResults() rather than LowerOperation():

  // In AArch64ISelLowering.cpp, in ReplaceNodeResults():
    case ISD::INTRINSIC_WO_CHAIN: {
      auto ID = getIntrinsicID(N);
      if ((ID == Intrinsic::aarch64_neon_sqrdmulh ||
           ID == Intrinsic::aarch64_neon_sqadd) &&
          N->getValueType(0) == MVT::i16) {
        // Promote to i32: any-extend the i16 arguments (operand 0 of the
        // node is the intrinsic ID; the real arguments start at operand 1)...
        SDLoc DL(N);

        auto Op0 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(1));
        auto Op1 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(2));

        // ...re-emit the intrinsic at i32, and truncate the result back to i16.
        auto NN = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
                              DAG.getConstant(ID, MVT::i32),
                              Op0, Op1);
        NN = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, NN);
        Results.push_back(NN);
      }
      return;
    }
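
(For reference, `getIntrinsicID` is the small file-local helper that already lives in AArch64ISelLowering.cpp. Sketched here from memory, so double-check against the actual file:)

  // Approximate shape of the existing helper: operand 0 of an
  // INTRINSIC_WO_CHAIN node is a constant holding the intrinsic ID,
  // and the real arguments follow it.
  static unsigned getIntrinsicID(const SDNode *N) {
    unsigned Opcode = N->getOpcode();
    switch (Opcode) {
    default:
      return Intrinsic::not_intrinsic;
    case ISD::INTRINSIC_WO_CHAIN: {
      unsigned IID = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();
      if (IID < Intrinsic::num_intrinsics)
        return IID;
      return Intrinsic::not_intrinsic;
    }
    }
  }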

With this change, we can get code that at least doesn't crash:

  umov    w8, v0.h[3]
  fmov    s0, w0
  fmov    s1, w1
  fmov    s2, w8
  sqrdmlah        s0, s1, s2
  fmov    w0, s0
  ret

That uses the i32 variant of the sqrdmlah instruction. We need to do at least this much, I think, because we can't have intrinsics that just crash the compiler.
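
For context, here's my reading of what the 16-bit sqrdmlah computes per element. This is a sketch from memory of the v8.1a pseudocode, so treat the exact rounding and saturation details as an assumption rather than the spec:

  // Reference semantics of the 16-bit SQRDMLAH as I understand them
  // (hedged sketch; check the ARM ARM pseudocode before relying on it).
  #include <cstdint>

  static int16_t sqrdmlah16(int16_t acc, int16_t a, int16_t b) {
    // Widen everything, double the product, accumulate, round to nearest.
    int64_t r = ((int64_t)acc << 16) + 2 * (int64_t)a * (int64_t)b + (1 << 15);
    r >>= 16;                             // take the high half
    if (r > INT16_MAX) return INT16_MAX;  // saturate on overflow
    if (r < INT16_MIN) return INT16_MIN;
    return (int16_t)r;
  }

The doubling, rounding and saturation all happen at the element width, which is why we ultimately want to select the h-register form for the i16 intrinsic rather than settle for the i32 fallback above.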

Now, matching the pattern. What we need to match is basically the same as the i32_indexed version of the pattern, but with `v8i16` instead of `v4i32`:

  def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),
                   (i32 (int_aarch64_neon_sqrdmulh
                     (i32 FPR32Op:$Rn),
                     (i32 (vector_extract (v8i16 V128:$Rm),
                                          VectorIndexH:$idx)))))),

But the pattern to generate is even uglier still. This is the best I've got:

  (COPY_TO_REGCLASS (f32 (INSERT_SUBREG (IMPLICIT_DEF),
                 (!cast<Instruction>(NAME#"i16_indexed")
                   (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rd, FPR32)), hsub),
                   (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rn, FPR32)), hsub),
                   V128:$Rm, VectorIndexH:$idx),
                 hsub)), FPR32)>;

All the operands are going to be i32 types, so we need to make sure they're in the FPR32 register class before we try to take the "hsub" subregister from them; that's what the `COPY_TO_REGCLASS` nodes are for. We then end up with an `f32` value, and to merge it back into the `i32` the pattern must return, I've added another COPY_TO_REGCLASS so the result of the entire pattern is simply in `FPR32` (to which both i32 and f32 can be allocated).

This produces:

  fmov    s1, w1
  fmov    s2, w0
  sqrdmlah        h2, h1, v0.h[3]
  fmov    w0, s2
  ret

Which is what we want. It can also produce chained sqrdmlahs, such as:

  fmov    s1, w1
  fmov    s2, w0
  sqrdmlah        h2, h1, v0.h[3]
  sqrdmlah        h2, h1, v0.h[2]
  fmov    w0, s2
  ret

So I think this is certainly a valid way of implementing those intrinsics.
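
To make the end goal concrete, user-level code along these lines should exercise the chained case. I'm using the ACLE v8.1a scalar names (`vqrdmlahh_s16`, `vgetq_lane_s16`) as an assumption about the eventual frontend mapping; the builtins this patch actually exposes may differ:

  // Hypothetical source for the chained sequence above (ACLE v8.1a names;
  // an assumption, not something this patch necessarily provides).
  #include <arm_neon.h>

  int16_t chained(int16_t acc, int16_t x, int16x8_t v) {
    acc = vqrdmlahh_s16(acc, x, vgetq_lane_s16(v, 3));  // sqrdmlah h2, h1, v0.h[3]
    acc = vqrdmlahh_s16(acc, x, vgetq_lane_s16(v, 2));  // sqrdmlah h2, h1, v0.h[2]
    return acc;
  }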

I'm not sure of the best way forward here. @Tim, would you mind checking my tomfoolery above and seeing whether you agree? If so, implementing these is quite involved, so it might be better done in a separate patch.

Cheers,

James


REPOSITORY
  rL LLVM

http://reviews.llvm.org/D8502
