[PATCH] D104236: [AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv
JinGu Kang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Jun 15 02:59:54 PDT 2021
jaykang10 added a comment.
In D104236#2818622 <https://reviews.llvm.org/D104236#2818622>, @jaykang10 wrote:
> In D104236#2817995 <https://reviews.llvm.org/D104236#2817995>, @dmgreen wrote:
>
>> Can we add the other types too? It's good to add all the varieties if we can.
>>
>> Maybe they can be tested with intrinsics? vecreduce of neon.uaddlp?
>
> Yep, let me try to add them.
Um... I have tried the patterns below.
def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
(INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), hsub)>;
def : Pat<(v8i16 (AArch64uaddv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))),
(INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv16i8v V128:$op), hsub)>;
The tests are as follows.
declare <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>) nounwind readnone
declare <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>) nounwind readnone
define i16 @addv4h(<8 x i8>* %A) nounwind {
%tmp1 = load <8 x i8>, <8 x i8>* %A
%tmp3 = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %tmp1)
%tmp5 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %tmp3)
ret i16 %tmp5
}
define i16 @addv8h(<16 x i8>* %A) nounwind {
%tmp1 = load <16 x i8>, <16 x i8>* %A
%tmp3 = call <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8> %tmp1)
%tmp5 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %tmp3)
ret i16 %tmp5
}
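For reference, the fold these tests exercise is sound because a pairwise widening add (`uaddlp`) followed by an across-lanes add (`addv`) computes the same value as a single widening across-lanes add (`uaddlv`). A minimal C sketch of the `v8i8` -> `v4i16` -> `i16` case; these are plain scalar models of the semantics, not the actual intrinsics:

```c
#include <stdint.h>

/* Model of llvm.aarch64.neon.uaddlp.v4i16.v8i8: pairwise add of
   adjacent byte lanes, each sum widened to u16. */
static void uaddlp_v4i16_v8i8(const uint8_t in[8], uint16_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)in[2 * i] + (uint16_t)in[2 * i + 1];
}

/* Model of llvm.vector.reduce.add.v4i16: sum all four u16 lanes
   (wrapping modulo 2^16). */
static uint16_t reduce_add_v4i16(const uint16_t v[4]) {
    uint16_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += v[i];
    return sum;
}

/* Model of UADDLV.8B: widening sum of all eight bytes in one step. */
static uint16_t uaddlv_v8i8(const uint8_t in[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += in[i];
    return sum;
}
```

For any input, `reduce_add_v4i16(uaddlp_v4i16_v8i8(x))` equals `uaddlv_v8i8(x)`, which is what justifies selecting the single UADDLV instruction for the two-node DAG.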
Before instruction selection, the SelectionDAG of addv4h is as below.
Optimized legalized selection DAG: %bb.0 'addv4h:'
SelectionDAG has 14 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: v8i8,ch = load<(load 8 from %ir.A)> t0, t2, undef:i64
t20: v4i16 = AArch64ISD::UADDLP t5
t14: v4i16 = AArch64ISD::UADDV t20
t18: v8i16 = insert_subvector undef:v8i16, t14, Constant:i64<0>
t19: i32 = extract_vector_elt t18, Constant:i64<0>
t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
t12: ch = AArch64ISD::RET_FLAG t11, Register:i32 $w0, t11:1
The `insert_subvector` causes the existing pattern below to be selected, rather than the pattern I defined above.
def : Pat<(i32 (vector_extract (insert_subvector undef,
(v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
hsub), ssub)>;
In the end, addv4h selects instructions as below.
Selected selection DAG: %bb.0 'addv4h:'
SelectionDAG has 15 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: v8i8,ch = LDRDui<Mem:(load 8 from %ir.A)> t2, TargetConstant:i64<0>, t0
t20: v4i16 = UADDLPv8i8_v4i16 t5
t22: bf16 = ADDVv4i16v t20
t24: v4i16 = INSERT_SUBREG IMPLICIT_DEF:v4i16, t22, TargetConstant:i32<7>
t19: i32 = EXTRACT_SUBREG t24, TargetConstant:i32<14>
t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
t12: ch = RET_ReallyLR Register:i32 $w0, t11, t11:1
The other types for `addv` are in a similar situation, except `v4i32`. They are selected by one of the patterns below.
multiclass SIMDAcrossLanesIntrinsic<string baseOpc,
SDPatternOperator opNode> {
...
// If none did, fallback to the explicit patterns, consuming the vector_extract.
def : Pat<(i32 (vector_extract (insert_subvector undef, (v8i8 (opNode V64:$Rn)),
(i64 0)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v8i8 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$Rn),
bsub), ssub)>;
def : Pat<(i32 (vector_extract (v16i8 (opNode V128:$Rn)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$Rn),
bsub), ssub)>;
def : Pat<(i32 (vector_extract (insert_subvector undef,
(v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
hsub), ssub)>;
def : Pat<(i32 (vector_extract (v8i16 (opNode V128:$Rn)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) V128:$Rn),
hsub), ssub)>;
def : Pat<(i32 (vector_extract (v4i32 (opNode V128:$Rn)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v4i32v")) V128:$Rn),
ssub), ssub)>;
}
The `v4i32` case is selected by the pattern I defined for `addv` because there is a separate pattern for `extractelt` with `i32`.
// Extracting lane zero is a special case where we can just use a plain
// EXTRACT_SUBREG instruction, which will become FMOV. This is easier for the
// rest of the compiler, especially the register allocator and copy propagation,
// to reason about, so is preferred when it's possible to use it.
let AddedComplexity = 10 in {
def : Pat<(i64 (extractelt (v2i64 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, dsub)>;
def : Pat<(i32 (extractelt (v4i32 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, ssub)>;
def : Pat<(i32 (extractelt (v2i32 V64:$V), (i64 0))), (EXTRACT_SUBREG V64:$V, ssub)>;
}
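As the comment in those patterns notes, lane zero of a NEON register on little-endian AArch64 sits in the lowest bits of the register, so extracting it needs no data movement beyond a subregister read (which becomes FMOV). A small C illustration of that lane layout; this assumes a little-endian target and is illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* On a little-endian target, lane 0 of a v4i32 vector is simply the
   low 32 bits of the 128-bit value, i.e. the bytes at the lowest
   address of its in-memory image. */
static uint32_t extract_lane0_v4i32(const uint32_t v[4]) {
    uint8_t bytes[16];
    memcpy(bytes, v, 16);      /* reinterpret the vector as raw bytes */
    uint32_t lane0;
    memcpy(&lane0, bytes, 4);  /* the first four bytes are lane 0 */
    return lane0;
}
```

Since no lane shuffle is needed, a plain EXTRACT_SUBREG of `ssub` is enough, which is why these patterns win for the `v4i32` case.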
In order to support the other types of the `addv` pattern, we might need to touch the existing patterns... What do you think about it, @dmgreen? If possible, I would prefer not to touch the existing patterns... If I missed something, please let me know.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D104236/new/
https://reviews.llvm.org/D104236