[PATCH] D104236: [AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv
JinGu Kang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Jun 15 02:59:54 PDT 2021
jaykang10 added a comment.
In D104236#2818622 <https://reviews.llvm.org/D104236#2818622>, @jaykang10 wrote:
> In D104236#2817995 <https://reviews.llvm.org/D104236#2817995>, @dmgreen wrote:
>
>> Can we add the other types too? It's good to add all the varieties if we can.
>>
>> Maybe they can be tested with intrinsics? vecreduce of neon.uaddlp?
>
> Yep, let me try to add them.
Um... I have tried the patterns below.
def : Pat<(v4i16 (AArch64uaddv (v4i16 (AArch64uaddlp (v8i8 V64:$op))))),
(INSERT_SUBREG (v4i16 (IMPLICIT_DEF)), (UADDLVv4i16v V64:$op), hsub)>;
def : Pat<(v8i16 (AArch64uaddv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))),
(INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv16i8v V128:$op), hsub)>;
The tests are as follows.
declare <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8>) nounwind readnone
declare <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v4i16(<4 x i16>) nounwind readnone
declare i16 @llvm.vector.reduce.add.v8i16(<8 x i16>) nounwind readnone
define i16 @addv4h(<8 x i8>* %A) nounwind {
%tmp1 = load <8 x i8>, <8 x i8>* %A
%tmp3 = call <4 x i16> @llvm.aarch64.neon.uaddlp.v4i16.v8i8(<8 x i8> %tmp1)
%tmp5 = call i16 @llvm.vector.reduce.add.v4i16(<4 x i16> %tmp3)
ret i16 %tmp5
}
define i16 @addv8h(<16 x i8>* %A) nounwind {
%tmp1 = load <16 x i8>, <16 x i8>* %A
%tmp3 = call <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8> %tmp1)
%tmp5 = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> %tmp3)
ret i16 %tmp5
}
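For reference, the fold these tests exercise is sound because a pairwise widening add (`uaddlp`) followed by an across-lanes add (`addv`) computes the same value as a single widening across-lanes add (`uaddlv`). A minimal C sketch of the `v8i8` -> `v4i16` -> `i16` case; these are plain scalar models of the semantics, not the actual intrinsics:

```c
#include <stdint.h>

/* Model of llvm.aarch64.neon.uaddlp.v4i16.v8i8: pairwise add of
   adjacent byte lanes, each sum widened to u16. */
static void uaddlp_v4i16_v8i8(const uint8_t in[8], uint16_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)in[2 * i] + (uint16_t)in[2 * i + 1];
}

/* Model of llvm.vector.reduce.add.v4i16: sum all four u16 lanes
   (wrapping modulo 2^16). */
static uint16_t reduce_add_v4i16(const uint16_t v[4]) {
    uint16_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += v[i];
    return sum;
}

/* Model of UADDLV.8B: widening sum of all eight bytes in one step. */
static uint16_t uaddlv_v8i8(const uint8_t in[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += in[i];
    return sum;
}
```

For any input, `reduce_add_v4i16(uaddlp_v4i16_v8i8(x))` equals `uaddlv_v8i8(x)`, which is what justifies selecting the single UADDLV instruction for the two-node DAG.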
Before instruction selection, the SelectionDAG of addv4h is as below.
Optimized legalized selection DAG: %bb.0 'addv4h:'
SelectionDAG has 14 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: v8i8,ch = load<(load 8 from %ir.A)> t0, t2, undef:i64
t20: v4i16 = AArch64ISD::UADDLP t5
t14: v4i16 = AArch64ISD::UADDV t20
t18: v8i16 = insert_subvector undef:v8i16, t14, Constant:i64<0>
t19: i32 = extract_vector_elt t18, Constant:i64<0>
t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
t12: ch = AArch64ISD::RET_FLAG t11, Register:i32 $w0, t11:1
The `insert_subvector` causes the existing pattern below to be selected, rather than the pattern I defined above.
def : Pat<(i32 (vector_extract (insert_subvector undef,
(v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
hsub), ssub)>;
In the end, addv4h selects instructions as below.
Selected selection DAG: %bb.0 'addv4h:'
SelectionDAG has 15 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: v8i8,ch = LDRDui<Mem:(load 8 from %ir.A)> t2, TargetConstant:i64<0>, t0
t20: v4i16 = UADDLPv8i8_v4i16 t5
t22: bf16 = ADDVv4i16v t20
t24: v4i16 = INSERT_SUBREG IMPLICIT_DEF:v4i16, t22, TargetConstant:i32<7>
t19: i32 = EXTRACT_SUBREG t24, TargetConstant:i32<14>
t11: ch,glue = CopyToReg t0, Register:i32 $w0, t19
t12: ch = RET_ReallyLR Register:i32 $w0, t11, t11:1
The other types for `addv` are in a similar situation, except `v4i32`. They are selected by one of the patterns below.
multiclass SIMDAcrossLanesIntrinsic<string baseOpc,
SDPatternOperator opNode> {
...
// If none did, fallback to the explicit patterns, consuming the vector_extract.
def : Pat<(i32 (vector_extract (insert_subvector undef, (v8i8 (opNode V64:$Rn)),
(i64 0)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v8i8 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v8i8v")) V64:$Rn),
bsub), ssub)>;
def : Pat<(i32 (vector_extract (v16i8 (opNode V128:$Rn)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v16i8v")) V128:$Rn),
bsub), ssub)>;
def : Pat<(i32 (vector_extract (insert_subvector undef,
(v4i16 (opNode V64:$Rn)), (i64 0)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v4i16 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v4i16v")) V64:$Rn),
hsub), ssub)>;
def : Pat<(i32 (vector_extract (v8i16 (opNode V128:$Rn)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v8i16v")) V128:$Rn),
hsub), ssub)>;
def : Pat<(i32 (vector_extract (v4i32 (opNode V128:$Rn)), (i64 0))),
(EXTRACT_SUBREG (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
(!cast<Instruction>(!strconcat(baseOpc, "v4i32v")) V128:$Rn),
ssub), ssub)>;
}
The `v4i32` case is selected by the pattern I defined for `addv` because there is a separate pattern for `extractelt` with `i32`.
// Extracting lane zero is a special case where we can just use a plain
// EXTRACT_SUBREG instruction, which will become FMOV. This is easier for the
// rest of the compiler, especially the register allocator and copy propagation,
// to reason about, so is preferred when it's possible to use it.
let AddedComplexity = 10 in {
def : Pat<(i64 (extractelt (v2i64 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, dsub)>;
def : Pat<(i32 (extractelt (v4i32 V128:$V), (i64 0))), (EXTRACT_SUBREG V128:$V, ssub)>;
def : Pat<(i32 (extractelt (v2i32 V64:$V), (i64 0))), (EXTRACT_SUBREG V64:$V, ssub)>;
}
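As the comment in those patterns notes, lane zero of a NEON register on little-endian AArch64 sits in the lowest bits of the register, so extracting it needs no data movement beyond a subregister read (which becomes FMOV). A small C illustration of that lane layout; this assumes a little-endian target and is illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* On a little-endian target, lane 0 of a v4i32 vector is simply the
   low 32 bits of the 128-bit value, i.e. the bytes at the lowest
   address of its in-memory image. */
static uint32_t extract_lane0_v4i32(const uint32_t v[4]) {
    uint8_t bytes[16];
    memcpy(bytes, v, 16);      /* reinterpret the vector as raw bytes */
    uint32_t lane0;
    memcpy(&lane0, bytes, 4);  /* the first four bytes are lane 0 */
    return lane0;
}
```

Since no lane shuffle is needed, a plain EXTRACT_SUBREG of `ssub` is enough, which is why these patterns win for the `v4i32` case.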
In order to support the other types of the `addv` pattern, we might need to touch the existing patterns... What do you think about it, @dmgreen? If possible, I would prefer not to touch the existing patterns... If I missed something, please let me know.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D104236/new/
https://reviews.llvm.org/D104236