[PATCH] D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth
JinGu Kang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Feb 10 02:37:04 PST 2022
jaykang10 added a comment.
In D118979#3308289 <https://reviews.llvm.org/D118979#3308289>, @sdesmalen wrote:
> I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an `add <4 x i64> %x, %y` will be legalized into two adds. Conceptually this is similar to vectorizing with `<2 x i64>` and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.
>
> Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?
Thanks for the comment, @sdesmalen!
Let's look at a code snippet.
int test(int start, int size, char *src, char *dst) {
  int res = 0;
  for (int i = start; i < size; ++i) {
    res += *dst ^ *src;
    dst++;
    src++;
  }
  return res;
}
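Conceptually, choosing VF=16 strip-mines this loop so that each vector iteration consumes 16 elements, with a scalar epilogue for the remainder. A scalar C model of that transformation (a hypothetical sketch to illustrate the shape, not the actual LV output):

```c
/* Scalar model of the VF=16 vectorized loop: the main body consumes
   16 elements per iteration; a scalar epilogue handles the rest.
   Computes the same reduction as test() above. */
int test_vf16_model(int start, int size, char *src, char *dst) {
  int res = 0;
  int n = size - start;            /* total trip count */
  int main_iters = n / 16;         /* full "vector" iterations */
  int i = 0;
  for (int v = 0; v < main_iters; ++v)       /* vector body, VF=16 */
    for (int lane = 0; lane < 16; ++lane, ++i)
      res += dst[i] ^ src[i];
  for (; i < n; ++i)               /* scalar epilogue */
    res += dst[i] ^ src[i];
  return res;
}
```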
The assembly output of the vectorized loop is shown below.
Without this patch, VF 4 is selected:
.LBB0_5: // %vector.body
// =>This Inner Loop Header: Depth=1
ldp s3, s4, [x12, #-4]
ldp s5, s6, [x8, #-4]
add x8, x8, #8
add x12, x12, #8
subs x13, x13, #8
ushll v3.8h, v3.8b, #0
ushll v4.8h, v4.8b, #0
ushll v5.8h, v5.8b, #0
ushll v6.8h, v6.8b, #0
eor v3.8b, v5.8b, v3.8b
eor v4.8b, v6.8b, v4.8b
ushll v3.4s, v3.4h, #0
ushll v4.4s, v4.4h, #0
and v3.16b, v3.16b, v1.16b
and v4.16b, v4.16b, v1.16b
add v0.4s, v0.4s, v3.4s
add v2.4s, v2.4s, v4.4s
b.ne .LBB0_5
With this patch, VF 16 is selected:
.LBB0_5: // %vector.body
// =>This Inner Loop Header: Depth=1
ldp q16, q18, [x12, #-16]
add x12, x12, #32
subs x13, x13, #32
ldp q17, q19, [x8, #-16]
add x8, x8, #32
eor v16.16b, v17.16b, v16.16b
eor v17.16b, v19.16b, v18.16b
ushll2 v18.8h, v16.16b, #0
ushll v16.8h, v16.8b, #0
ushll v19.8h, v17.8b, #0
ushll2 v17.8h, v17.16b, #0
uaddw2 v2.4s, v2.4s, v18.8h
uaddw v1.4s, v1.4s, v18.4h
uaddw2 v3.4s, v3.4s, v16.8h
uaddw v0.4s, v0.4s, v16.4h
uaddw2 v6.4s, v6.4s, v17.8h
uaddw v5.4s, v5.4s, v17.4h
uaddw2 v7.4s, v7.4s, v19.8h
uaddw v4.4s, v4.4s, v19.4h
b.ne .LBB0_5
We can see the uaddw instructions in the output with `VF=16`. AArch64 has the pattern definition below, which is selected for the uaddw.
multiclass SIMDWideThreeVectorBHS<bit U, bits<4> opc, string asm,
                                  SDPatternOperator OpNode> {
  ...
  def v4i16_v4i32 : BaseSIMDDifferentThreeVector<U, 0b010, opc,
                                                 V128, V128, V64,
                                                 asm, ".4s", ".4s", ".4h",
      [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn), (v4i16 V64:$Rm)))]>;
  def v8i16_v4i32 : BaseSIMDDifferentThreeVector<U, 0b011, opc,
                                                 V128, V128, V128,
                                                 asm#"2", ".4s", ".4s", ".8h",
      [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn),
                                      (extract_high_v8i16 V128:$Rm)))]>;
  ...
}

defm UADDW : SIMDWideThreeVectorBHS<1, 0b0001, "uaddw",
                 BinOpFrag<(add node:$LHS, (zanyext node:$RHS))>>;
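As a sanity check on what these patterns compute, here is a minimal C model of the uaddw/uaddw2 semantics as I understand them (helper names are made up, not from the patch):

```c
#include <stdint.h>

/* Model of "uaddw v0.4s, v0.4s, vN.4h": accumulate the *low* four
   u16 lanes of the source, zero-extended to u32, into the four
   u32 accumulator lanes. */
static void uaddw_model(uint32_t acc[4], const uint16_t src[8]) {
  for (int lane = 0; lane < 4; ++lane)
    acc[lane] += src[lane];
}

/* Model of "uaddw2": the same, but using the *high* four u16 lanes
   (the extract_high_v8i16 operand in the pattern above). */
static void uaddw2_model(uint32_t acc[4], const uint16_t src[8]) {
  for (int lane = 0; lane < 4; ++lane)
    acc[lane] += src[lane + 4];
}
```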
Given the number of instructions in each loop body, we could ideally expect the VF=16 loop to handle almost 4 times more data per iteration.
As @dmgreen mentioned, we are seeing some performance degradations. In Dave's case, it looks like the LV generates shufflevectors that block lowering the MUL to SMULL. As another case, if the LV detects an interleaved group, it generates shufflevectors with a large number of elements, and those shufflevectors cause lots of `mov` instructions. There may be more cases of performance degradation, but the wider VF could also show us more opportunities to get a better performance score. That's what I want from this patch...
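As a hypothetical illustration of the interleaved-group case (not one of the actual regressing benchmarks): a loop like the one below forms an interleave group of two strided loads, and at a wide VF the LV has to de-interleave the loaded vector with shufflevectors over many elements.

```c
/* Two strided accesses src[2*i] and src[2*i+1] form an interleave
   group; at a wide VF the vectorizer loads the pairs contiguously
   and separates even/odd lanes with shufflevectors. */
int sum_pairs(const unsigned char *src, int n) {
  int res = 0;
  for (int i = 0; i < n; ++i)
    res += src[2 * i] ^ src[2 * i + 1];
  return res;
}
```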
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D118979/new/
https://reviews.llvm.org/D118979