[PATCH] D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth

JinGu Kang via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Feb 10 02:37:04 PST 2022


jaykang10 added a comment.

In D118979#3308289 <https://reviews.llvm.org/D118979#3308289>, @sdesmalen wrote:

> I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an `add <4 x i64> %x, %y` will be legalized into two adds. Conceptually this is similar to vectorizing with `<2 x i64>` and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.
>
> Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?

Thanks for the comment, @sdesmalen!

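For context on the type-legalization point: an `add <4 x i64>` on 128-bit NEON is indeed split in two. A rough C sketch of the equivalent computation (my illustration, not code from the review):

  /* A 4 x i64 add has no single 128-bit NEON instruction, so the
     backend legalizes it into two 2 x i64 adds -- conceptually the
     same as vectorizing with <2 x i64> at interleave factor 2. */
  void add4(long long a[4], const long long b[4]) {
    for (int i = 0; i < 4; ++i)
      a[i] += b[i]; /* expected to lower to two 'add v.2d' operations */
  }
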
Let's look at a code snippet.

  int test(int start, int size, char *src, char *dst) {
    int res = 0;
    for (int i = start; i < size; ++i) {
      res += *dst ^ *src;
      dst++;
      src++;
    }
  
    return res;
  }

The assembly output of the vectorized loop is shown below.

  without this patch  --> VF 4 is selected.
  .LBB0_5:                                // %vector.body
                                          // =>This Inner Loop Header: Depth=1
  	ldp	s3, s4, [x12, #-4]
  	ldp	s5, s6, [x8, #-4]
  	add	x8, x8, #8
  	add	x12, x12, #8
  	subs	x13, x13, #8
  	ushll	v3.8h, v3.8b, #0
  	ushll	v4.8h, v4.8b, #0
  	ushll	v5.8h, v5.8b, #0
  	ushll	v6.8h, v6.8b, #0
  	eor	v3.8b, v5.8b, v3.8b
  	eor	v4.8b, v6.8b, v4.8b
  	ushll	v3.4s, v3.4h, #0
  	ushll	v4.4s, v4.4h, #0
  	and	v3.16b, v3.16b, v1.16b
  	and	v4.16b, v4.16b, v1.16b
  	add	v0.4s, v0.4s, v3.4s
  	add	v2.4s, v2.4s, v4.4s
  	b.ne	.LBB0_5

  with this patch  --> VF 16 is selected.
  .LBB0_5:                                // %vector.body
                                          // =>This Inner Loop Header: Depth=1
  	ldp	q16, q18, [x12, #-16]
  	add	x12, x12, #32
  	subs	x13, x13, #32
  	ldp	q17, q19, [x8, #-16]
  	add	x8, x8, #32
  	eor	v16.16b, v17.16b, v16.16b
  	eor	v17.16b, v19.16b, v18.16b
  	ushll2	v18.8h, v16.16b, #0
  	ushll	v16.8h, v16.8b, #0
  	ushll	v19.8h, v17.8b, #0
  	ushll2	v17.8h, v17.16b, #0
  	uaddw2	v2.4s, v2.4s, v18.8h
  	uaddw	v1.4s, v1.4s, v18.4h
  	uaddw2	v3.4s, v3.4s, v16.8h
  	uaddw	v0.4s, v0.4s, v16.4h
  	uaddw2	v6.4s, v6.4s, v17.8h
  	uaddw	v5.4s, v5.4s, v17.4h
  	uaddw2	v7.4s, v7.4s, v19.8h
  	uaddw	v4.4s, v4.4s, v19.4h
  	b.ne	.LBB0_5

We can see the uaddw instructions in the output with `VF=16`. AArch64 has the pattern definition below, which is selected for uaddw.

  multiclass SIMDWideThreeVectorBHS<bit U, bits<4> opc, string asm,
                                    SDPatternOperator OpNode> {
  ...
    def v4i16_v4i32  : BaseSIMDDifferentThreeVector<U, 0b010, opc,
                                                    V128, V128, V64,
                                                    asm, ".4s", ".4s", ".4h",
         [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn), (v4i16 V64:$Rm)))]>;
    def v8i16_v4i32  : BaseSIMDDifferentThreeVector<U, 0b011, opc,
                                                    V128, V128, V128, 
                                                    asm#"2", ".4s", ".4s", ".8h",
         [(set (v4i32 V128:$Rd), (OpNode (v4i32 V128:$Rn),
                                         (extract_high_v8i16 V128:$Rm)))]>;
  ...
  defm UADDW   : SIMDWideThreeVectorBHS<1, 0b0001, "uaddw",
                   BinOpFrag<(add node:$LHS, (zanyext node:$RHS))>>;

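In scalar terms, `uaddw` folds the zero-extend into the accumulate. A small model of its per-lane semantics (a sketch of the instruction's behaviour, not compiler output):

  /* Model of 'uaddw v0.4s, v0.4s, v16.4h': each 32-bit accumulator
     lane adds one zero-extended 16-bit element, so no separate ushll
     is needed for the extension. */
  void uaddw_model(unsigned int acc[4], const unsigned short x[4]) {
    for (int lane = 0; lane < 4; ++lane)
      acc[lane] += x[lane]; /* implicit zero-extension from u16 to u32 */
  }
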
Comparing the two loops, the instruction counts are similar, but the `VF=4` loop advances by 8 bytes per iteration (`subs x13, x13, #8`) while the `VF=16` loop advances by 32 bytes (`subs x13, x13, #32`), so ideally we could expect it to handle almost four times more data per iteration.
As @dmgreen mentioned, we are seeing some performance degradations. In Dave's case, it looks like the LV generates shuffle vectors that block lowering the MUL to SMULL (see the sketch below). As another case, when the LV detects an interleaved group, it generates shuffle vectors with a large number of elements, and those shuffle vectors cause lots of `mov` instructions. There may be more cases of performance degradation, but this change could also expose more opportunities for better performance. That's what I want from this patch...

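To make the SMULL case concrete, here is a hypothetical reproducer (an assumption about the shape of the regression, not the actual benchmark):

  /* sext(i16) * sext(i16) -> i32 should lower to smull/smull2, but
     extra shuffle vectors from the vectorizer can hide the extends
     and block that pattern match. */
  int dot16(const short *a, const short *b, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
      sum += a[i] * b[i]; /* widening multiply, ideally smull */
    return sum;
  }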

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118979/new/

https://reviews.llvm.org/D118979
