[PATCH] D53865: [LoopVectorizer] Improve computation of scalarization overhead.

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Nov 29 02:08:28 PST 2018


jonpa added a comment.

> Making better decisions what to vectorize and what to keep scalar is clearly useful enough to include in the loop vectorizer. However, this should best be done in a target independent way; e.g., how computePredInstDiscount() and sinkScalarOperands() work to expand the scope of scalarized instructions according to the cumulative cost discount of potentially scalarized instruction chains. Unless there's a good reason for it to be target specific(?)

The only target-specific part I am thinking about is which instructions will later be expanded during *isel*.

> My question back to you is why Scalars is not good enough for your purpose. You get different "scalarization" answers in collectLoopScalars() and collectTargetScalarized()?

My understanding is that currently the LoopVectorizer's notion of a scalarized instruction refers to an *LLVM IR* scalarized instruction. In other words, these are the instructions that the vectorizer itself will emit in scalarized form, and they are the instructions contained in Scalars[VF].

As an example, consider this loop:

  define void @fun(i64 %NumIters, float* %Ptr1, float* %Ptr2, float* %Dst) {
  entry:
    br label %for.body
  
  for.body:
    %IV  = phi i64 [ 0, %entry ], [ %IVNext, %for.body ]
    %GEP1 = getelementptr inbounds float, float* %Ptr1, i64 %IV
    %LD1 = load float, float* %GEP1
    %GEP2 = getelementptr inbounds float, float* %Ptr2, i64 %IV
    %LD2 = load float, float* %GEP2
    %mul = fmul float %LD1, %LD2
    %add = fadd float %mul, %LD2
    store float %add, float* %GEP1
    %IVNext = add nuw nsw i64 %IV, 1
    %exitcond = icmp eq i64 %IVNext, %NumIters
    br i1 %exitcond, label %exit, label %for.body
  
  exit:
    ret void
  }

This loop is interesting because z13 does not support vector float operations, so they are expanded to scalar instructions during instruction selection. If vectorization is forced (even though the cost model would normally prevent it) with

  clang -S -o - -O3 -march=z13 tc_targscal.ll -mllvm -unroll-count=1 -mllvm -debug-only=loop-vectorize -mllvm -force-vector-width=4

, the loop vectorizer produces this loop:

  vector.body:                                      ; preds = %vector.body, %vector.ph
    %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
    %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
    %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
    %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
    %0 = add i64 %index, 0
    %1 = getelementptr inbounds float, float* %Ptr1, i64 %0
    %2 = getelementptr inbounds float, float* %1, i32 0
    %3 = bitcast float* %2 to <4 x float>*
    %wide.load = load <4 x float>, <4 x float>* %3, align 4, !alias.scope !0, !noalias !3
    %4 = getelementptr inbounds float, float* %Ptr2, i64 %0
    %5 = getelementptr inbounds float, float* %4, i32 0
    %6 = bitcast float* %5 to <4 x float>*
    %wide.load6 = load <4 x float>, <4 x float>* %6, align 4, !alias.scope !3
    %7 = fmul <4 x float> %wide.load, %wide.load6
    %8 = fadd <4 x float> %wide.load6, %7
    %9 = bitcast float* %2 to <4 x float>*
    store <4 x float> %8, <4 x float>* %9, align 4, !alias.scope !0, !noalias !3
    %index.next = add i64 %index, 4
    %10 = icmp eq i64 %index.next, %n.vec
    br i1 %10, label %middle.block, label %vector.body, !llvm.loop !5

The cost computation looked like:

  LV: Found an estimated cost of 0 for VF 4 For instruction:   %IV = phi i64 [ 0, %entry ], [ %IVNext, %for.body ]
  LV: Found an estimated cost of 0 for VF 4 For instruction:   %GEP1 = getelementptr inbounds float, float* %Ptr1, i64 %IV
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %LD1 = load float, float* %GEP1, align 4
  LV: Found an estimated cost of 0 for VF 4 For instruction:   %GEP2 = getelementptr inbounds float, float* %Ptr2, i64 %IV
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %LD2 = load float, float* %GEP2, align 4
  LV: Found an estimated cost of 16 for VF 4 For instruction:   %mul = fmul float %LD1, %LD2
  LV: Found an estimated cost of 16 for VF 4 For instruction:   %add = fadd float %LD2, %mul
  LV: Found an estimated cost of 1 for VF 4 For instruction:   store float %add, float* %GEP1, align 4
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %IVNext = add nuw nsw i64 %IV, 1
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond = icmp eq i64 %IVNext, %NumIters
  LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond, label %exit, label %for.body

So the target has calculated the cost of each vectorized float operation as 2 x 4 extracts + 4 scalar mul/add + 4 inserts = 16. The loop vectorizer has produced vector instructions, and as far as it is concerned, that's what they are.

However, the assembly output looks like:

  .LBB0_4:                                # %vector.body
                                          # =>This Inner Loop Header: Depth=1
          vl      %v2, 0(%r1,%r3)
          vl      %v3, 0(%r1,%r4)
          vrepf   %v0, %v3, 1
          vrepf   %v1, %v2, 1
          vrepf   %v4, %v2, 2
          vrepf   %v5, %v2, 3
          vrepf   %v6, %v3, 2
          vrepf   %v7, %v3, 3
          meebr   %f1, %f0
          meebr   %f2, %f3
          meebr   %f4, %f6
          meebr   %f5, %f7
          aebr    %f5, %f7
          aebr    %f4, %f6
          aebr    %f2, %f3
          aebr    %f1, %f0
          aghi    %r5, -4
          vmrhf   %v4, %v4, %v5
          vmrhf   %v0, %v2, %v1
          vmrhg   %v0, %v0, %v4
          vst     %v0, 0(%r1,%r3)
          la      %r1, 16(%r1)
          jne     .LBB0_4

, which is 2 vector loads + 6 extracts (fp element 0 overlaps the vector register and needs no extract) + 4 fp multiplies (meebr) + 4 fp adds (aebr) + 3 inserts + a vector store.

There is no need to insert into and extract from a vector register between the meebr and aebr instructions: the fadd consumes the fmul's scalar results directly. This is where the costs of 16 are wrong; they should have been lower.

My question, then, is how to fix this. My idea was to let the loop vectorizer keep treating these instructions as vectorized while being aware of a later expansion during isel (perhaps ISelExpanded would be a better name than TargetScalarized?).

This could be used to compute better scalarization overhead costs. It would also be interesting, at least on SystemZ, to use this to emit scalar loads/stores instead of widening when the def/user is "isel-expanded". This could perhaps be controlled by TTI.supportsEfficientVectorElementLoadStore(), which is already available.

Is this making sense?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D53865/new/

https://reviews.llvm.org/D53865
