[PATCH] D114779: [LV] Remove `LoopVectorizationCostModel::useEmulatedMaskMemRefHack()`

Dave Green via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Thu Feb 10 00:38:12 PST 2022


dmgreen added a comment.

OK. I think the problems fit into 4 or maybe 5 different categories.

First up - there were some very large downstream regressions we had which certainly fit into the target-dependent bucket. We were allowing tail folding without masked gather/scatter support, which was causing some very bad codegen from scalarized loads/stores. That was simple enough to fix on Tuesday, though, by disallowing tail folding in those cases. I would still hope that the cost model would handle it, but at least those problems are no more.

The next two most obvious from the tests here are with optsize and with low trip counts. The optsize tests in LoopVectorize/optsize.ll look both much larger and much slower to me, with all that branching - not something you would ideally want to do at Os. The tests in LoopVectorize/tripcount.ll also look worse to me, and are similar to one of the reports I got about this patch. A loop with a very low trip count is often not worth vectorizing, especially if it is going to produce very inefficient predicated branching code.
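To illustrate the low-trip-count case, the shape of loop I mean is roughly this (an illustrative sketch, not taken from tripcount.ll - the function name and trip count are made up):

```c
#include <stdint.h>

/* Illustrative only -- not from tripcount.ll. With a trip count this
 * low, the fixed overhead of a tail-folded vector body (masked
 * accesses, or predicated branching around scalarized loads/stores)
 * can easily outweigh just running three scalar iterations. */
void scale3(int32_t *restrict a, const int32_t *restrict b) {
  for (int i = 0; i < 3; i++)
    a[i] = b[i] * 2;
}
```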

The code in LoopVectorize/AArch64/tail-fold-uniform-memops.ll also looks worse to me. It's difficult to see why a target with gather/scatter support should choose to use predicated scalarized loads/stores instead. But I haven't looked into the details. Perhaps that one fits into the target-independent cost model going wrong, but whatever it is, it sounds like something that should be fixed before we remove the hack.

Of the other cases I have, one may be similar to the low-trip-count cases. I can't really share the original, but it had a lot of other intrinsic code in it for producing matrix multiplies. There is hopefully a cut-down version here: https://godbolt.org/z/c311Y8j39, where it's difficult to see that the vectorized version will be better with all those broadcasts/blends/branches, even if it is doing more per iteration. Some of these required specific targets for the problem to come up - they could easily be hidden by other costs, like the cost of a VF2 mul being 6 under base x86. That one is under Skylake; the original was AArch64, and the performance was apparently up to 160% worse, even with all the other code in the original.

The last case is more straightforward in terms of what is getting predicated, but I'm having trouble at the moment seeing why it isn't a problem for every target. The code has some predicated loads, like this: https://godbolt.org/z/E7PdYrn4T. The vectorization seems a lot worse with so many difficult-to-predict branches.

In that case the vplan it is executing looks like this:

  VPlan 'Initial VPlan for VF={2,4},UF>=1' {
  Live-in vp<%0> = vector-trip-count
  
  <x1> vector loop: {
    for.body:
      EMIT vp<%1> = CANONICAL-INDUCTION
      WIDEN-INDUCTION %indvars.iv = phi 0, %indvars.iv.next
      WIDEN-REDUCTION-PHI ir<%nz.055> = phi ir<0>, ir<%or>
      CLONE ir<%arrayidx> = getelementptr ir<%dct>, ir<%indvars.iv>
      WIDEN ir<%0> = load ir<%arrayidx>
      WIDEN ir<%conv> = sext ir<%0>
      WIDEN ir<%cmp1> = icmp ir<%0>, ir<0>
      CLONE ir<%arrayidx4> = getelementptr ir<%bias>, ir<%indvars.iv>
      WIDEN ir<%1> = load ir<%arrayidx4>
      WIDEN ir<%conv5> = zext ir<%1>
    Successor(s): if.else
  
    if.else:
      WIDEN ir<%sub> = sub ir<%conv5>, ir<%conv>
      EMIT vp<%12> = not ir<%cmp1>
    Successor(s): pred.load
  
    <xVFxUF> pred.load: {
      pred.load.entry:
        BRANCH-ON-MASK vp<%12>
      Successor(s): pred.load.if, pred.load.continue
      CondBit: vp<%12> (if.else)
  
      pred.load.if:
        REPLICATE ir<%arrayidx22> = getelementptr ir<%mf>, ir<%indvars.iv>
        REPLICATE ir<%4> = load ir<%arrayidx22> (S->V)
      Successor(s): pred.load.continue
  
      pred.load.continue:
        PHI-PREDICATED-INSTRUCTION vp<%15> = ir<%4>
      No successors
    }
    Successor(s): if.else.0
  
    if.else.0:
      WIDEN ir<%conv23> = zext vp<%15>
      WIDEN ir<%mul24> = mul ir<%sub>, ir<%conv23>
      WIDEN ir<%5> = lshr ir<%mul24>, ir<16>
      WIDEN ir<%6> = trunc ir<%5>
      WIDEN ir<%conv27> = sub ir<0>, ir<%6>
    Successor(s): if.then
  
    if.then:
      WIDEN ir<%add> = add ir<%conv5>, ir<%conv>
    Successor(s): pred.load
  
    <xVFxUF> pred.load: {
      pred.load.entry:
        BRANCH-ON-MASK ir<%cmp1>
      Successor(s): pred.load.if, pred.load.continue
      CondBit: ir<%cmp1>
  
      pred.load.if:
        REPLICATE ir<%arrayidx10> = getelementptr ir<%mf>, ir<%indvars.iv>
        REPLICATE ir<%2> = load ir<%arrayidx10> (S->V)
      Successor(s): pred.load.continue
  
      pred.load.continue:
        PHI-PREDICATED-INSTRUCTION vp<%24> = ir<%2>
      No successors
    }
    Successor(s): if.then.0
  
    if.then.0:
      WIDEN ir<%conv11> = zext vp<%24>
      WIDEN ir<%mul> = mul ir<%add>, ir<%conv11>
      WIDEN ir<%3> = lshr ir<%mul>, ir<16>
      WIDEN ir<%conv12> = trunc ir<%3>
    Successor(s): if.end
  
    if.end:
      BLEND %storemerge = ir<%conv27>/vp<%12> ir<%conv12>/ir<%cmp1>
      WIDEN store ir<%arrayidx>, ir<%storemerge>
      WIDEN ir<%conv32> = sext ir<%storemerge>
      WIDEN ir<%or> = or ir<%nz.055>, ir<%conv32>
      EMIT vp<%33> = VF * UF +(nuw)  vp<%1>
      EMIT branch-on-count  vp<%33> vp<%0>
    No successors
  }
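For reference, here is a hypothetical C reconstruction of the loop that VPlan appears to describe, pieced together from the IR names in the dump (the dct/bias/mf names, the compare against zero, the `>> 16`, the negation on the else path, and the trip count of 16 all come from the dump above; everything else is my guess, not the original source):

```c
#include <stdint.h>

/* Hypothetical reconstruction of the loop behind the VPlan dump.
 * Both arms load from mf, but each load is predicated on one side of
 * %cmp1, which is where the two pred.load regions and the BLEND in
 * the plan come from. The mul matches the `mul nsw i32` in the IR,
 * so it assumes no i32 overflow, as the original does. */
static int quant_sketch(int16_t dct[16], const uint16_t bias[16],
                        const uint16_t mf[16]) {
  int nz = 0;                                       /* %nz.055 */
  for (int i = 0; i < 16; i++) {
    if (dct[i] > 0)                                 /* %cmp1 = icmp sgt */
      dct[i] = (int16_t)(((bias[i] + dct[i]) * mf[i]) >> 16);
    else
      dct[i] = (int16_t)(-(((bias[i] - dct[i]) * mf[i]) >> 16));
    nz |= dct[i];                                   /* %or reduction */
  }
  return nz;
}
```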

The costs look like:

  LV: Found an estimated cost of 0 for VF 4 For instruction:   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
  LV: Found an estimated cost of 0 for VF 4 For instruction:   %nz.055 = phi i32 [ 0, %entry ], [ %or, %if.end ]
  LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx = getelementptr inbounds i16, i16* %dct, i64 %indvars.iv
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %0 = load i16, i16* %arrayidx, align 2, !tbaa !3
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv = sext i16 %0 to i32
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %cmp1 = icmp sgt i16 %0, 0
  LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx4 = getelementptr inbounds i16, i16* %bias, i64 %indvars.iv
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %1 = load i16, i16* %arrayidx4, align 2, !tbaa !3
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv5 = zext i16 %1 to i32
  LV: Found an estimated cost of 4 for VF 4 For instruction:   br i1 %cmp1, label %if.then, label %if.else
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %sub = sub nsw i32 %conv5, %conv
  LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx22 = getelementptr inbounds i16, i16* %mf, i64 %indvars.iv
  LV: Found an estimated cost of 4 for VF 4 For instruction:   %4 = load i16, i16* %arrayidx22, align 2, !tbaa !3
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv23 = zext i16 %4 to i32
  LV: Found an estimated cost of 2 for VF 4 For instruction:   %mul24 = mul nsw i32 %sub, %conv23
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %5 = lshr i32 %mul24, 16
  LV: Found an estimated cost of 2 for VF 4 For instruction:   %6 = trunc i32 %5 to i16
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv27 = sub i16 0, %6
  LV: Found an estimated cost of 0 for VF 4 For instruction:   br label %if.end
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %add = add nsw i32 %conv5, %conv
  LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx10 = getelementptr inbounds i16, i16* %mf, i64 %indvars.iv
  LV: Found an estimated cost of 4 for VF 4 For instruction:   %2 = load i16, i16* %arrayidx10, align 2, !tbaa !3
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv11 = zext i16 %2 to i32
  LV: Found an estimated cost of 2 for VF 4 For instruction:   %mul = mul nsw i32 %add, %conv11
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %3 = lshr i32 %mul, 16
  LV: Found an estimated cost of 2 for VF 4 For instruction:   %conv12 = trunc i32 %3 to i16
  LV: Found an estimated cost of 0 for VF 4 For instruction:   br label %if.end
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %storemerge = phi i16 [ %conv27, %if.else ], [ %conv12, %if.then ]
  LV: Found an estimated cost of 1 for VF 4 For instruction:   store i16 %storemerge, i16* %arrayidx, align 2, !tbaa !3
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %conv32 = sext i16 %storemerge to i32
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %or = or i32 %nz.055, %conv32
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond.not = icmp eq i64 %indvars.iv.next, 16
  LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !7
  LV: Vector loop of width 4 costs: 9.
  LV: Selecting VF: 4.

So the cost of all that extra branching is accounted for in either the `br i1 %cmp1` or the costs of the loads? (For what it's worth, the per-instruction costs above sum to 38, i.e. 9.5 per scalar iteration at VF 4, which looks like where the reported 9 comes from.) The cost of any branch is usually 0 in LLVM, but adding this many difficult-to-predict branches into an inner loop is going to cause problems, no matter how good the core is at predicting them. The cost of the predicated scalarized loads comes from getMemInstScalarizationCost, as far as I understand.

That is what I thought this "useEmulatedMaskMemRefHack" was protecting against - the fact that the costs from getMemInstScalarizationCost for predicated loads/stores weren't really good enough, and that the vplan for predication can end up quite different from the original code, while at present the costs are all just added up from the original instructions. I'm surprised this doesn't come up in more cases, to be honest. The cases we had where this was making things worse were not as widespread as I would have imagined they would be. But the -Os vectorization and low trip counts are pretty significant regressions.

The best long-term fix for this (that doesn't introduce other hacks) might be to properly implement a vplan-based cost model in the vectorizer, so that it is really adding up the costs of the things the vectorizer will produce, not trying to guess them from the original instructions. I suspect we may also need something to add a cost for all the predicated branching, although I'm not sure what exactly (what //is// the cost of a branch? :) ). There will be a point where the benefit of vectorization makes some branching profitable, so it would be great to remove the hack if we can do so.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114779/new/

https://reviews.llvm.org/D114779


