[PATCH] D143624: Inlining: Run the legacy AlwaysInliner before the regular inliner.

Amara Emerson via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon May 1 17:16:40 PDT 2023


aemerson added a comment.
Herald added a subscriber: jplehr.

@dmgreen I've been looking at this test again trying to see what's missing. The problem now is that only a VF of 4 is chosen. In the good case, instcombine/simplifyCFG run so that the saturation logic simplifies down to an `smin` intrinsic. After this change, `__SSAT()` is inlined first, and we then have:

  target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
  target triple = "aarch64-linux-gnu"
  
  define void @arm_mult_q15(ptr %pSrcA, ptr %pSrcB, ptr noalias %pDst, i32 %blockSize) {
  entry:
    br label %while.cond
  
  while.cond:                                       ; preds = %while.body, %entry
    %pSrcB.addr.0 = phi ptr [ %pSrcB, %entry ], [ %incdec.ptr1, %while.body ]
    %pDst.addr.0 = phi ptr [ %pDst, %entry ], [ %incdec.ptr4, %while.body ]
    %pSrcA.addr.0 = phi ptr [ %pSrcA, %entry ], [ %incdec.ptr, %while.body ]
    %blkCnt.0 = phi i32 [ %blockSize, %entry ], [ %dec, %while.body ]
    %cmp.not = icmp eq i32 %blkCnt.0, 0
    br i1 %cmp.not, label %while.end, label %while.body
  
  while.body:                                       ; preds = %while.cond
    %incdec.ptr = getelementptr inbounds i16, ptr %pSrcA.addr.0, i32 1
    %0 = load i16, ptr %pSrcA.addr.0, align 2
    %conv = sext i16 %0 to i32
    %incdec.ptr1 = getelementptr inbounds i16, ptr %pSrcB.addr.0, i32 1
    %1 = load i16, ptr %pSrcB.addr.0, align 2
    %conv2 = sext i16 %1 to i32
    %mul = mul nsw i32 %conv, %conv2
    %shr = ashr i32 %mul, 15
    %cmp4.i = icmp sgt i32 %shr, 32767
    %switch.i = icmp ult i1 %cmp4.i, true
    %spec.select.i = select i1 %switch.i, i32 %shr, i32 32767
    %conv3 = trunc i32 %spec.select.i to i16
    %incdec.ptr4 = getelementptr inbounds i16, ptr %pDst.addr.0, i32 1
    store i16 %conv3, ptr %pDst.addr.0, align 2
    %dec = add i32 %blkCnt.0, -1
    br label %while.cond
  
  while.end:                                        ; preds = %while.cond
    ret void
  }
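
For comparison, in the good case the clamp in the loop body collapses to an `llvm.smin` call, roughly like this (a sketch reusing the value names above; the actual output may differ slightly):

  %mul   = mul nsw i32 %conv, %conv2
  %shr   = ashr i32 %mul, 15
  %smin  = call i32 @llvm.smin.i32(i32 %shr, i32 32767)
  %conv3 = trunc i32 %smin to i16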

These are the instructions from the callee that should now be combined into `smin`:

  %cmp4.i = icmp sgt i32 %shr, 32767
  %switch.i = icmp ult i1 %cmp4.i, true
  %spec.select.i = select i1 %switch.i, i32 %shr, i32 32767

... except that, due to the surrounding instructions, the first icmp is optimized into `icmp sgt i32 %mul, 1073741823` by `InstCombinerImpl::foldICmpInstWithConstant()`.
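
Concretely, the fold hoists the constant through the `ashr` (a sketch using the names from the IR above):

  ; before
  %shr    = ashr i32 %mul, 15
  %cmp4.i = icmp sgt i32 %shr, 32767

  ; after foldICmpInstWithConstant()
  %shr    = ashr i32 %mul, 15               ; still feeds the select
  %cmp4.i = icmp sgt i32 %mul, 1073741823   ; no longer compares %shr

After that, the compare and the select no longer share an operand, so as far as I can tell matchSelectPattern has no min/max left to find.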

This breaks the smin recognition, and I'm not sure what the best approach to fixing it is. InstCombine already has this chunk of code to try to avoid messing with compares that might form min/max patterns, but it expects further simplification to fire:

  // Test if the ICmpInst instruction is used exclusively by a select as
  // part of a minimum or maximum operation. If so, refrain from doing
  // any other folding. This helps out other analyses which understand
  // non-obfuscated minimum and maximum idioms, such as ScalarEvolution
  // and CodeGen. And in this case, at least one of the comparison
  // operands has at least one user besides the compare (the select),
  // which would often largely negate the benefit of folding anyway.
  //
  // Do the same for the other patterns recognized by matchSelectPattern.
  if (I.hasOneUse())
    if (SelectInst *SI = dyn_cast<SelectInst>(I.user_back())) {
      Value *A, *B;
      SelectPatternResult SPR = matchSelectPattern(SI, A, B);
      if (SPR.Flavor != SPF_UNKNOWN)
        return nullptr;
    }
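
If I'm reading it right, that guard doesn't help here because after inlining the compare's only user is the `icmp ult i1 ..., true` (effectively a not of the compare), not the select itself, so the `dyn_cast<SelectInst>` fails and the fold goes ahead:

  %cmp4.i = icmp sgt i32 %shr, 32767                          ; I
  %switch.i = icmp ult i1 %cmp4.i, true                       ; I.user_back() -> not a SelectInst
  %spec.select.i = select i1 %switch.i, i32 %shr, i32 32767   ; the select is one step removed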

Any ideas? I'd really like to get this inliner change in because it's fundamentally a good change to have.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143624/new/

https://reviews.llvm.org/D143624


