[llvm] [AMDGPU] Prefer vector i8s in PHI Nodes (PR #91016)

via llvm-commits llvm-commits at lists.llvm.org
Sat May 4 06:31:20 PDT 2024


llvmbot wrote:



@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

<details>
<summary>Changes</summary>

After https://github.com/llvm/llvm-project/pull/66838, vectors of i8s become preferable (relative to their scalarized equivalents) as incoming values for PHI nodes. Since this work addresses that case, I have built it on top of that PR. The changes to AMDGPULateCodeGenPrepare can be ignored here, as this PR is only meant for the other changes; for this work, the changes begin at [2049a61](https://github.com/llvm/llvm-project/pull/91016/commits/2049a61e21850072f85a10461cb51d1f0e0df6cc).

This PR is a draft / WIP while I work on test coverage and confirm there are no oddities in our current cost model once i8s are included. However, before taking that plunge, I wanted to see if there are any high-level concerns with the approach.

Conceptually, this work vectorizes i8 PHIs by imposing a cost penalty on the scalarized version. I have also reduced the cost of i8 shuffles, since this facilitates more PHI vectorization and the shuffles should at least be matched to perm.
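
To make the intended tradeoff concrete, here is a hypothetical loop-carried example (the function, names, and values are illustrative, not taken from the new tests): with the penalty on scalarized i8 PHIs, SLP should keep the loop-carried value as a single <4 x i8> PHI rather than four scalar i8 PHIs.

```llvm
define void @loop_carried_v4i8(ptr addrspace(1) %p, i32 %n) {
entry:
  %init = load <4 x i8>, ptr addrspace(1) %p
  br label %loop

loop:
  ; Kept as one <4 x i8> PHI; the scalarized form (four i8 PHIs plus
  ; extracts/inserts) is what now receives the cost penalty.
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %v = phi <4 x i8> [ %init, %entry ], [ %v.next, %loop ]
  %v.next = add <4 x i8> %v, <i8 1, i8 1, i8 1, i8 1>
  %i.next = add i32 %i, 1
  %done = icmp eq i32 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  store <4 x i8> %v.next, ptr addrspace(1) %p
  ret void
}
```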

After the mentioned PR, we will work around the normal SelectionDAG legalization for vector i8 PHIs, so in this work we also bypass the legalization checks when initially deciding whether or not to vectorize the type (GCNTTIImpl::getNumberOfParts).
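
The reason the legalization check can be relaxed is that the AMDGPULateCodeGenPrepare changes pack illegal i8 vectors into legal types for the cross-block copy. A rough sketch of the resulting IR for a bitsize-exact case (hypothetical example, not one of the patch's tests):

```llvm
define <4 x i8> @cross_block_v4i8(ptr addrspace(1) %in) {
bb0:
  %v = load <4 x i8>, ptr addrspace(1) %in
  %v.pack = bitcast <4 x i8> %v to i32          ; a single legal i32 crosses the block
  br label %bb1

bb1:
  %v.unpack = bitcast i32 %v.pack to <4 x i8>   ; restored before the use
  %r = add <4 x i8> %v.unpack, <i8 1, i8 1, i8 1, i8 1>
  ret <4 x i8> %r
}
```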

---

Patch is 259.98 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91016.diff


18 Files Affected:

- (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+13) 
- (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+4) 
- (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+4) 
- (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+6) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp (+331-2) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2-2) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+43-7) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h (+3) 
- (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+1-1) 
- (modified) llvm/test/Analysis/CostModel/AMDGPU/shufflevector.ll (+108-108) 
- (added) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+427) 
- (added) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-loop-carried.ll (+52) 
- (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-break-large-phis.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+12-12) 
- (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+8-16) 
- (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+176-1757) 
- (added) llvm/test/CodeGen/AMDGPU/vni8-loop-carried.ll (+46) 
- (added) llvm/test/Transforms/SLPVectorizer/AMDGPU/vectorize-i8.ll (+366) 


``````````diff
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 58c69ac939763a..1fccc460ed74ad 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -884,6 +884,10 @@ class TargetTransformInfo {
                                            bool Insert, bool Extract,
                                            TTI::TargetCostKind CostKind) const;
 
+  /// Estimate the overhead of scalarizing a PHI instruction.
+  InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+                                              VectorType *VTy) const;
+
   /// Estimate the overhead of scalarizing an instructions unique
   /// non-constant operands. The (potentially vector) types to use for each of
   /// argument are passes via Tys.
@@ -1912,6 +1916,9 @@ class TargetTransformInfo::Concept {
   getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                    ArrayRef<Type *> Tys,
                                    TargetCostKind CostKind) = 0;
+
+  virtual InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+                                                      VectorType *VTy) = 0;
   virtual bool supportsEfficientVectorElementLoadStore() = 0;
   virtual bool supportsTailCalls() = 0;
   virtual bool supportsTailCallFor(const CallBase *CB) = 0;
@@ -2441,6 +2448,12 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.getScalarizationOverhead(Ty, DemandedElts, Insert, Extract,
                                          CostKind);
   }
+
+  InstructionCost getPHIScalarizationOverhead(Type *ScalarTy,
+                                              VectorType *VTy) override {
+    return Impl.getPHIScalarizationOverhead(ScalarTy, VTy);
+  }
+
   InstructionCost
   getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                    ArrayRef<Type *> Tys,
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 5b40e49714069f..66f3b01e0a3cda 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -363,6 +363,10 @@ class TargetTransformInfoImplBase {
     return 0;
   }
 
+  InstructionCost getPHIScalarizationOverhead(Type *ScalarTy, VectorType *VTy) {
+    return 0;
+  }
+
   InstructionCost
   getOperandsScalarizationOverhead(ArrayRef<const Value *> Args,
                                    ArrayRef<Type *> Tys,
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 06a19c75cf873a..4b4101a1e26940 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -801,6 +801,10 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
                                              CostKind);
   }
 
+  InstructionCost getPHIScalarizationOverhead(Type *ScalarTy, VectorType *VTy) {
+    return 0;
+  }
+
   /// Estimate the overhead of scalarizing an instructions unique
   /// non-constant operands. The (potentially vector) types to use for each of
   /// argument are passes via Tys.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 33c899fe889990..e351f3895ad507 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -585,6 +585,12 @@ InstructionCost TargetTransformInfo::getScalarizationOverhead(
                                            CostKind);
 }
 
+InstructionCost
+TargetTransformInfo::getPHIScalarizationOverhead(Type *ScalarTy,
+                                                 VectorType *VTy) const {
+  return TTIImpl->getPHIScalarizationOverhead(ScalarTy, VTy);
+}
+
 InstructionCost TargetTransformInfo::getOperandsScalarizationOverhead(
     ArrayRef<const Value *> Args, ArrayRef<Type *> Tys,
     TTI::TargetCostKind CostKind) const {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
index 69fdeaebe0a018..d7d2ebff03b6b7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
@@ -81,6 +81,88 @@ class AMDGPULateCodeGenPrepare
   bool visitLoadInst(LoadInst &LI);
 };
 
+class ConversionCandidateInfo {
+private:
+  // The instruction which defined the original virtual register used across
+  // blocks
+  Instruction *LiveRegDef;
+  // The original type
+  Type *OriginalType;
+  // The desired type
+  Type *NewType;
+  // The instruction sequence that converts the virtual register, to be used
+  // instead of the original
+  Instruction *Converted = nullptr;
+  // The builder used to build the conversion instruction
+  IRBuilder<> ConvertBuilder;
+
+public:
+  // The instruction which defined the original virtual register used across
+  // blocks
+  Instruction *getLiveRegDef() { return LiveRegDef; }
+  // The original type
+  Type *getOriginalType() { return OriginalType; }
+  // The desired type
+  Type *getNewType() { return NewType; }
+  void setNewType(Type *NewType) { this->NewType = NewType; }
+  // The instruction that converts the virtual register, to be used instead of
+  // the original
+  Instruction *getConverted() { return Converted; }
+  void setConverted(Instruction *Converted) { this->Converted = Converted; }
+  // The builder used to build the conversion instruction
+  IRBuilder<> &getConvertBuilder() { return ConvertBuilder; }
+  // Do we have an instruction sequence which converts the original virtual
+  // register?
+  bool hasConverted() { return Converted != nullptr; }
+
+  ConversionCandidateInfo(Instruction *LiveRegDef, BasicBlock *InsertBlock,
+                          BasicBlock::iterator InsertPt)
+      : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+        ConvertBuilder(InsertBlock, InsertPt) {}
+  ConversionCandidateInfo(Instruction *LiveRegDef, Type *NewType,
+                          BasicBlock *InsertBlock,
+                          BasicBlock::iterator InsertPt)
+      : LiveRegDef(LiveRegDef), OriginalType(LiveRegDef->getType()),
+        NewType(NewType), ConvertBuilder(InsertBlock, InsertPt) {}
+};
+
+typedef std::pair<Instruction *, BasicBlock *> IncomingPair;
+typedef std::pair<Instruction *, SmallVector<IncomingPair, 4>> PHIUpdateInfo;
+
+class LiveRegOptimizer {
+private:
+  Module *Mod = nullptr;
+  const DataLayout *DL = nullptr;
+  // The scalar type to convert to
+  Type *ConvertToScalar;
+  // Holds the collection of PHIs with their pending new operands
+  SmallVector<PHIUpdateInfo, 4> PHIUpdater;
+
+public:
+  // Should the def of the instruction be converted if it is live across blocks
+  bool shouldReplaceUses(const Instruction &I);
+  // Convert the virtual register to the compatible vector of legal type
+  void convertToOptType(ConversionCandidateInfo &LR);
+  // Convert the virtual register back to the original type, stripping away
+  // the MSBs in cases where there was an imperfect fit (e.g. v2i32 -> v7i8)
+  void convertFromOptType(ConversionCandidateInfo &LR);
+  // Get a vector of desired scalar type that is compatible with the original
+  // vector. In cases where there is no bitsize equivalent using a legal vector
+  // type, we pad the MSBs (e.g. v7i8 -> v2i32)
+  Type *getCompatibleType(Instruction *InstToConvert);
+  // Find and replace uses of the virtual register in other blocks with a
+  // newly produced virtual register of legal type
+  bool replaceUses(Instruction &I);
+  // Replace the collected PHIs with newly produced incoming values. Replacement
+  // is only done if we have a replacement for each original incoming value.
+  bool replacePHIs();
+
+  LiveRegOptimizer(Module *Mod) : Mod(Mod) {
+    DL = &Mod->getDataLayout();
+    ConvertToScalar = Type::getInt32Ty(Mod->getContext());
+  }
+};
+
 } // end anonymous namespace
 
 bool AMDGPULateCodeGenPrepare::doInitialization(Module &M) {
@@ -102,14 +184,261 @@ bool AMDGPULateCodeGenPrepare::runOnFunction(Function &F) {
   AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
   UA = &getAnalysis<UniformityInfoWrapperPass>().getUniformityInfo();
 
+  // "Optimize" the virtual regs that cross basic block boundaries. When
+  // building the SelectionDAG, vectors of illegal types that cross basic blocks
+  // will be scalarized and widened, with each scalar living in its
+  // own physical register. To work around this, this optimization converts the
+  // vectors to equivalent vectors of legal type (which are converted back
+  // before uses in subsequent blocks), to pack the bits into fewer physical
+  // registers (used in CopyToReg/CopyFromReg pairs).
+  LiveRegOptimizer LRO(Mod);
+
   bool Changed = false;
   for (auto &BB : F)
-    for (Instruction &I : llvm::make_early_inc_range(BB))
+    for (Instruction &I : make_early_inc_range(BB)) {
       Changed |= visit(I);
+      if (!LRO.shouldReplaceUses(I))
+        continue;
+      Changed |= LRO.replaceUses(I);
+    }
 
+  Changed |= LRO.replacePHIs();
   return Changed;
 }
 
+bool LiveRegOptimizer::replaceUses(Instruction &I) {
+  bool MadeChange = false;
+
+  struct ConvertUseInfo {
+    Instruction *Converted;
+    SmallVector<Instruction *, 4> Users;
+  };
+  DenseMap<BasicBlock *, ConvertUseInfo> InsertedConversionMap;
+
+  ConversionCandidateInfo FromCCI(&I, I.getParent(),
+                                  std::next(I.getIterator()));
+  FromCCI.setNewType(getCompatibleType(FromCCI.getLiveRegDef()));
+  for (auto IUser = I.user_begin(); IUser != I.user_end(); IUser++) {
+
+    Instruction *UserInst = cast<Instruction>(*IUser);
+    if (UserInst->getParent() != I.getParent() || isa<PHINode>(UserInst)) {
+      LLVM_DEBUG(dbgs() << *UserInst << "\n\tUses "
+                        << *FromCCI.getOriginalType()
+                        << " from previous block. Needs conversion\n");
+      convertToOptType(FromCCI);
+      if (!FromCCI.hasConverted())
+        continue;
+      // If it is a PHI node, just create and collect the new operand. We can
+      // only replace the PHI node once we have converted all the operands
+      if (auto PHI = dyn_cast<PHINode>(UserInst)) {
+        for (unsigned Idx = 0; Idx < PHI->getNumIncomingValues(); Idx++) {
+          Value *IncVal = PHI->getIncomingValue(Idx);
+          if (&I == dyn_cast<Instruction>(IncVal)) {
+            BasicBlock *IncBlock = PHI->getIncomingBlock(Idx);
+            auto PHIOps =
+                find_if(PHIUpdater, [&UserInst](PHIUpdateInfo &Entry) {
+                  return Entry.first == UserInst;
+                });
+
+            if (PHIOps == PHIUpdater.end())
+              PHIUpdater.push_back(
+                  {UserInst, {{FromCCI.getConverted(), IncBlock}}});
+            else
+              PHIOps->second.push_back({FromCCI.getConverted(), IncBlock});
+
+            break;
+          }
+        }
+        continue;
+      }
+
+      // Do not create multiple conversion sequences if there are multiple
+      // uses in the same block
+      if (InsertedConversionMap.contains(UserInst->getParent())) {
+        InsertedConversionMap[UserInst->getParent()].Users.push_back(UserInst);
+        LLVM_DEBUG(dbgs() << "\tUser already has access to converted def\n");
+        continue;
+      }
+
+      ConversionCandidateInfo ToCCI(FromCCI.getConverted(), I.getType(),
+                                    UserInst->getParent(),
+                                    UserInst->getParent()->getFirstNonPHIIt());
+      convertFromOptType(ToCCI);
+      assert(ToCCI.hasConverted());
+      InsertedConversionMap[UserInst->getParent()] = {ToCCI.getConverted(),
+                                                      {UserInst}};
+    }
+  }
+
+  // Replace the uses with the converted value in a separate loop that is not
+  // dependent upon the state of the uses
+  for (auto &Entry : InsertedConversionMap) {
+    for (auto &UserInst : Entry.second.Users) {
+      LLVM_DEBUG(dbgs() << *UserInst
+                        << "\n\tNow uses: " << *Entry.second.Converted << '\n');
+      UserInst->replaceUsesOfWith(&I, Entry.second.Converted);
+      MadeChange = true;
+    }
+  }
+  return MadeChange;
+}
+
+bool LiveRegOptimizer::replacePHIs() {
+  bool MadeChange = false;
+  for (auto Ele : PHIUpdater) {
+    auto [ThePHIInst, NewPHINodeOps] = Ele;
+    LLVM_DEBUG(dbgs() << "Attempting to replace: " << *ThePHIInst << '\n');
+    // If we have converted all the required operands, then do the replacement
+    if (cast<PHINode>(ThePHIInst)->getNumIncomingValues() ==
+        NewPHINodeOps.size()) {
+      IRBuilder<> Builder(Ele.first);
+      auto NPHI = Builder.CreatePHI(NewPHINodeOps[0].first->getType(),
+                                    NewPHINodeOps.size());
+      for (auto IncVals : NewPHINodeOps) {
+        NPHI->addIncoming(IncVals.first, IncVals.second);
+        LLVM_DEBUG(dbgs() << "  Using: " << *IncVals.first
+                          << "  For: " << IncVals.second->getName() << '\n');
+      }
+      LLVM_DEBUG(dbgs() << "Sucessfully replaced with " << *NPHI << '\n');
+      ConversionCandidateInfo ToCCI(
+          NPHI, ThePHIInst->getType(), ThePHIInst->getParent(),
+          ThePHIInst->getParent()->getFirstNonPHIIt());
+      convertFromOptType(ToCCI);
+      assert(ToCCI.hasConverted());
+      Ele.first->replaceAllUsesWith(ToCCI.getConverted());
+      // The old PHI is no longer used
+      ThePHIInst->eraseFromParent();
+      MadeChange = true;
+    }
+  }
+  return MadeChange;
+}
+
+Type *LiveRegOptimizer::getCompatibleType(Instruction *InstToConvert) {
+  Type *OriginalType = InstToConvert->getType();
+  assert(OriginalType->getScalarSizeInBits() <=
+         ConvertToScalar->getScalarSizeInBits());
+  VectorType *VTy = dyn_cast<VectorType>(OriginalType);
+  if (!VTy)
+    return ConvertToScalar;
+
+  TypeSize OriginalSize = DL->getTypeSizeInBits(VTy);
+  TypeSize ConvertScalarSize = DL->getTypeSizeInBits(ConvertToScalar);
+  unsigned ConvertEltCount =
+      (OriginalSize + ConvertScalarSize - 1) / ConvertScalarSize;
+
+  if (OriginalSize <= ConvertScalarSize)
+    return IntegerType::get(Mod->getContext(), ConvertScalarSize);
+
+  return VectorType::get(Type::getIntNTy(Mod->getContext(), ConvertScalarSize),
+                         ElementCount::getFixed(ConvertEltCount));
+}
+
+void LiveRegOptimizer::convertToOptType(ConversionCandidateInfo &LR) {
+  if (LR.hasConverted()) {
+    LLVM_DEBUG(dbgs() << "\tAlready has converted def\n");
+    return;
+  }
+
+  VectorType *VTy = cast<VectorType>(LR.getOriginalType());
+  Type *NewTy = LR.getNewType();
+
+  TypeSize OriginalSize = DL->getTypeSizeInBits(VTy);
+  TypeSize NewSize = DL->getTypeSizeInBits(NewTy);
+
+  auto &Builder = LR.getConvertBuilder();
+  Value *V = cast<Value>(LR.getLiveRegDef());
+  // If there is a bitsize match, we can fit the old vector into a new vector of
+  // desired type
+  if (OriginalSize == NewSize) {
+    LR.setConverted(cast<Instruction>(Builder.CreateBitCast(V, NewTy)));
+    LLVM_DEBUG(dbgs() << "\tConverted def to " << *LR.getConverted()->getType()
+                      << '\n');
+    return;
+  }
+
+  // If there is a bitsize mismatch, we must use a wider vector
+  assert(NewSize > OriginalSize);
+  ElementCount ExpandedVecElementCount =
+      ElementCount::getFixed(NewSize / VTy->getScalarSizeInBits());
+
+  SmallVector<int, 8> ShuffleMask;
+  for (unsigned I = 0; I < VTy->getElementCount().getFixedValue(); I++)
+    ShuffleMask.push_back(I);
+
+  for (uint64_t I = VTy->getElementCount().getFixedValue();
+       I < ExpandedVecElementCount.getFixedValue(); I++)
+    ShuffleMask.push_back(VTy->getElementCount().getFixedValue());
+
+  Instruction *ExpandedVec =
+      cast<Instruction>(Builder.CreateShuffleVector(V, ShuffleMask));
+  LR.setConverted(cast<Instruction>(Builder.CreateBitCast(ExpandedVec, NewTy)));
+  LLVM_DEBUG(dbgs() << "\tConverted def to " << *LR.getConverted()->getType()
+                    << '\n');
+  return;
+}
+
+void LiveRegOptimizer::convertFromOptType(ConversionCandidateInfo &LRC) {
+  Type *OTy = LRC.getOriginalType();
+  VectorType *NewVTy = cast<VectorType>(LRC.getNewType());
+
+  TypeSize OriginalSize = DL->getTypeSizeInBits(OTy);
+  TypeSize NewSize = DL->getTypeSizeInBits(NewVTy);
+
+  auto &Builder = LRC.getConvertBuilder();
+  Value *V = cast<Value>(LRC.getLiveRegDef());
+  // If there is a bitsize match, we simply convert back to the original type
+  if (OriginalSize == NewSize) {
+    LRC.setConverted(cast<Instruction>(Builder.CreateBitCast(V, NewVTy)));
+    LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted()
+                      << '\n');
+    return;
+  }
+
+  if (!OTy->isVectorTy()) {
+    Instruction *Trunc = cast<Instruction>(Builder.CreateTrunc(
+        LRC.getLiveRegDef(), IntegerType::get(Mod->getContext(), NewSize)));
+    Instruction *Original =
+        cast<Instruction>(Builder.CreateBitCast(Trunc, NewVTy));
+    LRC.setConverted(cast<Instruction>(Original));
+    LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted()
+                      << '\n');
+    return;
+  }
+
+  // If there is a bitsize mismatch, we have used a wider vector and must strip
+  // the MSBs to convert back to the original type
+  assert(OriginalSize > NewSize);
+  ElementCount ExpandedVecElementCount =
+      ElementCount::getFixed(OriginalSize / NewVTy->getScalarSizeInBits());
+  VectorType *ExpandedVT = VectorType::get(
+      Type::getIntNTy(Mod->getContext(), NewVTy->getScalarSizeInBits()),
+      ExpandedVecElementCount);
+  Instruction *Converted =
+      cast<Instruction>(Builder.CreateBitCast(LRC.getLiveRegDef(), ExpandedVT));
+
+  unsigned NarrowElementCount = NewVTy->getElementCount().getFixedValue();
+  SmallVector<int, 8> ShuffleMask(NarrowElementCount);
+  std::iota(ShuffleMask.begin(), ShuffleMask.end(), 0);
+
+  Instruction *NarrowVec =
+      cast<Instruction>(Builder.CreateShuffleVector(Converted, ShuffleMask));
+  LRC.setConverted(cast<Instruction>(NarrowVec));
+  LLVM_DEBUG(dbgs() << "\tProduced for user: " << *LRC.getConverted() << '\n');
+  return;
+}
+
+bool LiveRegOptimizer::shouldReplaceUses(const Instruction &I) {
+  // Vectors of illegal types are copied across blocks in an inefficient
+  // manner: they are scalarized and widened to legal scalars. In such cases,
+  // we can do better by using legal vector types.
+  Type *IType = I.getType();
+  return IType->isVectorTy() && IType->getScalarSizeInBits() < 16 &&
+         !I.getType()->getScalarType()->isPointerTy();
+}
+
 bool AMDGPULateCodeGenPrepare::canWidenScalarExtLoad(LoadInst &LI) const {
   unsigned AS = LI.getPointerAddressSpace();
   // Skip non-constant address space.
@@ -119,7 +448,7 @@ bool AMDGPULateCodeGenPrepare::canWidenScalarExtLoad(LoadInst &LI) const {
   // Skip non-simple loads.
   if (!LI.isSimple())
     return false;
-  auto *Ty = LI.getType();
+  Type *Ty = LI.getType();
   // Skip aggregate types.
   if (Ty->isAggregateType())
     return false;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 305a6c8c3b9262..c15481336075e6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -1150,10 +1150,10 @@ bool GCNPassConfig::addPreISel() {
   AMDGPUPassConfig::addPreISel();
 
   if (TM->getOptLevel() > CodeGenOptLevel::None)
-    addPass(createAMDGPULateCodeGenPreparePass());
+    addPass(createSinkingPass());
 
   if (TM->getOptLevel() > CodeGenOptLevel::None)
-    addPass(createSinkingPass());
+    addPass(createAMDGPULateCodeGenPreparePass());
 
   // Merge divergent exit nodes. StructurizeCFG won't recognize the multi-exit
   // regions formed by them.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 84320d296a037b..2fe2d506fa4308 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -306,6 +306,18 @@ bool GCNTTIImpl::hasBranchDivergence(const Function *F) const {
   return !F || !ST->isSingleLaneExecution(*F);
 }
 
+unsigned GCNTTIImpl::getNumberOfParts(Type *Tp) const {
+  if (auto VTy = dyn_cast<FixedVectorType>(...
[truncated]

``````````
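
For the imperfect-fit path in convertToOptType / convertFromOptType above (e.g. v7i8), the IR shape is roughly the following sketch (illustrative only; the function and value names are hypothetical):

```llvm
define <7 x i8> @cross_block_v7i8(ptr addrspace(1) %in) {
bb0:
  %v = load <7 x i8>, ptr addrspace(1) %in
  ; pad <7 x i8> to <8 x i8> so it bitcasts cleanly to a legal <2 x i32>
  %v.pad = shufflevector <7 x i8> %v, <7 x i8> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %v.pack = bitcast <8 x i8> %v.pad to <2 x i32>
  br label %bb1

bb1:
  ; strip the padding lanes before the original-typed use
  %v.wide = bitcast <2 x i32> %v.pack to <8 x i8>
  %v.narrow = shufflevector <8 x i8> %v.wide, <8 x i8> poison, <7 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6>
  ret <7 x i8> %v.narrow
}
```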

</details>


https://github.com/llvm/llvm-project/pull/91016

