[llvm] [AMDGPU] Split struct kernel arguments (PR #133786)

Yaxun Liu via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 31 12:57:32 PDT 2025


https://github.com/yxsamliu created https://github.com/llvm/llvm-project/pull/133786

AMDGPU backend has a pass which does transformations to allow firmware to preload kernel arguments into sgpr's to avoid loading them from kernel arg segment. This pass can improve kernel latency but it cannot preload struct-type kernel arguments.

This patch adds a pass to AMDGPU backend to split and flat struct-type kernel arguments so that later passes can preload them into sgpr's.

Basically, the pass collects load or GEP/load instructions with struct-type kenel args as oprands and makes them new arguments as the kernel. If all uses of a struct-type kernel arg can be replaced, it will do the replacements and create a new kernel with the new signature, and translate all instructions of the old kernel to use the new arguments in the new kernel. It adds a function attribute to encode the mapping from the new kernel argument index to the old kernel argument index and offset. The streamer will generate kernel argument metadata based on that and runtime will process
the kernel arguments based on the metadata.

The pass is disabled by default and can be enabled by LLVM option `-amdgpu-enable-split-kernel-args`.

>From 93b2c3e8b350d09296ca5c7b77830597350cf47a Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <yaxun.liu at amd.com>
Date: Fri, 13 Sep 2024 16:52:57 +0800
Subject: [PATCH] [AMDGPU] Split struct kernel arguments

AMDGPU backend has a pass which does transformations to allow
firmware to preload kernel arguments into sgpr's to avoid
loading them from kernel arg segment. This pass can improve
kernel latency but it cannot preload struct-type kernel
arguments.

This patch adds a pass to AMDGPU backend to split and flat
struct-type kernel arguments so that later passes can
preload them into sgpr's.

Basically, the pass collects load or GEP/load instructions
with struct-type kenel args as oprands and makes them
new arguments as the kernel. If all uses of a struct-type
kernel arg can be replaced, it will do the replacements
and create a new kernel with the new signature, and
translate all instructions of the old kernel to use
the new arguments in the new kernel. It adds a function
attribute to encode the mapping from the new kernel
argument index to the old kernel argument index and
offset. The streamer will generate kernel argument
metadata based on that and runtime will process
the kernel arguments based on the metadata.

The pass is disabled by default and can be enabled
by LLVM option `-amdgpu-enable-split-kernel-args`.
---
 llvm/lib/Target/AMDGPU/AMDGPU.h               |   9 +
 .../AMDGPU/AMDGPUHSAMetadataStreamer.cpp      |  43 +-
 .../Target/AMDGPU/AMDGPUHSAMetadataStreamer.h |   3 +-
 llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def |   1 +
 .../AMDGPU/AMDGPUSplitKernelArguments.cpp     | 372 ++++++++++++++++++
 .../lib/Target/AMDGPU/AMDGPUTargetMachine.cpp |   9 +-
 llvm/lib/Target/AMDGPU/CMakeLists.txt         |   1 +
 .../AMDGPU/amdgpu-split-kernel-args.ll        | 120 ++++++
 llvm/test/CodeGen/AMDGPU/llc-pipeline.ll      |   4 +
 9 files changed, 558 insertions(+), 4 deletions(-)
 create mode 100644 llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp
 create mode 100644 llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index a8e4ea9429f50..777390b99c0cc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -125,6 +125,15 @@ struct AMDGPUPromoteKernelArgumentsPass
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
 };
 
+ModulePass *createAMDGPUSplitKernelArgumentsPass();
+void initializeAMDGPUSplitKernelArgumentsPass(PassRegistry &);
+extern char &AMDGPUSplitKernelArgumentsID;
+
+struct AMDGPUSplitKernelArgumentsPass
+    : PassInfoMixin<AMDGPUSplitKernelArgumentsPass> {
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
 ModulePass *createAMDGPULowerKernelAttributesPass();
 void initializeAMDGPULowerKernelAttributesPass(PassRegistry &);
 extern char &AMDGPULowerKernelAttributesID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
index 2991778a1bbc7..d54828e225e12 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.cpp
@@ -357,17 +357,50 @@ void MetadataStreamerMsgPackV4::emitKernelArg(const Argument &Arg,
   Align ArgAlign;
   std::tie(ArgTy, ArgAlign) = getArgumentTypeAlign(Arg, DL);
 
+  // Assuming the argument is not split from struct-type argument by default,
+  // unless we find it in function attribute amdgpu-argument-mapping.
+  unsigned OriginalArgIndex = ~0U;
+  uint64_t OriginalArgOffset = 0;
+  if (Func->hasFnAttribute("amdgpu-argument-mapping")) {
+    StringRef MappingStr = Func->getFnAttribute("amdgpu-argument-mapping").getValueAsString();
+    SmallVector<StringRef, 8> Mappings;
+    MappingStr.split(Mappings, ',');
+    for (const StringRef &Mapping : Mappings) {
+      SmallVector<StringRef, 3> Elements;
+      Mapping.split(Elements, ':');
+      if (Elements.size() != 3)
+        continue;
+
+      unsigned NewArgIndex = 0;
+      unsigned OrigArgIndex = 0;
+      uint64_t OffsetValue = 0;
+      if (Elements[0].getAsInteger(10, NewArgIndex))
+        continue;
+      if (Elements[1].getAsInteger(10, OrigArgIndex))
+        continue;
+      if (Elements[2].getAsInteger(10, OffsetValue))
+        continue;
+
+      if (NewArgIndex == ArgNo) {
+        OriginalArgIndex = OrigArgIndex;
+        OriginalArgOffset = OffsetValue;
+        break;
+      }
+    }
+  }
+
   emitKernelArg(DL, ArgTy, ArgAlign,
                 getValueKind(ArgTy, TypeQual, BaseTypeName), Offset, Args,
                 PointeeAlign, Name, TypeName, BaseTypeName, ActAccQual,
-                AccQual, TypeQual);
+                AccQual, TypeQual, OriginalArgIndex, OriginalArgOffset);
 }
 
 void MetadataStreamerMsgPackV4::emitKernelArg(
     const DataLayout &DL, Type *Ty, Align Alignment, StringRef ValueKind,
     unsigned &Offset, msgpack::ArrayDocNode Args, MaybeAlign PointeeAlign,
     StringRef Name, StringRef TypeName, StringRef BaseTypeName,
-    StringRef ActAccQual, StringRef AccQual, StringRef TypeQual) {
+    StringRef ActAccQual, StringRef AccQual, StringRef TypeQual,
+    unsigned OriginalArgIndex, uint64_t OriginalArgOffset) {
   auto Arg = Args.getDocument()->getMapNode();
 
   if (!Name.empty())
@@ -409,6 +442,12 @@ void MetadataStreamerMsgPackV4::emitKernelArg(
       Arg[".is_pipe"] = Arg.getDocument()->getNode(true);
   }
 
+  // Add original argument index and offset to the metadata
+  if (OriginalArgIndex != ~0U) {
+    Arg[".original_arg_index"] = Arg.getDocument()->getNode(OriginalArgIndex);
+    Arg[".original_arg_offset"] = Arg.getDocument()->getNode(OriginalArgOffset);
+  }
+
   Args.push_back(Arg);
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
index 22dfcb4a4ec1d..312a1747f5c1d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUHSAMetadataStreamer.h
@@ -116,7 +116,8 @@ class LLVM_EXTERNAL_VISIBILITY MetadataStreamerMsgPackV4
                      MaybeAlign PointeeAlign = std::nullopt,
                      StringRef Name = "", StringRef TypeName = "",
                      StringRef BaseTypeName = "", StringRef ActAccQual = "",
-                     StringRef AccQual = "", StringRef TypeQual = "");
+                     StringRef AccQual = "", StringRef TypeQual = "",
+                     unsigned OriginalArgIndex = ~0U, uint64_t OriginalArgOffset = 0);
 
   void emitHiddenKernelArgs(const MachineFunction &MF, unsigned &Offset,
                             msgpack::ArrayDocNode Args) override;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 6a45392b5f099..094346670811c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -29,6 +29,7 @@ MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-remove-incompatible-functions", AMDGPURemoveIncompatibleFunctionsPass(*this))
 MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass(*this))
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
+MODULE_PASS("amdgpu-split-kernel-arguments", AMDGPUSplitKernelArgumentsPass())
 #undef MODULE_PASS
 
 #ifndef MODULE_PASS_WITH_PARAMS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp
new file mode 100644
index 0000000000000..4a025e1806070
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitKernelArguments.cpp
@@ -0,0 +1,372 @@
+//===--- AMDGPUSplitKernelArguments.cpp - Split kernel arguments ----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// \file This pass flats struct-type kernel arguments. It eliminates unused
+// fields and only keeps used fields. The objective is to facilitate preloading
+// of kernel arguments by later passes.
+//
+//===----------------------------------------------------------------------===//
+#include "AMDGPU.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Module.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+
+#define DEBUG_TYPE "amdgpu-split-kernel-arguments"
+
+using namespace llvm;
+
+namespace {
+static llvm::cl::opt<bool> EnableSplitKernelArgs(
+    "amdgpu-enable-split-kernel-args",
+    llvm::cl::desc("Enable splitting of AMDGPU kernel arguments"),
+    llvm::cl::init(false));
+
+class AMDGPUSplitKernelArguments : public ModulePass {
+public:
+  static char ID;
+
+  AMDGPUSplitKernelArguments() : ModulePass(ID) {}
+
+  bool runOnModule(Module &M) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.setPreservesCFG();
+  }
+
+private:
+  bool processFunction(Function &F);
+};
+
+} // end anonymous namespace
+
+bool AMDGPUSplitKernelArguments::processFunction(Function &F) {
+  const DataLayout &DL = F.getParent()->getDataLayout();
+  LLVM_DEBUG(dbgs() << "Entering AMDGPUSplitKernelArguments::processFunction "
+                    << F.getName() << '\n');
+  if (F.isDeclaration()) {
+    LLVM_DEBUG(dbgs() << "Function is a declaration, skipping\n");
+    return false;
+  }
+
+  CallingConv::ID CC = F.getCallingConv();
+  if (CC != CallingConv::AMDGPU_KERNEL || F.arg_empty()) {
+    LLVM_DEBUG(dbgs() << "non-kernel or arg_empty\n");
+    return false;
+  }
+
+  SmallVector<std::tuple<unsigned, unsigned, uint64_t>, 8> NewArgMappings;
+  DenseMap<Argument *, SmallVector<LoadInst *, 8>> ArgToLoadsMap;
+  DenseMap<Argument *, SmallVector<GetElementPtrInst *, 8>> ArgToGEPsMap;
+  SmallVector<Argument *, 8> StructArgs;
+  SmallVector<Type *, 8> NewArgTypes;
+
+  auto convertAddressSpace = [](Type *Ty) -> Type * {
+    if (auto *PtrTy = dyn_cast<PointerType>(Ty)) {
+      if (PtrTy->getAddressSpace() == AMDGPUAS::FLAT_ADDRESS) {
+        return PointerType::get(PtrTy->getContext(), AMDGPUAS::GLOBAL_ADDRESS);
+      }
+    }
+    return Ty;
+  };
+
+  // Collect struct arguments and new argument types
+  unsigned OriginalArgIndex = 0;
+  unsigned NewArgIndex = 0;
+  for (Argument &Arg : F.args()) {
+    LLVM_DEBUG(dbgs() << "Processing argument: " << Arg << "\n");
+    if (Arg.use_empty()) {
+      NewArgTypes.push_back(convertAddressSpace(Arg.getType()));
+      NewArgMappings.push_back(
+          std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      LLVM_DEBUG(dbgs() << "use empty\n");
+      continue;
+    }
+
+    PointerType *PT = dyn_cast<PointerType>(Arg.getType());
+    if (!PT) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not a pointer\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    const bool IsByRef = Arg.hasByRefAttr();
+    if (!IsByRef) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not byref\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    Type *ArgTy = Arg.getParamByRefType();
+    StructType *ST = dyn_cast<StructType>(ArgTy);
+    if (!ST) {
+      NewArgTypes.push_back(Arg.getType());
+      LLVM_DEBUG(dbgs() << "not a struct\n");
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+      ++OriginalArgIndex;
+      continue;
+    }
+
+    bool AllLoadsOrGEPs = true;
+    SmallVector<LoadInst *, 8> Loads;
+    SmallVector<GetElementPtrInst *, 8> GEPs;
+    for (User *U : Arg.users()) {
+      LLVM_DEBUG(dbgs() << "  User: " << *U << "\n");
+      if (auto *LI = dyn_cast<LoadInst>(U)) {
+        Loads.push_back(LI);
+      } else if (auto *GEP = dyn_cast<GetElementPtrInst>(U)) {
+        GEPs.push_back(GEP);
+        for (User *GEPUser : GEP->users()) {
+          LLVM_DEBUG(dbgs() << "    GEP User: " << *GEPUser << "\n");
+          if (auto *GEPLoad = dyn_cast<LoadInst>(GEPUser)) {
+            Loads.push_back(GEPLoad);
+          } else {
+            AllLoadsOrGEPs = false;
+            break;
+          }
+        }
+      } else {
+        AllLoadsOrGEPs = false;
+        break;
+      }
+      if (!AllLoadsOrGEPs)
+        break;
+    }
+    LLVM_DEBUG(dbgs() << "  AllLoadsOrGEPs: "
+                      << (AllLoadsOrGEPs ? "true" : "false") << "\n");
+
+    if (AllLoadsOrGEPs) {
+      StructArgs.push_back(&Arg);
+      ArgToLoadsMap[&Arg] = Loads;
+      ArgToGEPsMap[&Arg] = GEPs;
+      for (LoadInst *LI : Loads) {
+        Type *NewType = convertAddressSpace(LI->getType());
+        NewArgTypes.push_back(NewType);
+
+        // Compute offset
+        uint64_t Offset = 0;
+        if (auto *GEP = dyn_cast<GetElementPtrInst>(LI->getPointerOperand())) {
+          APInt OffsetAPInt(DL.getPointerSizeInBits(), 0);
+          if (GEP->accumulateConstantOffset(DL, OffsetAPInt))
+            Offset = OffsetAPInt.getZExtValue();
+        }
+
+        // Map each new argument to the original argument index and offset
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, Offset));
+        ++NewArgIndex;
+      }
+    } else {
+      NewArgTypes.push_back(convertAddressSpace(Arg.getType()));
+      // Include mapping if indices have changed
+      if (NewArgIndex != OriginalArgIndex)
+        NewArgMappings.push_back(
+            std::make_tuple(NewArgIndex, OriginalArgIndex, 0));
+      ++NewArgIndex;
+    }
+    ++OriginalArgIndex;
+  }
+
+  if (StructArgs.empty())
+    return false;
+
+  // Collect function and return attributes
+  AttributeList OldAttrs = F.getAttributes();
+  AttributeSet FnAttrs = OldAttrs.getFnAttrs();
+  AttributeSet RetAttrs = OldAttrs.getRetAttrs();
+
+  // Create new function type
+  FunctionType *NewFT =
+      FunctionType::get(F.getReturnType(), NewArgTypes, F.isVarArg());
+  Function *NewF =
+      Function::Create(NewFT, F.getLinkage(), F.getAddressSpace(), F.getName());
+  F.getParent()->getFunctionList().insert(F.getIterator(), NewF);
+  NewF->takeName(&F);
+  NewF->setVisibility(F.getVisibility());
+  if (F.hasComdat())
+    NewF->setComdat(F.getComdat());
+  NewF->setDSOLocal(F.isDSOLocal());
+  NewF->setUnnamedAddr(F.getUnnamedAddr());
+  NewF->setCallingConv(F.getCallingConv());
+
+  // Build new parameter attributes
+  SmallVector<AttributeSet, 8> NewArgAttrSets;
+  NewArgIndex = 0;
+  for (Argument &Arg : F.args()) {
+    if (ArgToLoadsMap.count(&Arg)) {
+      for (LoadInst *LI : ArgToLoadsMap[&Arg]) {
+        (void)LI;
+        NewArgAttrSets.push_back(AttributeSet());
+        ++NewArgIndex;
+      }
+    } else {
+      AttributeSet ArgAttrs = OldAttrs.getParamAttrs(Arg.getArgNo());
+      NewArgAttrSets.push_back(ArgAttrs);
+      ++NewArgIndex;
+    }
+  }
+
+  // Build the new AttributeList
+  AttributeList NewAttrList =
+      AttributeList::get(F.getContext(), FnAttrs, RetAttrs, NewArgAttrSets);
+  NewF->setAttributes(NewAttrList);
+
+  // Add the mapping of the new arguments to the old arguments as a function
+  // attribute in the format "NewArgIndex:OriginalArgIndex:Offset,..."
+  std::string MappingStr;
+  for (const auto &Info : NewArgMappings) {
+    unsigned NewArgIdx, OrigArgIdx;
+    uint64_t Offset;
+    std::tie(NewArgIdx, OrigArgIdx, Offset) = Info;
+
+    if (!MappingStr.empty())
+      MappingStr += ",";
+    MappingStr += std::to_string(NewArgIdx) + ":" + std::to_string(OrigArgIdx) +
+                  ":" + std::to_string(Offset);
+  }
+
+  NewF->addFnAttr("amdgpu-argument-mapping", MappingStr);
+
+  LLVM_DEBUG(dbgs() << "New empty function:\n" << *NewF << '\n');
+
+  NewF->splice(NewF->begin(), &F);
+
+  // Map old arguments and loads to new arguments
+  DenseMap<Value *, Value *> VMap;
+  auto NewArgIt = NewF->arg_begin();
+  for (Argument &Arg : F.args()) {
+    if (ArgToLoadsMap.count(&Arg)) {
+      for (LoadInst *LI : ArgToLoadsMap[&Arg]) {
+        std::string OldName = LI->getName().str();
+        LI->setName(OldName + ".old");
+        NewArgIt->setName(OldName);
+        Value *NewArg = &*NewArgIt++;
+        if (isa<PointerType>(NewArg->getType()) &&
+            isa<PointerType>(LI->getType())) {
+          IRBuilder<> Builder(LI);
+          Value *CastedArg = Builder.CreatePointerBitCastOrAddrSpaceCast(
+              NewArg, LI->getType());
+          VMap[LI] = CastedArg;
+        } else {
+          VMap[LI] = NewArg;
+        }
+      }
+      UndefValue *UndefArg = UndefValue::get(Arg.getType());
+      Arg.replaceAllUsesWith(UndefArg);
+    } else {
+      std::string OldName = Arg.getName().str();
+      Arg.setName(OldName + ".old");
+      NewArgIt->setName(OldName);
+      Value *NewArg = &*NewArgIt;
+      if (isa<PointerType>(NewArg->getType()) &&
+          isa<PointerType>(Arg.getType())) {
+        IRBuilder<> Builder(&*NewF->begin()->begin());
+        Value *CastedArg =
+            Builder.CreatePointerBitCastOrAddrSpaceCast(NewArg, Arg.getType());
+        Arg.replaceAllUsesWith(CastedArg);
+      } else {
+        Arg.replaceAllUsesWith(NewArg);
+      }
+      ++NewArgIt;
+    }
+  }
+
+  // Replace LoadInsts with new arguments
+  for (auto &Entry : ArgToLoadsMap) {
+    for (LoadInst *LI : Entry.second) {
+      Value *NewArg = VMap[LI];
+      LI->replaceAllUsesWith(NewArg);
+      LI->eraseFromParent();
+    }
+  }
+
+  // Erase GEPs
+  for (auto &Entry : ArgToGEPsMap) {
+    for (GetElementPtrInst *GEP : Entry.second) {
+      if (GEP->use_empty()) {
+        GEP->eraseFromParent();
+      } else {
+        GEP->replaceAllUsesWith(UndefValue::get(GEP->getType()));
+        GEP->eraseFromParent();
+      }
+    }
+  }
+
+  LLVM_DEBUG(dbgs() << "New function after transformation:\n" << *NewF << '\n');
+
+  F.replaceAllUsesWith(NewF);
+  F.eraseFromParent();
+
+  return true;
+}
+
+bool AMDGPUSplitKernelArguments::runOnModule(Module &M) {
+  if (!EnableSplitKernelArgs)
+    return false;
+  bool Changed = false;
+  SmallVector<Function *, 16> FunctionsToProcess;
+
+  for (Function &F : M) {
+    if (F.isDeclaration())
+      continue;
+    FunctionsToProcess.push_back(&F);
+  }
+
+  for (Function *F : FunctionsToProcess) {
+    if (F->isDeclaration())
+      continue;
+    Changed |= processFunction(*F);
+  }
+
+  return Changed;
+}
+
+INITIALIZE_PASS_BEGIN(AMDGPUSplitKernelArguments, DEBUG_TYPE,
+                      "AMDGPU Split Kernel Arguments", false, false)
+INITIALIZE_PASS_END(AMDGPUSplitKernelArguments, DEBUG_TYPE,
+                    "AMDGPU Split Kernel Arguments", false, false)
+
+char AMDGPUSplitKernelArguments::ID = 0;
+
+ModulePass *llvm::createAMDGPUSplitKernelArgumentsPass() {
+  return new AMDGPUSplitKernelArguments();
+}
+
+PreservedAnalyses AMDGPUSplitKernelArgumentsPass::run(Module &M, ModuleAnalysisManager &AM) {
+  AMDGPUSplitKernelArguments Splitter;
+  bool Changed = Splitter.runOnModule(M);
+
+  if (!Changed)
+    return PreservedAnalyses::all();
+
+  return PreservedAnalyses::none();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 4937b434bc955..f5bb925a95b54 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -517,6 +517,7 @@ extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeAMDGPUAtomicOptimizerPass(*PR);
   initializeAMDGPULowerKernelArgumentsPass(*PR);
   initializeAMDGPUPromoteKernelArgumentsPass(*PR);
+  initializeAMDGPUSplitKernelArgumentsPass(*PR);
   initializeAMDGPULowerKernelAttributesPass(*PR);
   initializeAMDGPUExportKernelRuntimeHandlesLegacyPass(*PR);
   initializeAMDGPUPostLegalizerCombinerPass(*PR);
@@ -876,8 +877,10 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
                                             OptimizationLevel Level,
                                             ThinOrFullLTOPhase Phase) {
     if (Level != OptimizationLevel::O0) {
-      if (!isLTOPreLink(Phase))
+      if (!isLTOPreLink(Phase)) {
+        MPM.addPass(AMDGPUSplitKernelArgumentsPass());
         MPM.addPass(AMDGPUAttributorPass(*this));
+      }
     }
   });
 
@@ -896,6 +899,7 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
             PM.addPass(InternalizePass(mustPreserveGV));
             PM.addPass(GlobalDCEPass());
           }
+          PM.addPass(AMDGPUSplitKernelArgumentsPass());
           if (EnableAMDGPUAttributor) {
             AMDGPUAttributorOptions Opt;
             if (HasClosedWorldAssumption)
@@ -1237,6 +1241,9 @@ void AMDGPUPassConfig::addIRPasses() {
     addPass(createAMDGPULowerModuleLDSLegacyPass(&TM));
   }
 
+  if (TM.getOptLevel() > CodeGenOptLevel::None) {
+    addPass(createAMDGPUSplitKernelArgumentsPass());
+  }
   if (TM.getOptLevel() > CodeGenOptLevel::None)
     addPass(createInferAddressSpacesPass());
 
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index 09a3096602fc3..bc30e24d92d2b 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -92,6 +92,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUPrintfRuntimeBinding.cpp
   AMDGPUPromoteAlloca.cpp
   AMDGPUPromoteKernelArguments.cpp
+  AMDGPUSplitKernelArguments.cpp
   AMDGPURegBankCombiner.cpp
   AMDGPURegBankLegalize.cpp
   AMDGPURegBankLegalizeHelper.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll
new file mode 100644
index 0000000000000..99de32f92aa7f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-split-kernel-args.ll
@@ -0,0 +1,120 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-attributes --check-globals all --version 5
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-split-kernel-arguments -amdgpu-enable-split-kernel-args -verify-machineinstrs < %s | FileCheck %s
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-split-kernel-arguments -amdgpu-enable-split-kernel-args -verify-machineinstrs < %s | llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -verify-machineinstrs | FileCheck --check-prefix=GCN %s
+;
+; The LLVM IR is from the following HIP program:
+;
+; struct A {
+; int i;
+; char c;
+; long l;
+; int *p;
+; };
+
+; struct B {
+; char c;
+; A a1;
+; int i;
+; A a2;
+; };
+;
+; __global__ void test(int *out, int i, A a, char c, B b) {
+;  *out = i + a.l + c + a.l + b.a1.c;
+;  b.a2.p[2] = a.l + b.a2.c;
+;}
+;
+%struct.A = type { i32, i8, i64, ptr }
+%struct.B = type { i8, %struct.A, i32, %struct.A }
+
+define amdgpu_kernel void @_Z4testPii1Ac1B(ptr addrspace(1) noundef writeonly captures(none) initializes((0, 4)) %out.coerce, i32 noundef %i, ptr addrspace(4) noundef readonly byref(%struct.A) align 8 captures(none) %0, i8 noundef %c, ptr addrspace(4) noundef readonly byref(%struct.B) align 8 captures(none) %1) {
+; CHECK-LABEL: define amdgpu_kernel void @_Z4testPii1Ac1B(
+; CHECK-SAME: ptr addrspace(1) noundef writeonly captures(none) initializes((0, 4)) [[OUT_COERCE:%.*]], i32 noundef [[I:%.*]], i64 [[COERCE_SROA_1_0_COPYLOAD:%.*]], i8 noundef [[C:%.*]], ptr addrspace(1) [[COERCE1_SROA_318_0_COPYLOAD:%.*]], i8 [[COERCE1_SROA_217_0_COPYLOAD:%.*]], i8 [[COERCE1_SROA_1_0_COPYLOAD:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[TMP0:%.*]] = addrspacecast ptr addrspace(1) [[COERCE1_SROA_318_0_COPYLOAD]] to ptr
+; CHECK-NEXT:    [[TMP1:%.*]] = addrspacecast ptr [[TMP0]] to ptr addrspace(1)
+; CHECK-NEXT:    [[CONV:%.*]] = zext i32 [[I]] to i64
+; CHECK-NEXT:    [[CONV3:%.*]] = sext i8 [[C]] to i64
+; CHECK-NEXT:    [[CONV8:%.*]] = sext i8 [[COERCE1_SROA_1_0_COPYLOAD]] to i64
+; CHECK-NEXT:    [[FACTOR:%.*]] = shl i64 [[COERCE_SROA_1_0_COPYLOAD]], 1
+; CHECK-NEXT:    [[ADD4:%.*]] = add nsw i64 [[CONV3]], [[CONV]]
+; CHECK-NEXT:    [[ADD6:%.*]] = add i64 [[ADD4]], [[FACTOR]]
+; CHECK-NEXT:    [[ADD9:%.*]] = add i64 [[ADD6]], [[CONV8]]
+; CHECK-NEXT:    [[CONV10:%.*]] = trunc i64 [[ADD9]] to i32
+; CHECK-NEXT:    store i32 [[CONV10]], ptr addrspace(1) [[OUT_COERCE]], align 4
+; CHECK-NEXT:    [[CONV13:%.*]] = sext i8 [[COERCE1_SROA_217_0_COPYLOAD]] to i64
+; CHECK-NEXT:    [[ADD14:%.*]] = add nsw i64 [[COERCE_SROA_1_0_COPYLOAD]], [[CONV13]]
+; CHECK-NEXT:    [[CONV15:%.*]] = trunc i64 [[ADD14]] to i32
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i8, ptr addrspace(1) [[TMP1]], i64 8
+; CHECK-NEXT:    store i32 [[CONV15]], ptr addrspace(1) [[ARRAYIDX]], align 4
+; CHECK-NEXT:    ret void
+;
+entry:
+  %coerce.sroa.1.0..sroa_idx = getelementptr inbounds nuw i8, ptr addrspace(4) %0, i64 8
+  %coerce.sroa.1.0.copyload = load i64, ptr addrspace(4) %coerce.sroa.1.0..sroa_idx, align 8
+  %coerce1.sroa.1.0..sroa_idx = getelementptr inbounds nuw i8, ptr addrspace(4) %1, i64 12
+  %coerce1.sroa.1.0.copyload = load i8, ptr addrspace(4) %coerce1.sroa.1.0..sroa_idx, align 4
+  %coerce1.sroa.217.0..sroa_idx = getelementptr inbounds nuw i8, ptr addrspace(4) %1, i64 44
+  %coerce1.sroa.217.0.copyload = load i8, ptr addrspace(4) %coerce1.sroa.217.0..sroa_idx, align 4
+  %coerce1.sroa.318.0..sroa_idx = getelementptr inbounds nuw i8, ptr addrspace(4) %1, i64 56
+  %coerce1.sroa.318.0.copyload = load ptr, ptr addrspace(4) %coerce1.sroa.318.0..sroa_idx, align 8
+  %2 = addrspacecast ptr %coerce1.sroa.318.0.copyload to ptr addrspace(1)
+  %conv = zext i32 %i to i64
+  %conv3 = sext i8 %c to i64
+  %conv8 = sext i8 %coerce1.sroa.1.0.copyload to i64
+  %factor = shl i64 %coerce.sroa.1.0.copyload, 1
+  %add4 = add nsw i64 %conv3, %conv
+  %add6 = add i64 %add4, %factor
+  %add9 = add i64 %add6, %conv8
+  %conv10 = trunc i64 %add9 to i32
+  store i32 %conv10, ptr addrspace(1) %out.coerce, align 4
+  %conv13 = sext i8 %coerce1.sroa.217.0.copyload to i64
+  %add14 = add nsw i64 %coerce.sroa.1.0.copyload, %conv13
+  %conv15 = trunc i64 %add14 to i32
+  %arrayidx = getelementptr inbounds nuw i8, ptr addrspace(1) %2, i64 8
+  store i32 %conv15, ptr addrspace(1) %arrayidx, align 4
+  ret void
+}
+
+;.
+; CHECK: attributes #[[ATTR0]] = { "amdgpu-argument-mapping"="2:2:8,4:4:56,5:4:44,6:4:12" }
+;.
+
+; GCN:   - .address_space:  global
+; GCN:     .name:           out.coerce
+; GCN:     .offset:         0
+; GCN:     .size:           8
+; GCN:     .value_kind:     global_buffer
+; GCN:   - .name:           i
+; GCN:     .offset:         8
+; GCN:     .size:           4
+; GCN:     .value_kind:     by_value
+; GCN:   - .name:           coerce.sroa.1.0.copyload
+; GCN:     .offset:         16
+; GCN:     .original_arg_index: 2
+; GCN:     .original_arg_offset: 8
+; GCN:     .size:           8
+; GCN:     .value_kind:     by_value
+; GCN:   - .name:           c
+; GCN:     .offset:         24
+; GCN:     .size:           1
+; GCN:     .value_kind:     by_value
+; GCN:   - .address_space:  global
+; GCN:     .name:           coerce1.sroa.318.0.copyload
+; GCN:     .offset:         32
+; GCN:     .original_arg_index: 4
+; GCN:     .original_arg_offset: 56
+; GCN:     .size:           8
+; GCN:     .value_kind:     global_buffer
+; GCN:   - .name:           coerce1.sroa.217.0.copyload
+; GCN:     .offset:         40
+; GCN:     .original_arg_index: 4
+; GCN:     .original_arg_offset: 44
+; GCN:     .size:           1
+; GCN:     .value_kind:     by_value
+; GCN:   - .name:           coerce1.sroa.1.0.copyload
+; GCN:     .offset:         41
+; GCN:     .original_arg_index: 4
+; GCN:     .original_arg_offset: 12
+; GCN:     .size:           1
+; GCN:     .value_kind:     by_value
+
diff --git a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
index 4b6cc32522f5b..44c53c7c20a0d 100644
--- a/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
+++ b/llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
@@ -191,6 +191,7 @@
 ; GCN-O1-NEXT:    Externalize enqueued block runtime handles
 ; GCN-O1-NEXT:    AMDGPU Software lowering of LDS
 ; GCN-O1-NEXT:    Lower uses of LDS variables from non-kernel functions
+; GCN-O1-NEXT:    AMDGPU Split Kernel Arguments
 ; GCN-O1-NEXT:    FunctionPass Manager
 ; GCN-O1-NEXT:      Infer address spaces
 ; GCN-O1-NEXT:      Dominator Tree Construction
@@ -476,6 +477,7 @@
 ; GCN-O1-OPTS-NEXT:    Externalize enqueued block runtime handles
 ; GCN-O1-OPTS-NEXT:    AMDGPU Software lowering of LDS
 ; GCN-O1-OPTS-NEXT:    Lower uses of LDS variables from non-kernel functions
+; GCN-O1-OPTS-NEXT:    AMDGPU Split Kernel Arguments
 ; GCN-O1-OPTS-NEXT:    FunctionPass Manager
 ; GCN-O1-OPTS-NEXT:      Infer address spaces
 ; GCN-O1-OPTS-NEXT:      Dominator Tree Construction
@@ -791,6 +793,7 @@
 ; GCN-O2-NEXT:    Externalize enqueued block runtime handles
 ; GCN-O2-NEXT:    AMDGPU Software lowering of LDS
 ; GCN-O2-NEXT:    Lower uses of LDS variables from non-kernel functions
+; GCN-O2-NEXT:    AMDGPU Split Kernel Arguments
 ; GCN-O2-NEXT:    FunctionPass Manager
 ; GCN-O2-NEXT:      Infer address spaces
 ; GCN-O2-NEXT:      Dominator Tree Construction
@@ -1110,6 +1113,7 @@
 ; GCN-O3-NEXT:    Externalize enqueued block runtime handles
 ; GCN-O3-NEXT:    AMDGPU Software lowering of LDS
 ; GCN-O3-NEXT:    Lower uses of LDS variables from non-kernel functions
+; GCN-O3-NEXT:    AMDGPU Split Kernel Arguments
 ; GCN-O3-NEXT:    FunctionPass Manager
 ; GCN-O3-NEXT:      Infer address spaces
 ; GCN-O3-NEXT:      Dominator Tree Construction



More information about the llvm-commits mailing list