[llvm] [AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. (PR #87265)

Matt Arsenault via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 12 11:51:30 PDT 2024


================
@@ -0,0 +1,1335 @@
+//===-- AMDGPUSwLowerLDS.cpp -----------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass lowers local data share (LDS) uses in kernel and non-kernel
+// functions in the module to use dynamically allocated global memory.
+// Packed LDS Layout is emulated in the global memory.
+// The lowered memory instructions from LDS to global memory are then
+// instrumented for address sanitizer, to catch addressing errors.
+//
+// Replacement of Kernel LDS accesses:
+//    For a kernel, LDS access can be static or dynamic which are direct
+//    (accessed within kernel) and indirect (accessed through non-kernels).
+//    All these LDS accesses corresponding to kernel will be packed together,
+//    where all static LDS accesses will be allocated first and then dynamic
+//    LDS follows. The total size with alignment is calculated. A new LDS global
+//    will be created for the kernel called "SW LDS" and it will have the
+//    attribute "amdgpu-lds-size" attached with value of the size calculated.
+//    All the LDS accesses in the module will be replaced by GEP with offset
+//    into the "SW LDS".
+//    A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing
+//    the dynamic LDS. This will be marked used by kernel and will have
+//    MD_absolute_symbol metadata set to total static LDS size, since dynamic
+//    LDS allocation starts after all static LDS allocation.
+//
+//    A device global memory equal to the total LDS size will be allocated.
+//    At the prologue of the kernel, a single work-item from the
+//    work-group, does a "malloc" and stores the pointer of the
+//    allocation in "SW LDS".
+//
+//    To store the offsets corresponding to all LDS accesses, another global
+//    variable is created which will be called "SW LDS metadata" in this pass.
+//    - SW LDS Global:
+//        It is LDS global of ptr type with name
+//        "llvm.amdgcn.sw.lds.<kernel-name>".
+//    - Metadata Global:
+//        It is of struct type, with n members. n equals the number of LDS
+//        globals accessed by the kernel (direct and indirect). Each member of
+//        struct is another struct of type {i32, i32, i32}. First member
+//        corresponds to offset, second member corresponds to size of LDS global
+//        being replaced and third represents the total aligned size. It will
+//        have name "llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have
+//        an initializer with static LDS related offsets and sizes initialized.
+//        But for dynamic LDS related entries, offsets will be initialized to
+//        previous static LDS allocation end offset. Sizes for them will be zero
+//        initially. These dynamic LDS offset and size values will be updated
+//        within the kernel, since kernel can read the dynamic LDS size
+//        allocation done at runtime with query to "hidden_dynamic_lds_size"
+//        hidden kernel argument.
+//
+//    At the epilogue of kernel, allocated memory would be made free by the same
+//    single work-item.
+//
+// Replacement of non-kernel LDS accesses:
+//    Multiple kernels can access the same non-kernel function.
+//    All the kernels accessing LDS through non-kernels are sorted and
+//    assigned a kernel-id. All the LDS globals accessed by non-kernels
+//    are sorted. This information is used to build two tables:
+//    - Base table:
+//        Base table will have single row, with elements of the row
+//        placed as per kernel ID. Each element in the row corresponds
+//        to ptr of "SW LDS" variable created for that kernel.
+//    - Offset table:
+//        Offset table will have multiple rows and columns.
+//        Rows are assumed to be from 0 to (n-1). n is total number
+//        of kernels accessing the LDS through non-kernels.
+//        Each row will have m elements. m is the total number of
+//        unique LDS globals accessed by all non-kernels.
+//        Each element in the row corresponds to the ptr of
+//        the replacement of LDS global done by that particular kernel.
+//    An LDS variable in non-kernel will be replaced based on the information
+//    from base and offset tables. Based on kernel-id query, ptr of "SW
+//    LDS" for that corresponding kernel is obtained from base table.
+//    The Offset into the base "SW LDS" is obtained from
+//    corresponding element in offset table. With this information, replacement
+//    value is obtained.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUAsanInstrumentation.h"
+#include "AMDGPUTargetMachine.h"
+#include "Utils/AMDGPUMemoryUtils.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
+#include "llvm/ADT/SetOperations.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/DIBuilder.h"
+#include "llvm/IR/DebugInfo.h"
+#include "llvm/IR/DebugInfoMetadata.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/ReplaceConstant.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Instrumentation/AddressSanitizerCommon.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
+
+#include <algorithm>
+
+#define DEBUG_TYPE "amdgpu-sw-lower-lds"
+#define COV5_HIDDEN_DYN_LDS_SIZE_ARG 15
+
+using namespace llvm;
+using namespace AMDGPU;
+
+namespace {
+
+cl::opt<bool>
+    AsanInstrumentLDS("amdgpu-asan-instrument-lds",
+                      cl::desc("Run asan instrumentation on LDS instructions "
+                               "lowered to global memory"),
+                      cl::init(true), cl::Hidden);
+
+using DomTreeCallback = function_ref<DominatorTree *(Function &F)>;
+
+struct LDSAccessTypeInfo {
+  SetVector<GlobalVariable *> StaticLDSGlobals;
+  SetVector<GlobalVariable *> DynamicLDSGlobals;
+};
+
+// Struct to hold all the metadata required for a kernel
+// to replace uses of an LDS global with the corresponding offset
+// into device global memory.
+struct KernelLDSParameters {
+  GlobalVariable *SwLDS = nullptr;
+  GlobalVariable *SwDynLDS = nullptr;
+  GlobalVariable *SwLDSMetadata = nullptr;
+  LDSAccessTypeInfo DirectAccess;
+  LDSAccessTypeInfo IndirectAccess;
+  DenseMap<GlobalVariable *, SmallVector<uint32_t, 3>>
+      LDSToReplacementIndicesMap;
+  uint32_t MallocSize = 0;
+  uint32_t LDSSize = 0;
+  SmallVector<std::pair<uint32_t, uint32_t>, 64> RedzoneOffsetAndSizeVector;
+};
+
+// Struct to store info for creation of offset table
+// for all the non-kernel LDS accesses.
+struct NonKernelLDSParameters {
+  GlobalVariable *LDSBaseTable = nullptr;
+  GlobalVariable *LDSOffsetTable = nullptr;
+  SetVector<Function *> OrderedKernels;
+  SetVector<GlobalVariable *> OrdereLDSGlobals;
+};
+
+struct AsanInstrumentInfo {
+  int Scale = 0;
+  uint32_t Offset = 0;
+  SetVector<Instruction *> Instructions;
+};
+
+struct FunctionsAndLDSAccess {
+  DenseMap<Function *, KernelLDSParameters> KernelToLDSParametersMap;
+  SetVector<Function *> KernelsWithIndirectLDSAccess;
+  SetVector<Function *> NonKernelsWithLDSArgument;
+  SetVector<GlobalVariable *> AllNonKernelLDSAccess;
+  FunctionVariableMap NonKernelToLDSAccessMap;
+};
+
+class AMDGPUSwLowerLDS {
+public:
+  AMDGPUSwLowerLDS(Module &Mod, const AMDGPUTargetMachine &TM,
+                   DomTreeCallback Callback)
+      : M(Mod), AMDGPUTM(TM), IRB(M.getContext()), DTCallback(Callback) {}
+  bool run();
+  void getUsesOfLDSByNonKernels();
+  void getNonKernelsWithLDSArguments(const CallGraph &CG);
+  SetVector<Function *>
+  getOrderedIndirectLDSAccessingKernels(SetVector<Function *> &Kernels);
+  SetVector<GlobalVariable *>
+  getOrderedNonKernelAllLDSGlobals(SetVector<GlobalVariable *> &Variables);
+  void buildSwLDSGlobal(Function *Func);
+  void buildSwDynLDSGlobal(Function *Func);
+  void populateSwMetadataGlobal(Function *Func);
+  void populateSwLDSAttributeAndMetadata(Function *Func);
+  void populateLDSToReplacementIndicesMap(Function *Func);
+  void getLDSMemoryInstructions(Function *Func,
+                                SetVector<Instruction *> &LDSInstructions);
+  void replaceKernelLDSAccesses(Function *Func);
+  Value *getTranslatedGlobalMemoryGEPOfLDSPointer(Value *LoadMallocPtr,
+                                                  Value *LDSPtr);
+  void translateLDSMemoryOperationsToGlobalMemory(
+      Function *Func, Value *LoadMallocPtr,
+      SetVector<Instruction *> &LDSInstructions);
+  void poisonRedzones(Function *Func, Value *MallocPtr);
+  void lowerKernelLDSAccesses(Function *Func, DomTreeUpdater &DTU);
+  void buildNonKernelLDSOffsetTable(NonKernelLDSParameters &NKLDSParams);
+  void buildNonKernelLDSBaseTable(NonKernelLDSParameters &NKLDSParams);
+  Constant *
+  getAddressesOfVariablesInKernel(Function *Func,
+                                  SetVector<GlobalVariable *> &Variables);
+  void lowerNonKernelLDSAccesses(Function *Func,
+                                 SetVector<GlobalVariable *> &LDSGlobals,
+                                 NonKernelLDSParameters &NKLDSParams);
+  void
+  updateMallocSizeForDynamicLDS(Function *Func, Value **CurrMallocSize,
+                                Value *HiddenDynLDSSize,
+                                SetVector<GlobalVariable *> &DynamicLDSGlobals);
+  void initAsanInfo();
+
+private:
+  Module &M;
+  const AMDGPUTargetMachine &AMDGPUTM;
+  IRBuilder<> IRB;
+  DomTreeCallback DTCallback;
+  FunctionsAndLDSAccess FuncLDSAccessInfo;
+  AsanInstrumentInfo AsanInfo;
+};
+
+template <typename T> SetVector<T> sortByName(std::vector<T> &&V) {
+  // Sort the vector of globals or Functions based on their name.
+  // Returns a SetVector of globals/Functions.
+  sort(V, [](const auto *L, const auto *R) {
+    return L->getName() < R->getName();
+  });
+  return {SetVector<T>(V.begin(), V.end())};
+}
+
+SetVector<GlobalVariable *> AMDGPUSwLowerLDS::getOrderedNonKernelAllLDSGlobals(
+    SetVector<GlobalVariable *> &Variables) {
+  // Sort all the non-kernel LDS accesses based on their name.
+  return sortByName(
+      std::vector<GlobalVariable *>(Variables.begin(), Variables.end()));
+}
+
+SetVector<Function *> AMDGPUSwLowerLDS::getOrderedIndirectLDSAccessingKernels(
+    SetVector<Function *> &Kernels) {
+  // Sort the kernels accessing LDS indirectly based on their name.
+  // Also assign a kernel ID metadata based on the sorted order.
+  LLVMContext &Ctx = M.getContext();
+  if (Kernels.size() > UINT32_MAX) {
+    report_fatal_error("Unimplemented SW LDS lowering for > 2**32 kernels");
+  }
+  SetVector<Function *> OrderedKernels =
+      sortByName(std::vector<Function *>(Kernels.begin(), Kernels.end()));
+  for (size_t i = 0; i < Kernels.size(); i++) {
+    Metadata *AttrMDArgs[1] = {
+        ConstantAsMetadata::get(IRB.getInt32(i)),
+    };
+    Function *Func = OrderedKernels[i];
+    Func->setMetadata("llvm.amdgcn.lds.kernel.id",
+                      MDNode::get(Ctx, AttrMDArgs));
+  }
+  return OrderedKernels;
+}
+
+void AMDGPUSwLowerLDS::getNonKernelsWithLDSArguments(const CallGraph &CG) {
+  // Among the kernels accessing LDS, get list of
+  // Non-kernels to which a call is made and a ptr
+  // to addrspace(3) is passed as argument.
+  for (auto &K : FuncLDSAccessInfo.KernelToLDSParametersMap) {
+    Function *Func = K.first;
+    const CallGraphNode *CGN = CG[Func];
+    if (!CGN)
+      continue;
+    for (auto &I : *CGN) {
+      CallGraphNode *CalleeCGN = I.second;
+      Function *CalledFunc = CalleeCGN->getFunction();
+      if (!CalledFunc)
+        continue;
+      if (AMDGPU::isKernelLDS(CalledFunc))
+        continue;
+      for (auto AI = CalledFunc->arg_begin(), E = CalledFunc->arg_end();
+           AI != E; ++AI) {
+        Type *ArgTy = (*AI).getType();
+        if (!ArgTy->isPointerTy())
+          continue;
+        if (ArgTy->getPointerAddressSpace() != AMDGPUAS::LOCAL_ADDRESS)
+          continue;
+        FuncLDSAccessInfo.NonKernelsWithLDSArgument.insert(CalledFunc);
+        // Also add the Calling function to KernelsWithIndirectLDSAccess list
+        // so that base table of LDS is generated.
+        FuncLDSAccessInfo.KernelsWithIndirectLDSAccess.insert(Func);
+      }
+    }
+  }
+}
+
+void AMDGPUSwLowerLDS::getUsesOfLDSByNonKernels() {
+  for (GlobalVariable *GV : FuncLDSAccessInfo.AllNonKernelLDSAccess) {
+    if (!AMDGPU::isLDSVariableToLower(*GV))
+      continue;
+
+    for (User *V : GV->users()) {
+      if (auto *I = dyn_cast<Instruction>(V)) {
+        Function *F = I->getFunction();
+        if (!isKernelLDS(F) && F->hasFnAttribute(Attribute::SanitizeAddress))
+          FuncLDSAccessInfo.NonKernelToLDSAccessMap[F].insert(GV);
+      }
+    }
+  }
+}
+
+static void recordLDSAbsoluteAddress(Module &M, GlobalVariable *GV,
+                                     uint32_t Address) {
+  // Write the specified address into metadata where it can be retrieved by
+  // the assembler. Format is a half open range, [Address, Address+1)
+  LLVMContext &Ctx = M.getContext();
+  auto *IntTy = M.getDataLayout().getIntPtrType(Ctx, AMDGPUAS::LOCAL_ADDRESS);
+  auto *MinC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address));
+  auto *MaxC = ConstantAsMetadata::get(ConstantInt::get(IntTy, Address + 1));
+  GV->setMetadata(LLVMContext::MD_absolute_symbol,
+                  MDNode::get(Ctx, {MinC, MaxC}));
+}
+
+static void addLDSSizeAttribute(Function *Func, uint32_t Offset,
+                                bool IsDynLDS) {
+  if (Offset != 0) {
+    std::string Buffer;
+    raw_string_ostream SS{Buffer};
+    SS << format("%u", Offset);
+    if (IsDynLDS)
+      SS << format(",%u", Offset);
----------------
arsenm wrote:

Shouldn't need format, just << Offset should work 
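
A minimal sketch of what the suggestion amounts to. It uses std::ostringstream as a standard-library stand-in for llvm::raw_string_ostream so that it compiles without LLVM headers, and `formatLDSSize` is a hypothetical helper name, not something in the patch. The point is only that unsigned integers stream directly, so the `format("%u", ...)` wrapper adds nothing:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper mirroring the string built in the quoted hunk.
// std::ostringstream stands in for llvm::raw_string_ostream (assumption).
std::string formatLDSSize(uint32_t Offset, bool IsDynLDS) {
  std::ostringstream SS;
  SS << Offset; // integers stream directly; no format("%u", ...) needed
  if (IsDynLDS)
    SS << ',' << Offset; // second value after a comma, as in the hunk
  return SS.str();
}
```

With raw_string_ostream the body would read the same, since raw_ostream also provides `operator<<` overloads for unsigned integer types.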

https://github.com/llvm/llvm-project/pull/87265
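
For readers following the patch's header comment, the packed layout it describes (each metadata entry holding an offset, a size, and an aligned size) can be sketched roughly as below. This is a simplified illustration, not the pass's implementation: `LDSVar`, `MDEntry`, `alignUp`, and `packLayout` are hypothetical names, the aligned size is assumed to be the size rounded up to the variable's alignment, and the asan redzones and dynamic LDS handling in the real pass are left out.

```cpp
#include <cstdint>
#include <vector>

struct LDSVar {  // hypothetical description of one LDS global
  uint32_t Size;
  uint32_t Align; // assumed non-zero
};

struct MDEntry { // mirrors the {i32, i32, i32} entry from the header comment
  uint32_t Offset;
  uint32_t Size;
  uint32_t AlignedSize;
};

// Round V up to the next multiple of A.
uint32_t alignUp(uint32_t V, uint32_t A) { return (V + A - 1) / A * A; }

// Pack the variables back to back and report the total size.
std::vector<MDEntry> packLayout(const std::vector<LDSVar> &Vars,
                                uint32_t &Total) {
  std::vector<MDEntry> MD;
  uint32_t Offset = 0;
  for (const LDSVar &V : Vars) {
    Offset = alignUp(Offset, V.Align);           // place at aligned offset
    uint32_t Aligned = alignUp(V.Size, V.Align); // assumed aligned size
    MD.push_back({Offset, V.Size, Aligned});
    Offset += Aligned;
  }
  Total = Offset;
  return MD;
}
```

Under these assumptions, a non-kernel function would combine a kernel's base pointer with an entry's `Offset` to reach the replacement for a given LDS global.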

