[llvm] AMDGPU: share LDS budget logic and add experimental LDS buffering pass (PR #166388)

Tue Nov 4 17:58:48 PST 2025

================
@@ -0,0 +1,341 @@
+//===-- AMDGPULDSBuffering.cpp - Per-thread LDS buffering -----------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass buffers per-thread global memory accesses through LDS
+// (addrspace(3)) to improve performance in memory-bound kernels. The main
+// purpose is to alleviate global memory contention and cache thrashing when
+// the same global pointer is used for both load and store operations.
+//
+// The pass runs late in the pipeline, after SROA and AMDGPUPromoteAlloca,
+// using only leftover LDS budget to avoid interfering with other LDS
+// optimizations. It respects the same LDS budget constraints as
+// AMDGPUPromoteAlloca, ensuring that LDS usage remains within occupancy
+// tier limits.
+//
+// Current implementation handles the simplest pattern: a load from global
+// memory whose only use is a store back to the same pointer. This pattern
+// is transformed into a pair of memcpy operations (global->LDS and
+// LDS->global), effectively moving the value through LDS instead of
+// accessing global memory directly.
+//
+// This pass was inspired by finding that some rocrand performance tests
+// show better performance when global memory is buffered through LDS
+// instead of being loaded/stored to registers directly. This optimization
+// is experimental and must be enabled via the -amdgpu-enable-lds-buffering
+// flag.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "GCNSubtarget.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "Utils/AMDGPULDSUtils.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/PassManager.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/Alignment.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Target/TargetMachine.h"
+
+#define DEBUG_TYPE "amdgpu-lds-buffering"
+
+using namespace llvm;
+
+namespace {
+
+static cl::opt<unsigned>
+    LDSBufferingMaxBytes("amdgpu-lds-buffering-max-bytes",
+                         cl::desc("Max byte size for LDS buffering candidates"),
+                         cl::init(64));
----------------
arsenm wrote:

Should be pass parameter 

https://github.com/llvm/llvm-project/pull/166388