[all-commits] [llvm/llvm-project] 49bc3f: [LowerMemIntrinsics] Optimize memset lowering

Wed Dec 3 00:55:58 PST 2025

  Branch: refs/heads/users/ritter-x2a/08-20-_lowermemintrinsics_optimize_memset_lowering
  Home:   https://github.com/llvm/llvm-project
  Commit: 49bc3f002d40f09d568952715a824a1e0fd2ed5d
      https://github.com/llvm/llvm-project/commit/49bc3f002d40f09d568952715a824a1e0fd2ed5d
  Author: Fabian Ritter <fabian.ritter at amd.com>
  Date:   2025-12-03 (Wed, 03 Dec 2025)

  Changed paths:
    M llvm/include/llvm/Transforms/Utils/LowerMemIntrinsics.h
    M llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
    M llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
    M llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
    M llvm/lib/Target/NVPTX/NVPTXLowerAggrCopies.cpp
    M llvm/lib/Target/SPIRV/SPIRVPrepareFunctions.cpp
    M llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
    M llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
    M llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
    M llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll
    M llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics-threshold.ll
    M llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
    M llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
    A llvm/test/CodeGen/AMDGPU/memset-param-combinations.ll
    M llvm/test/CodeGen/NVPTX/lower-aggr-copies.ll
    M llvm/test/CodeGen/SPIRV/llvm-intrinsics/memset.ll
    M llvm/test/Transforms/PreISelIntrinsicLowering/X86/memset-inline-non-constant-len.ll

  Log Message:
  -----------
  [LowerMemIntrinsics] Optimize memset lowering

This patch changes the memset lowering to match the optimized memcpy lowering.
The memset lowering now queries TTI.getMemcpyLoopLoweringType for a preferred
memory access type. If that type is larger than a byte, the memset is lowered
into two loops: a main loop that stores a sufficiently wide vector splat of the
SetValue with the preferred memory access type and a residual loop that covers
the remaining bytes individually. If the memset size is statically known, the
residual loop is replaced by a sequence of stores.
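
As a rough illustration of the two-loop shape described above, here is a minimal sketch in plain C++. It is not the IR the pass emits. The function name memsetTwoLoops and the 16-byte width kWideBytes are hypothetical: the width stands in for whatever access type TTI.getMemcpyLoopLoweringType would report on a given target.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed preferred access width in bytes (hypothetical; a real target
// would report its own type through TTI.getMemcpyLoopLoweringType).
constexpr std::size_t kWideBytes = 16;

// Illustrative two-loop memset: wide splat stores, then a byte-wise tail.
void memsetTwoLoops(void *dst, std::uint8_t value, std::size_t len) {
  auto *p = static_cast<unsigned char *>(dst);

  // Build the "splat": the set value replicated across one wide access.
  unsigned char splat[kWideBytes];
  for (std::size_t i = 0; i < kWideBytes; ++i)
    splat[i] = value;

  // Main loop: covers the largest multiple of kWideBytes that fits in len.
  const std::size_t mainBytes = len - (len % kWideBytes);
  std::size_t i = 0;
  for (; i < mainBytes; i += kWideBytes)
    std::memcpy(p + i, splat, kWideBytes); // one wide store per iteration

  // Residual loop: covers the remaining (len % kWideBytes) bytes one by one.
  for (; i < len; ++i)
    p[i] = value;
}
```

For example, memsetTwoLoops(buf, 0xAB, 37) would perform two 16-byte stores followed by five single-byte stores. When the length is a compile-time constant, the residual loop in this sketch corresponds to the part that the patch replaces with a straight-line sequence of stores.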

This improves memset performance on gfx1030 (AMDGPU) in microbenchmarks by
around 7-20x.

I'm planning a similar treatment for memset.pattern in a follow-up PR.

For SWDEV-543208.
