[all-commits] [llvm/llvm-project] a4fd3d: [AMDGPU] Use wider loop lowering type for LowerMem...

Mon Oct 28 01:04:40 PDT 2024

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: a4fd3dba6e285734bc635b0651a30dfeffedeada
      https://github.com/llvm/llvm-project/commit/a4fd3dba6e285734bc635b0651a30dfeffedeada
  Author: Fabian Ritter <fabian.ritter at amd.com>
  Date:   2024-10-28 (Mon, 28 Oct 2024)

  Changed paths:
    M llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
    M llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
    M llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
    A llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll

  Log Message:
  -----------
  [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332)

When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.
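
As a rough illustration of the idea, here is a conceptual model in plain
C++. It is not the IR that LowerMemIntrinsics.cpp emits, and the names
copyWide and LegalBytes, the 16-byte legal access width, and the unroll
factors used are all assumptions made for this sketch. Each main-loop
iteration moves one "wide" chunk, standing in for a too-wide access that
legalization would later split into several legal load/store pairs:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical width of one legal load/store in bytes (assumption).
constexpr std::size_t LegalBytes = 16;

// Copies Len bytes from Src to Dst. Each main-loop iteration moves
// Unroll * LegalBytes bytes, modelling a wide loop lowering type that
// would later be split into Unroll legal load/store pairs.
// Returns the number of main-loop iterations executed.
template <std::size_t Unroll>
std::size_t copyWide(unsigned char *Dst, const unsigned char *Src,
                     std::size_t Len) {
  static_assert(Unroll > 0, "unroll factor must be positive");
  assert(Dst != nullptr && Src != nullptr);
  constexpr std::size_t WideBytes = Unroll * LegalBytes;

  const std::size_t Iters = Len / WideBytes;
  for (std::size_t I = 0; I < Iters; ++I) {
    // One wide access per iteration.
    std::memcpy(Dst + I * WideBytes, Src + I * WideBytes, WideBytes);
  }

  // Residual bytes that do not fill a whole wide access.
  const std::size_t Done = Iters * WideBytes;
  std::memcpy(Dst + Done, Src + Done, Len - Done);
  return Iters;
}
```

With these assumed numbers, copyWide<1> needs 64 iterations for a
1024-byte copy, while copyWide<4> needs 16.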

This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths are left unchanged because this
change would slow them down whenever the dynamic length is smaller than,
or only slightly larger than, what one unrolled iteration copies.
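
The concern about short dynamic lengths can be made concrete with a
little arithmetic. The helper below is a hypothetical sketch (splitCopy
and CopySplit are illustrative names, and the 64-byte wide access used
in the note after it is an assumed value, not one taken from the patch).
It splits a length into the part covered by the wide main loop and the
part left to the residual path:

```cpp
#include <cassert>
#include <cstddef>

struct CopySplit {
  std::size_t WideIters;     // iterations of the wide (unrolled) main loop
  std::size_t ResidualBytes; // bytes left over for the residual path
};

// Splits a copy of Len bytes for a main loop that moves WideBytes bytes
// per iteration. WideBytes is an illustrative parameter (assumption).
CopySplit splitCopy(std::size_t Len, std::size_t WideBytes) {
  assert(WideBytes != 0);
  return CopySplit{Len / WideBytes, Len % WideBytes};
}
```

Assuming a 64-byte wide access, a 48-byte copy never enters the main
loop (0 iterations, 48 residual bytes), and an 80-byte copy runs it once
and still leaves 16 residual bytes. A 1024-byte copy, by contrast, is
covered entirely by 16 main-loop iterations.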

The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.

Part of SWDEV-455845.
