[llvm] [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (PR #112332)
Fabian Ritter via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 15 01:31:23 PDT 2024
https://github.com/ritter-x2a created https://github.com/llvm/llvm-project/pull/112332
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in LowerMemIntrinsics.cpp, the loop consists of a single load/store pair per iteration. We can improve performance in some cases by emitting multiple load/store pairs per iteration. This patch achieves that by increasing the width of the loop lowering type in the GCN target and letting legalization split the resulting too-wide access pairs into multiple legal access pairs.
This change only affects lowered memcpys and memmoves with large (>= 1024 bytes) constant lengths. Smaller constant lengths are handled by ISel directly; non-constant lengths would be slowed down by this change if the dynamic length were smaller than, or only slightly larger than, what one unrolled iteration copies.
The chosen default unroll factor is the result of microbenchmarks on gfx1030. This change leads to speedups of 15-38% for global memory and 1.9-5.8x for scratch in these microbenchmarks.
Part of SWDEV-455845.
From e3a494303ecafc95b899340d8f70dca6a894ace9 Mon Sep 17 00:00:00 2001
From: Fabian Ritter <fabian.ritter at amd.com>
Date: Tue, 15 Oct 2024 04:20:12 -0400
Subject: [PATCH] [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair per
iteration. We can improve performance in some cases by emitting multiple
load/store pairs per iteration. This patch achieves that by increasing the
width of the loop lowering type in the GCN target and letting legalization
split the resulting too-wide access pairs into multiple legal access pairs.
This change only affects lowered memcpys and memmoves with large (>= 1024
bytes) constant lengths. Smaller constant lengths are handled by ISel directly;
non-constant lengths would be slowed down by this change if the dynamic length
were smaller than, or only slightly larger than, what one unrolled iteration copies.
The chosen default unroll factor is the result of microbenchmarks on gfx1030.
This change leads to speedups of 15-38% for global memory and 1.9-5.8x for
scratch in these microbenchmarks.
Part of SWDEV-455845.
---
.../AMDGPU/AMDGPUTargetTransformInfo.cpp | 55 +-
.../Transforms/Utils/LowerMemIntrinsics.cpp | 12 +
.../CodeGen/AMDGPU/lower-mem-intrinsics.ll | 156 +-
.../CodeGen/AMDGPU/memintrinsic-unroll.ll | 2663 +++++++++++++++++
4 files changed, 2799 insertions(+), 87 deletions(-)
create mode 100644 llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 8f9495d83cde2d..2fd34476a1fc9b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -75,6 +75,13 @@ static cl::opt<size_t> InlineMaxBB(
cl::desc("Maximum number of BBs allowed in a function after inlining"
" (compile time constraint)"));
+// This default unroll factor is based on microbenchmarks on gfx1030.
+static cl::opt<unsigned> MemcpyLoopUnroll(
+ "amdgpu-memcpy-loop-unroll",
+ cl::desc("Unroll factor (affecting 4x32-bit operations) to use for memory "
+ "operations when lowering memcpy as a loop, must be a power of 2"),
+ cl::init(16), cl::Hidden);
+
static bool dependsOnLocalPhi(const Loop *L, const Value *Cond,
unsigned Depth = 0) {
const Instruction *I = dyn_cast<Instruction>(Cond);
@@ -409,13 +416,8 @@ int64_t GCNTTIImpl::getMaxMemIntrinsicInlineSizeThreshold() const {
return 1024;
}
-// FIXME: Really we would like to issue multiple 128-bit loads and stores per
-// iteration. Should we report a larger size and let it legalize?
-//
// FIXME: Should we use narrower types for local/region, or account for when
// unaligned access is legal?
-//
-// FIXME: This could use fine tuning and microbenchmarks.
Type *GCNTTIImpl::getMemcpyLoopLoweringType(
LLVMContext &Context, Value *Length, unsigned SrcAddrSpace,
unsigned DestAddrSpace, Align SrcAlign, Align DestAlign,
@@ -442,9 +444,39 @@ Type *GCNTTIImpl::getMemcpyLoopLoweringType(
return FixedVectorType::get(Type::getInt32Ty(Context), 2);
}
- // Global memory works best with 16-byte accesses. Private memory will also
- // hit this, although they'll be decomposed.
- return FixedVectorType::get(Type::getInt32Ty(Context), 4);
+ // Global memory works best with 16-byte accesses.
+ // If the operation has a fixed known length that is large enough, it is
+ // worthwhile to return an even wider type and let legalization lower it into
+ // multiple accesses, effectively unrolling the memcpy loop. Private memory
+ // also hits this, although accesses may be decomposed.
+ //
+ // Don't unroll if
+ // - Length is not a constant, since unrolling leads to worse performance for
+ // length values that are smaller or slightly larger than the total size of
+ // the type returned here. Mitigating that would require a more complex
+ // lowering for variable-length memcpy and memmove.
+ // - the memory operations would be split further into byte-wise accesses
+ // because of their (mis)alignment, since that would lead to a huge code
+ // size increase.
+  unsigned I32EltsInVector = 4;
+ if (MemcpyLoopUnroll > 0 && isa<ConstantInt>(Length)) {
+ unsigned VectorSizeBytes = I32EltsInVector * 4;
+ unsigned VectorSizeBits = VectorSizeBytes * 8;
+ unsigned UnrolledVectorBytes = VectorSizeBytes * MemcpyLoopUnroll;
+ Align PartSrcAlign(commonAlignment(SrcAlign, UnrolledVectorBytes));
+ Align PartDestAlign(commonAlignment(DestAlign, UnrolledVectorBytes));
+
+ const SITargetLowering *TLI = this->getTLI();
+ bool SrcNotSplit = TLI->allowsMisalignedMemoryAccessesImpl(
+ VectorSizeBits, SrcAddrSpace, PartSrcAlign);
+ bool DestNotSplit = TLI->allowsMisalignedMemoryAccessesImpl(
+ VectorSizeBits, DestAddrSpace, PartDestAlign);
+ if (SrcNotSplit && DestNotSplit)
+ return FixedVectorType::get(Type::getInt32Ty(Context),
+ MemcpyLoopUnroll * I32EltsInVector);
+ }
+
+ return FixedVectorType::get(Type::getInt32Ty(Context), I32EltsInVector);
}
void GCNTTIImpl::getMemcpyLoopResidualLoweringType(
@@ -452,7 +484,6 @@ void GCNTTIImpl::getMemcpyLoopResidualLoweringType(
unsigned RemainingBytes, unsigned SrcAddrSpace, unsigned DestAddrSpace,
Align SrcAlign, Align DestAlign,
std::optional<uint32_t> AtomicCpySize) const {
- assert(RemainingBytes < 16);
if (AtomicCpySize)
BaseT::getMemcpyLoopResidualLoweringType(
@@ -462,6 +493,12 @@ void GCNTTIImpl::getMemcpyLoopResidualLoweringType(
Align MinAlign = std::min(SrcAlign, DestAlign);
if (MinAlign != Align(2)) {
+ Type *I32x4Ty = FixedVectorType::get(Type::getInt32Ty(Context), 4);
+ while (RemainingBytes >= 16) {
+ OpsOut.push_back(I32x4Ty);
+ RemainingBytes -= 16;
+ }
+
Type *I64Ty = Type::getInt64Ty(Context);
while (RemainingBytes >= 8) {
OpsOut.push_back(I64Ty);
diff --git a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
index ba62d75250c85e..7fce6fe355dccb 100644
--- a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
+++ b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
@@ -48,6 +48,9 @@ void llvm::createMemCpyLoopKnownSize(
Ctx, CopyLen, SrcAS, DstAS, SrcAlign, DstAlign, AtomicElementSize);
assert((!AtomicElementSize || !LoopOpType->isVectorTy()) &&
"Atomic memcpy lowering is not supported for vector operand type");
+ assert((DL.getTypeStoreSize(LoopOpType) == DL.getTypeAllocSize(LoopOpType)) &&
+ "Bytes are missed if store and alloc size of the LoopOpType do not "
+ "match");
unsigned LoopOpSize = DL.getTypeStoreSize(LoopOpType);
assert((!AtomicElementSize || LoopOpSize % *AtomicElementSize == 0) &&
@@ -199,6 +202,9 @@ void llvm::createMemCpyLoopUnknownSize(
Ctx, CopyLen, SrcAS, DstAS, SrcAlign, DstAlign, AtomicElementSize);
assert((!AtomicElementSize || !LoopOpType->isVectorTy()) &&
"Atomic memcpy lowering is not supported for vector operand type");
+ assert((DL.getTypeStoreSize(LoopOpType) == DL.getTypeAllocSize(LoopOpType)) &&
+ "Bytes are missed if store and alloc size of the LoopOpType do not "
+ "match");
unsigned LoopOpSize = DL.getTypeStoreSize(LoopOpType);
assert((!AtomicElementSize || LoopOpSize % *AtomicElementSize == 0) &&
"Atomic memcpy lowering is not supported for selected operand size");
@@ -411,6 +417,9 @@ static void createMemMoveLoopUnknownSize(Instruction *InsertBefore,
Type *LoopOpType = TTI.getMemcpyLoopLoweringType(Ctx, CopyLen, SrcAS, DstAS,
SrcAlign, DstAlign);
+ assert((DL.getTypeStoreSize(LoopOpType) == DL.getTypeAllocSize(LoopOpType)) &&
+ "Bytes are missed if store and alloc size of the LoopOpType do not "
+ "match");
unsigned LoopOpSize = DL.getTypeStoreSize(LoopOpType);
Type *Int8Type = Type::getInt8Ty(Ctx);
bool LoopOpIsInt8 = LoopOpType == Int8Type;
@@ -668,6 +677,9 @@ static void createMemMoveLoopKnownSize(Instruction *InsertBefore,
Type *LoopOpType = TTI.getMemcpyLoopLoweringType(Ctx, CopyLen, SrcAS, DstAS,
SrcAlign, DstAlign);
+ assert((DL.getTypeStoreSize(LoopOpType) == DL.getTypeAllocSize(LoopOpType)) &&
+ "Bytes are missed if store and alloc size of the LoopOpType do not "
+ "match");
unsigned LoopOpSize = DL.getTypeStoreSize(LoopOpType);
// Calculate the loop trip count and remaining bytes to copy after the loop.
diff --git a/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll b/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
index 9e2e37a886d1fe..f15202e9105301 100644
--- a/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
+++ b/llvm/test/CodeGen/AMDGPU/lower-mem-intrinsics.ll
@@ -396,7 +396,7 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
; MAX1024-NEXT: [[TMP15:%.*]] = icmp ult i64 [[TMP14]], [[TMP2]]
; MAX1024-NEXT: br i1 [[TMP15]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION:%.*]]
; MAX1024: post-loop-memcpy-expansion:
-; MAX1024-NEXT: call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) [[DST1:%.*]], ptr addrspace(1) [[SRC]], i64 102, i1 false)
+; MAX1024-NEXT: call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) [[DST1:%.*]], ptr addrspace(1) [[SRC]], i64 282, i1 false)
; MAX1024-NEXT: ret void
; MAX1024: loop-memcpy-residual-header:
; MAX1024-NEXT: [[TMP16:%.*]] = icmp ne i64 [[TMP2]], 0
@@ -436,16 +436,16 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
; ALL-NEXT: [[TMP18:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST1:%.*]], i64 [[LOOP_INDEX]]
; ALL-NEXT: store <4 x i32> [[TMP17]], ptr addrspace(1) [[TMP18]], align 1
; ALL-NEXT: [[TMP19]] = add i64 [[LOOP_INDEX]], 1
-; ALL-NEXT: [[TMP20:%.*]] = icmp ult i64 [[TMP19]], 6
+; ALL-NEXT: [[TMP20:%.*]] = icmp ult i64 [[TMP19]], 17
; ALL-NEXT: br i1 [[TMP20]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; ALL: memcpy-split:
-; ALL-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, ptr addrspace(1) [[SRC]], i64 24
-; ALL-NEXT: [[TMP22:%.*]] = load i32, ptr addrspace(1) [[TMP21]], align 1
-; ALL-NEXT: [[TMP23:%.*]] = getelementptr inbounds i32, ptr addrspace(1) [[DST1]], i64 24
-; ALL-NEXT: store i32 [[TMP22]], ptr addrspace(1) [[TMP23]], align 1
-; ALL-NEXT: [[TMP24:%.*]] = getelementptr inbounds i16, ptr addrspace(1) [[SRC]], i64 50
+; ALL-NEXT: [[TMP21:%.*]] = getelementptr inbounds i64, ptr addrspace(1) [[SRC]], i64 34
+; ALL-NEXT: [[TMP22:%.*]] = load i64, ptr addrspace(1) [[TMP21]], align 1
+; ALL-NEXT: [[TMP23:%.*]] = getelementptr inbounds i64, ptr addrspace(1) [[DST1]], i64 34
+; ALL-NEXT: store i64 [[TMP22]], ptr addrspace(1) [[TMP23]], align 1
+; ALL-NEXT: [[TMP24:%.*]] = getelementptr inbounds i16, ptr addrspace(1) [[SRC]], i64 140
; ALL-NEXT: [[TMP25:%.*]] = load i16, ptr addrspace(1) [[TMP24]], align 1
-; ALL-NEXT: [[TMP26:%.*]] = getelementptr inbounds i16, ptr addrspace(1) [[DST1]], i64 50
+; ALL-NEXT: [[TMP26:%.*]] = getelementptr inbounds i16, ptr addrspace(1) [[DST1]], i64 140
; ALL-NEXT: store i16 [[TMP25]], ptr addrspace(1) [[TMP26]], align 1
; ALL-NEXT: ret void
; ALL: loop-memcpy-residual-header:
@@ -453,7 +453,7 @@ define amdgpu_kernel void @memcpy_multi_use_one_function_keep_small(ptr addrspac
; ALL-NEXT: br i1 [[TMP27]], label [[LOOP_MEMCPY_RESIDUAL]], label [[POST_LOOP_MEMCPY_EXPANSION]]
;
call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst0, ptr addrspace(1) %src, i64 %n, i1 false)
- call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst1, ptr addrspace(1) %src, i64 102, i1 false)
+ call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst1, ptr addrspace(1) %src, i64 282, i1 false)
ret void
}
@@ -462,12 +462,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1028(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, ptr addrspace(1) [[SRC]], i64 256
@@ -485,12 +485,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1025(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr addrspace(1) [[SRC]], i64 1024
@@ -508,12 +508,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1026(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr addrspace(1) [[SRC]], i64 512
@@ -531,12 +531,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1032(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr addrspace(1) [[SRC]], i64 128
@@ -554,12 +554,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1034(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr addrspace(1) [[SRC]], i64 128
@@ -581,12 +581,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1035(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr addrspace(1) [[SRC]], i64 128
@@ -612,12 +612,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1036(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr addrspace(1) [[SRC]], i64 128
@@ -639,12 +639,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1039(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr addrspace(1) [[SRC]], i64 128
@@ -697,12 +697,12 @@ define amdgpu_kernel void @memcpy_global_align4_global_align4_1027(ptr addrspace
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr addrspace(1) [[SRC]], i64 512
@@ -770,12 +770,12 @@ define amdgpu_kernel void @memcpy_private_align4_private_align4_1027(ptr addrspa
; OPT-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; OPT: load-store-loop:
; OPT-NEXT: [[LOOP_INDEX:%.*]] = phi i32 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
-; OPT-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(5) [[TMP1]], align 4
-; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(5) [[DST:%.*]], i32 [[LOOP_INDEX]]
-; OPT-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(5) [[TMP3]], align 4
+; OPT-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(5) [[SRC:%.*]], i32 [[LOOP_INDEX]]
+; OPT-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(5) [[TMP1]], align 4
+; OPT-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(5) [[DST:%.*]], i32 [[LOOP_INDEX]]
+; OPT-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(5) [[TMP3]], align 4
; OPT-NEXT: [[TMP4]] = add i32 [[LOOP_INDEX]], 1
-; OPT-NEXT: [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 64
+; OPT-NEXT: [[TMP5:%.*]] = icmp ult i32 [[TMP4]], 4
; OPT-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; OPT: memcpy-split:
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr addrspace(5) [[SRC]], i32 512
@@ -1203,26 +1203,26 @@ define amdgpu_kernel void @memcpy_global_align4_local_align4_variable(ptr addrsp
ret void
}
-define amdgpu_kernel void @memcpy_global_align4_global_align4_16(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
-; MAX1024-LABEL: @memcpy_global_align4_global_align4_16(
-; MAX1024-NEXT: call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 4 [[DST:%.*]], ptr addrspace(1) align 4 [[SRC:%.*]], i64 16, i1 false)
+define amdgpu_kernel void @memcpy_global_align4_global_align4_256(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
+; MAX1024-LABEL: @memcpy_global_align4_global_align4_256(
+; MAX1024-NEXT: call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 4 [[DST:%.*]], ptr addrspace(1) align 4 [[SRC:%.*]], i64 256, i1 false)
; MAX1024-NEXT: ret void
;
-; ALL-LABEL: @memcpy_global_align4_global_align4_16(
+; ALL-LABEL: @memcpy_global_align4_global_align4_256(
; ALL-NEXT: br label [[LOAD_STORE_LOOP:%.*]]
; ALL: load-store-loop:
; ALL-NEXT: [[LOOP_INDEX:%.*]] = phi i64 [ 0, [[TMP0:%.*]] ], [ [[TMP4:%.*]], [[LOAD_STORE_LOOP]] ]
-; ALL-NEXT: [[TMP1:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
-; ALL-NEXT: [[TMP2:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP1]], align 4
-; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
-; ALL-NEXT: store <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
+; ALL-NEXT: [[TMP1:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[SRC:%.*]], i64 [[LOOP_INDEX]]
+; ALL-NEXT: [[TMP2:%.*]] = load <64 x i32>, ptr addrspace(1) [[TMP1]], align 4
+; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <64 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
+; ALL-NEXT: store <64 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 4
; ALL-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
; ALL-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 1
; ALL-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; ALL: memcpy-split:
; ALL-NEXT: ret void
;
- call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 4 %dst, ptr addrspace(1) align 4 %src, i64 16, i1 false)
+ call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) align 4 %dst, ptr addrspace(1) align 4 %src, i64 256, i1 false)
ret void
}
@@ -2101,7 +2101,7 @@ define amdgpu_kernel void @memmove_global_align4_static_residual_empty(ptr addrs
; OPT-NEXT: [[COMPARE_SRC_DST:%.*]] = icmp ult ptr addrspace(1) [[SRC:%.*]], [[DST:%.*]]
; OPT-NEXT: br i1 [[COMPARE_SRC_DST]], label [[MEMMOVE_BWD_LOOP:%.*]], label [[MEMMOVE_FWD_LOOP:%.*]]
; OPT: memmove_bwd_loop:
-; OPT-NEXT: [[TMP1:%.*]] = phi i64 [ [[BWD_INDEX:%.*]], [[MEMMOVE_BWD_LOOP]] ], [ 65, [[TMP0:%.*]] ]
+; OPT-NEXT: [[TMP1:%.*]] = phi i64 [ [[BWD_INDEX:%.*]], [[MEMMOVE_BWD_LOOP]] ], [ 80, [[TMP0:%.*]] ]
; OPT-NEXT: [[BWD_INDEX]] = sub i64 [[TMP1]], 1
; OPT-NEXT: [[TMP2:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC]], i64 [[BWD_INDEX]]
; OPT-NEXT: [[ELEMENT:%.*]] = load <4 x i32>, ptr addrspace(1) [[TMP2]], align 1
@@ -2116,12 +2116,12 @@ define amdgpu_kernel void @memmove_global_align4_static_residual_empty(ptr addrs
; OPT-NEXT: [[TMP6:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST]], i64 [[FWD_INDEX]]
; OPT-NEXT: store <4 x i32> [[ELEMENT1]], ptr addrspace(1) [[TMP6]], align 1
; OPT-NEXT: [[TMP7]] = add i64 [[FWD_INDEX]], 1
-; OPT-NEXT: [[TMP8:%.*]] = icmp eq i64 [[TMP7]], 65
+; OPT-NEXT: [[TMP8:%.*]] = icmp eq i64 [[TMP7]], 80
; OPT-NEXT: br i1 [[TMP8]], label [[MEMMOVE_DONE]], label [[MEMMOVE_FWD_LOOP]]
; OPT: memmove_done:
; OPT-NEXT: ret void
;
- call void @llvm.memmove.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 1040, i1 false)
+ call void @llvm.memmove.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 1280, i1 false)
ret void
}
@@ -2234,14 +2234,14 @@ entry:
define amdgpu_kernel void @memmove_volatile(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
; MAX1024-LABEL: @memmove_volatile(
-; MAX1024-NEXT: call void @llvm.memmove.p1.p1.i64(ptr addrspace(1) [[DST:%.*]], ptr addrspace(1) [[SRC:%.*]], i64 64, i1 true)
+; MAX1024-NEXT: call void @llvm.memmove.p1.p1.i64(ptr addrspace(1) [[DST:%.*]], ptr addrspace(1) [[SRC:%.*]], i64 512, i1 true)
; MAX1024-NEXT: ret void
;
; ALL-LABEL: @memmove_volatile(
; ALL-NEXT: [[COMPARE_SRC_DST:%.*]] = icmp ult ptr addrspace(1) [[SRC:%.*]], [[DST:%.*]]
; ALL-NEXT: br i1 [[COMPARE_SRC_DST]], label [[MEMMOVE_BWD_LOOP:%.*]], label [[MEMMOVE_FWD_LOOP:%.*]]
; ALL: memmove_bwd_loop:
-; ALL-NEXT: [[TMP1:%.*]] = phi i64 [ [[BWD_INDEX:%.*]], [[MEMMOVE_BWD_LOOP]] ], [ 4, [[TMP0:%.*]] ]
+; ALL-NEXT: [[TMP1:%.*]] = phi i64 [ [[BWD_INDEX:%.*]], [[MEMMOVE_BWD_LOOP]] ], [ 32, [[TMP0:%.*]] ]
; ALL-NEXT: [[BWD_INDEX]] = sub i64 [[TMP1]], 1
; ALL-NEXT: [[TMP2:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[SRC]], i64 [[BWD_INDEX]]
; ALL-NEXT: [[ELEMENT:%.*]] = load volatile <4 x i32>, ptr addrspace(1) [[TMP2]], align 1
@@ -2256,18 +2256,18 @@ define amdgpu_kernel void @memmove_volatile(ptr addrspace(1) %dst, ptr addrspace
; ALL-NEXT: [[TMP6:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST]], i64 [[FWD_INDEX]]
; ALL-NEXT: store volatile <4 x i32> [[ELEMENT1]], ptr addrspace(1) [[TMP6]], align 1
; ALL-NEXT: [[TMP7]] = add i64 [[FWD_INDEX]], 1
-; ALL-NEXT: [[TMP8:%.*]] = icmp eq i64 [[TMP7]], 4
+; ALL-NEXT: [[TMP8:%.*]] = icmp eq i64 [[TMP7]], 32
; ALL-NEXT: br i1 [[TMP8]], label [[MEMMOVE_DONE]], label [[MEMMOVE_FWD_LOOP]]
; ALL: memmove_done:
; ALL-NEXT: ret void
;
- call void @llvm.memmove.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 64, i1 true)
+ call void @llvm.memmove.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 512, i1 true)
ret void
}
define amdgpu_kernel void @memcpy_volatile(ptr addrspace(1) %dst, ptr addrspace(1) %src) #0 {
; MAX1024-LABEL: @memcpy_volatile(
-; MAX1024-NEXT: call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) [[DST:%.*]], ptr addrspace(1) [[SRC:%.*]], i64 64, i1 true)
+; MAX1024-NEXT: call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) [[DST:%.*]], ptr addrspace(1) [[SRC:%.*]], i64 512, i1 true)
; MAX1024-NEXT: ret void
;
; ALL-LABEL: @memcpy_volatile(
@@ -2279,12 +2279,12 @@ define amdgpu_kernel void @memcpy_volatile(ptr addrspace(1) %dst, ptr addrspace(
; ALL-NEXT: [[TMP3:%.*]] = getelementptr inbounds <4 x i32>, ptr addrspace(1) [[DST:%.*]], i64 [[LOOP_INDEX]]
; ALL-NEXT: store volatile <4 x i32> [[TMP2]], ptr addrspace(1) [[TMP3]], align 1
; ALL-NEXT: [[TMP4]] = add i64 [[LOOP_INDEX]], 1
-; ALL-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 4
+; ALL-NEXT: [[TMP5:%.*]] = icmp ult i64 [[TMP4]], 32
; ALL-NEXT: br i1 [[TMP5]], label [[LOAD_STORE_LOOP]], label [[MEMCPY_SPLIT:%.*]]
; ALL: memcpy-split:
; ALL-NEXT: ret void
;
- call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 64, i1 true)
+ call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) %dst, ptr addrspace(1) %src, i64 512, i1 true)
ret void
}
diff --git a/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
new file mode 100644
index 00000000000000..f0a31a3e62f43e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
@@ -0,0 +1,2663 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 %s -o - | FileCheck %s
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -mattr=-unaligned-access-mode %s -o - | FileCheck -check-prefix=ALIGNED %s
+
+; Check that LowerMemIntrinsics lowers memcpy and memmove with large constant
+; copy sizes into loops with multiple load/store pairs per iteration.
+
+
+; memcpy for address spaces 0, 1, 4, 5
+
+define void @memcpy_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0) align 1 readonly %src) {
+; CHECK-LABEL: memcpy_p0_p0_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b64 s[6:7], 0
+; CHECK-NEXT: .LBB0_1: ; %load-store-loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: flat_load_dwordx4 v[4:7], v[2:3] offset:224
+; CHECK-NEXT: flat_load_dwordx4 v[8:11], v[2:3] offset:240
+; CHECK-NEXT: flat_load_dwordx4 v[12:15], v[2:3] offset:192
+; CHECK-NEXT: flat_load_dwordx4 v[16:19], v[2:3] offset:208
+; CHECK-NEXT: flat_load_dwordx4 v[20:23], v[2:3] offset:160
+; CHECK-NEXT: flat_load_dwordx4 v[24:27], v[2:3] offset:176
+; CHECK-NEXT: flat_load_dwordx4 v[28:31], v[2:3] offset:128
+; CHECK-NEXT: flat_load_dwordx4 v[32:35], v[2:3] offset:144
+; CHECK-NEXT: flat_load_dwordx4 v[36:39], v[2:3] offset:96
+; CHECK-NEXT: flat_load_dwordx4 v[48:51], v[2:3] offset:112
+; CHECK-NEXT: flat_load_dwordx4 v[52:55], v[2:3] offset:64
+; CHECK-NEXT: flat_load_dwordx4 v[64:67], v[2:3] offset:80
+; CHECK-NEXT: flat_load_dwordx4 v[68:71], v[2:3] offset:32
+; CHECK-NEXT: flat_load_dwordx4 v[80:83], v[2:3] offset:48
+; CHECK-NEXT: flat_load_dwordx4 v[84:87], v[2:3]
+; CHECK-NEXT: flat_load_dwordx4 v[96:99], v[2:3] offset:16
+; CHECK-NEXT: s_add_u32 s6, s6, 1
+; CHECK-NEXT: s_addc_u32 s7, s7, 0
+; CHECK-NEXT: v_add_co_u32 v2, vcc_lo, 0x100, v2
+; CHECK-NEXT: v_cmp_lt_u64_e64 s4, s[6:7], 8
+; CHECK-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v3, vcc_lo
+; CHECK-NEXT: s_waitcnt vmcnt(15) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[4:7] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(14) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[8:11] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(13) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[12:15] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(12) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[16:19] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(11) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[20:23] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(10) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[24:27] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(9) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[28:31] offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(8) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[32:35] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(7) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[36:39] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(6) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[48:51] offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(5) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[52:55] offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(4) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[64:67] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(3) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[68:71] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(2) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[80:83] offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(1) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[84:87]
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[96:99] offset:16
+; CHECK-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; CHECK-NEXT: v_add_co_u32 v0, s4, 0x100, v0
+; CHECK-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; CHECK-NEXT: s_cbranch_vccnz .LBB0_1
+; CHECK-NEXT: ; %bb.2: ; %memcpy-split
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memcpy_p0_p0_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b64 s[6:7], 0
+; ALIGNED-NEXT: .LBB0_1: ; %load-store-loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: flat_load_ubyte v4, v[2:3] offset:2
+; ALIGNED-NEXT: flat_load_ubyte v5, v[2:3] offset:3
+; ALIGNED-NEXT: flat_load_ubyte v6, v[2:3]
+; ALIGNED-NEXT: flat_load_ubyte v7, v[2:3] offset:1
+; ALIGNED-NEXT: flat_load_ubyte v8, v[2:3] offset:6
+; ALIGNED-NEXT: flat_load_ubyte v9, v[2:3] offset:7
+; ALIGNED-NEXT: flat_load_ubyte v10, v[2:3] offset:4
+; ALIGNED-NEXT: flat_load_ubyte v11, v[2:3] offset:5
+; ALIGNED-NEXT: flat_load_ubyte v12, v[2:3] offset:10
+; ALIGNED-NEXT: flat_load_ubyte v13, v[2:3] offset:11
+; ALIGNED-NEXT: flat_load_ubyte v14, v[2:3] offset:8
+; ALIGNED-NEXT: flat_load_ubyte v15, v[2:3] offset:9
+; ALIGNED-NEXT: flat_load_ubyte v16, v[2:3] offset:14
+; ALIGNED-NEXT: flat_load_ubyte v17, v[2:3] offset:15
+; ALIGNED-NEXT: flat_load_ubyte v18, v[2:3] offset:12
+; ALIGNED-NEXT: flat_load_ubyte v19, v[2:3] offset:13
+; ALIGNED-NEXT: s_add_u32 s6, s6, 1
+; ALIGNED-NEXT: s_addc_u32 s7, s7, 0
+; ALIGNED-NEXT: v_add_co_u32 v2, vcc_lo, v2, 16
+; ALIGNED-NEXT: v_cmp_gt_u64_e64 s4, 0x80, s[6:7]
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v3, vcc_lo
+; ALIGNED-NEXT: s_waitcnt vmcnt(15) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v4 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v5 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v6
+; ALIGNED-NEXT: s_waitcnt vmcnt(12) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v7 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v8 offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v9 offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v10 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v11 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v12 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v13 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v14 offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v15 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v16 offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v17 offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v18 offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v19 offset:13
+; ALIGNED-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; ALIGNED-NEXT: v_add_co_u32 v0, s4, v0, 16
+; ALIGNED-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; ALIGNED-NEXT: s_cbranch_vccnz .LBB0_1
+; ALIGNED-NEXT: ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memcpy.p0.p0.i64(ptr addrspace(0) noundef nonnull align 1 %dst, ptr addrspace(0) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memcpy_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align 1 readonly %src) {
+; CHECK-LABEL: memcpy_p1_p1_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b64 s[6:7], 0
+; CHECK-NEXT: .LBB1_1: ; %load-store-loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: global_load_dwordx4 v[4:7], v[2:3], off offset:224
+; CHECK-NEXT: global_load_dwordx4 v[8:11], v[2:3], off offset:240
+; CHECK-NEXT: global_load_dwordx4 v[12:15], v[2:3], off offset:192
+; CHECK-NEXT: global_load_dwordx4 v[16:19], v[2:3], off offset:208
+; CHECK-NEXT: global_load_dwordx4 v[20:23], v[2:3], off offset:160
+; CHECK-NEXT: global_load_dwordx4 v[24:27], v[2:3], off offset:176
+; CHECK-NEXT: global_load_dwordx4 v[28:31], v[2:3], off offset:128
+; CHECK-NEXT: global_load_dwordx4 v[32:35], v[2:3], off offset:144
+; CHECK-NEXT: global_load_dwordx4 v[36:39], v[2:3], off offset:96
+; CHECK-NEXT: global_load_dwordx4 v[48:51], v[2:3], off offset:112
+; CHECK-NEXT: global_load_dwordx4 v[52:55], v[2:3], off offset:64
+; CHECK-NEXT: global_load_dwordx4 v[64:67], v[2:3], off offset:80
+; CHECK-NEXT: global_load_dwordx4 v[68:71], v[2:3], off offset:32
+; CHECK-NEXT: global_load_dwordx4 v[80:83], v[2:3], off offset:48
+; CHECK-NEXT: global_load_dwordx4 v[84:87], v[2:3], off
+; CHECK-NEXT: global_load_dwordx4 v[96:99], v[2:3], off offset:16
+; CHECK-NEXT: s_add_u32 s6, s6, 1
+; CHECK-NEXT: s_addc_u32 s7, s7, 0
+; CHECK-NEXT: v_add_co_u32 v2, vcc_lo, 0x100, v2
+; CHECK-NEXT: v_cmp_lt_u64_e64 s4, s[6:7], 8
+; CHECK-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v3, vcc_lo
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[4:7], off offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[8:11], off offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[12:15], off offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[16:19], off offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[20:23], off offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[24:27], off offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[28:31], off offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[32:35], off offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[36:39], off offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[48:51], off offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[52:55], off offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[64:67], off offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[68:71], off offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[80:83], off offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[84:87], off
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_dwordx4 v[0:1], v[96:99], off offset:16
+; CHECK-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; CHECK-NEXT: v_add_co_u32 v0, s4, 0x100, v0
+; CHECK-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; CHECK-NEXT: s_cbranch_vccnz .LBB1_1
+; CHECK-NEXT: ; %bb.2: ; %memcpy-split
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memcpy_p1_p1_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b64 s[6:7], 0
+; ALIGNED-NEXT: .LBB1_1: ; %load-store-loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: global_load_ubyte v4, v[2:3], off offset:2
+; ALIGNED-NEXT: global_load_ubyte v5, v[2:3], off offset:3
+; ALIGNED-NEXT: global_load_ubyte v6, v[2:3], off
+; ALIGNED-NEXT: global_load_ubyte v7, v[2:3], off offset:1
+; ALIGNED-NEXT: global_load_ubyte v8, v[2:3], off offset:6
+; ALIGNED-NEXT: global_load_ubyte v9, v[2:3], off offset:7
+; ALIGNED-NEXT: global_load_ubyte v10, v[2:3], off offset:4
+; ALIGNED-NEXT: global_load_ubyte v11, v[2:3], off offset:5
+; ALIGNED-NEXT: global_load_ubyte v12, v[2:3], off offset:10
+; ALIGNED-NEXT: global_load_ubyte v13, v[2:3], off offset:11
+; ALIGNED-NEXT: global_load_ubyte v14, v[2:3], off offset:8
+; ALIGNED-NEXT: global_load_ubyte v15, v[2:3], off offset:9
+; ALIGNED-NEXT: global_load_ubyte v16, v[2:3], off offset:14
+; ALIGNED-NEXT: global_load_ubyte v17, v[2:3], off offset:15
+; ALIGNED-NEXT: global_load_ubyte v18, v[2:3], off offset:12
+; ALIGNED-NEXT: global_load_ubyte v19, v[2:3], off offset:13
+; ALIGNED-NEXT: s_add_u32 s6, s6, 1
+; ALIGNED-NEXT: s_addc_u32 s7, s7, 0
+; ALIGNED-NEXT: v_add_co_u32 v2, vcc_lo, v2, 16
+; ALIGNED-NEXT: v_cmp_gt_u64_e64 s4, 0x80, s[6:7]
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v3, vcc_lo
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: global_store_byte v[0:1], v4, off offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: global_store_byte v[0:1], v5, off offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: global_store_byte v[0:1], v6, off
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: global_store_byte v[0:1], v7, off offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: global_store_byte v[0:1], v8, off offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: global_store_byte v[0:1], v9, off offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: global_store_byte v[0:1], v10, off offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: global_store_byte v[0:1], v11, off offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: global_store_byte v[0:1], v12, off offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: global_store_byte v[0:1], v13, off offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: global_store_byte v[0:1], v14, off offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: global_store_byte v[0:1], v15, off offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: global_store_byte v[0:1], v16, off offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: global_store_byte v[0:1], v17, off offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: global_store_byte v[0:1], v18, off offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: global_store_byte v[0:1], v19, off offset:13
+; ALIGNED-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; ALIGNED-NEXT: v_add_co_u32 v0, s4, v0, 16
+; ALIGNED-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; ALIGNED-NEXT: s_cbranch_vccnz .LBB1_1
+; ALIGNED-NEXT: ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) noundef nonnull align 1 %dst, ptr addrspace(1) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memcpy_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4) align 1 readonly %src) {
+; CHECK-LABEL: memcpy_p0_p4_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b64 s[6:7], 0
+; CHECK-NEXT: .LBB2_1: ; %load-store-loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: global_load_dwordx4 v[4:7], v[2:3], off offset:240
+; CHECK-NEXT: global_load_dwordx4 v[8:11], v[2:3], off offset:224
+; CHECK-NEXT: global_load_dwordx4 v[12:15], v[2:3], off offset:208
+; CHECK-NEXT: global_load_dwordx4 v[16:19], v[2:3], off offset:192
+; CHECK-NEXT: global_load_dwordx4 v[20:23], v[2:3], off offset:176
+; CHECK-NEXT: global_load_dwordx4 v[24:27], v[2:3], off offset:160
+; CHECK-NEXT: global_load_dwordx4 v[28:31], v[2:3], off offset:144
+; CHECK-NEXT: global_load_dwordx4 v[32:35], v[2:3], off offset:128
+; CHECK-NEXT: global_load_dwordx4 v[36:39], v[2:3], off offset:112
+; CHECK-NEXT: global_load_dwordx4 v[48:51], v[2:3], off offset:96
+; CHECK-NEXT: global_load_dwordx4 v[52:55], v[2:3], off offset:80
+; CHECK-NEXT: global_load_dwordx4 v[64:67], v[2:3], off offset:64
+; CHECK-NEXT: global_load_dwordx4 v[68:71], v[2:3], off offset:48
+; CHECK-NEXT: global_load_dwordx4 v[80:83], v[2:3], off offset:32
+; CHECK-NEXT: global_load_dwordx4 v[84:87], v[2:3], off offset:16
+; CHECK-NEXT: global_load_dwordx4 v[96:99], v[2:3], off
+; CHECK-NEXT: s_add_u32 s6, s6, 1
+; CHECK-NEXT: s_addc_u32 s7, s7, 0
+; CHECK-NEXT: v_add_co_u32 v2, vcc_lo, 0x100, v2
+; CHECK-NEXT: v_cmp_lt_u64_e64 s4, s[6:7], 8
+; CHECK-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v3, vcc_lo
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[4:7] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[8:11] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[12:15] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[16:19] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[20:23] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[24:27] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[28:31] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[32:35] offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[36:39] offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[48:51] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[52:55] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[64:67] offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[68:71] offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[80:83] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[84:87] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[96:99]
+; CHECK-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; CHECK-NEXT: v_add_co_u32 v0, s4, 0x100, v0
+; CHECK-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; CHECK-NEXT: s_cbranch_vccnz .LBB2_1
+; CHECK-NEXT: ; %bb.2: ; %memcpy-split
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memcpy_p0_p4_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b64 s[6:7], 0
+; ALIGNED-NEXT: .LBB2_1: ; %load-store-loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: global_load_ubyte v4, v[2:3], off offset:5
+; ALIGNED-NEXT: global_load_ubyte v5, v[2:3], off offset:6
+; ALIGNED-NEXT: global_load_ubyte v6, v[2:3], off offset:7
+; ALIGNED-NEXT: global_load_ubyte v7, v[2:3], off offset:3
+; ALIGNED-NEXT: global_load_ubyte v8, v[2:3], off offset:2
+; ALIGNED-NEXT: global_load_ubyte v9, v[2:3], off offset:1
+; ALIGNED-NEXT: global_load_ubyte v10, v[2:3], off
+; ALIGNED-NEXT: global_load_ubyte v11, v[2:3], off offset:4
+; ALIGNED-NEXT: global_load_ubyte v12, v[2:3], off offset:13
+; ALIGNED-NEXT: global_load_ubyte v13, v[2:3], off offset:14
+; ALIGNED-NEXT: global_load_ubyte v14, v[2:3], off offset:15
+; ALIGNED-NEXT: global_load_ubyte v15, v[2:3], off offset:11
+; ALIGNED-NEXT: global_load_ubyte v16, v[2:3], off offset:10
+; ALIGNED-NEXT: global_load_ubyte v17, v[2:3], off offset:9
+; ALIGNED-NEXT: global_load_ubyte v18, v[2:3], off offset:8
+; ALIGNED-NEXT: global_load_ubyte v19, v[2:3], off offset:12
+; ALIGNED-NEXT: s_add_u32 s6, s6, 1
+; ALIGNED-NEXT: s_addc_u32 s7, s7, 0
+; ALIGNED-NEXT: v_add_co_u32 v2, vcc_lo, v2, 16
+; ALIGNED-NEXT: v_cmp_gt_u64_e64 s4, 0x80, s[6:7]
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v3, vcc_lo
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v7 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v8 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v9 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v10
+; ALIGNED-NEXT: flat_store_byte v[0:1], v6 offset:7
+; ALIGNED-NEXT: flat_store_byte v[0:1], v5 offset:6
+; ALIGNED-NEXT: flat_store_byte v[0:1], v4 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v11 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v15 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v16 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v17 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v18 offset:8
+; ALIGNED-NEXT: flat_store_byte v[0:1], v14 offset:15
+; ALIGNED-NEXT: flat_store_byte v[0:1], v13 offset:14
+; ALIGNED-NEXT: flat_store_byte v[0:1], v12 offset:13
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v19 offset:12
+; ALIGNED-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; ALIGNED-NEXT: v_add_co_u32 v0, s4, v0, 16
+; ALIGNED-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; ALIGNED-NEXT: s_cbranch_vccnz .LBB2_1
+; ALIGNED-NEXT: ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memcpy.p0.p4.i64(ptr addrspace(0) noundef nonnull align 1 %dst, ptr addrspace(4) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memcpy_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5) align 1 readonly %src) {
+; CHECK-LABEL: memcpy_p5_p5_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b64 s[4:5], 0
+; CHECK-NEXT: .LBB3_1: ; %load-store-loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0x3e
+; CHECK-NEXT: buffer_load_dword v2, v1, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_load_dword v3, v1, s[0:3], 0 offen offset:248
+; CHECK-NEXT: buffer_load_dword v4, v1, s[0:3], 0 offen offset:244
+; CHECK-NEXT: buffer_load_dword v5, v1, s[0:3], 0 offen offset:240
+; CHECK-NEXT: buffer_load_dword v6, v1, s[0:3], 0 offen offset:236
+; CHECK-NEXT: buffer_load_dword v7, v1, s[0:3], 0 offen offset:232
+; CHECK-NEXT: buffer_load_dword v8, v1, s[0:3], 0 offen offset:228
+; CHECK-NEXT: buffer_load_dword v9, v1, s[0:3], 0 offen offset:224
+; CHECK-NEXT: buffer_load_dword v10, v1, s[0:3], 0 offen offset:220
+; CHECK-NEXT: buffer_load_dword v11, v1, s[0:3], 0 offen offset:216
+; CHECK-NEXT: buffer_load_dword v12, v1, s[0:3], 0 offen offset:212
+; CHECK-NEXT: buffer_load_dword v13, v1, s[0:3], 0 offen offset:208
+; CHECK-NEXT: buffer_load_dword v14, v1, s[0:3], 0 offen offset:204
+; CHECK-NEXT: buffer_load_dword v15, v1, s[0:3], 0 offen offset:200
+; CHECK-NEXT: buffer_load_dword v16, v1, s[0:3], 0 offen offset:196
+; CHECK-NEXT: buffer_load_dword v17, v1, s[0:3], 0 offen offset:192
+; CHECK-NEXT: buffer_load_dword v18, v1, s[0:3], 0 offen offset:188
+; CHECK-NEXT: buffer_load_dword v19, v1, s[0:3], 0 offen offset:184
+; CHECK-NEXT: buffer_load_dword v20, v1, s[0:3], 0 offen offset:180
+; CHECK-NEXT: buffer_load_dword v21, v1, s[0:3], 0 offen offset:176
+; CHECK-NEXT: buffer_load_dword v22, v1, s[0:3], 0 offen offset:172
+; CHECK-NEXT: buffer_load_dword v23, v1, s[0:3], 0 offen offset:168
+; CHECK-NEXT: buffer_load_dword v24, v1, s[0:3], 0 offen offset:164
+; CHECK-NEXT: buffer_load_dword v25, v1, s[0:3], 0 offen offset:160
+; CHECK-NEXT: buffer_load_dword v26, v1, s[0:3], 0 offen offset:156
+; CHECK-NEXT: buffer_load_dword v27, v1, s[0:3], 0 offen offset:152
+; CHECK-NEXT: buffer_load_dword v28, v1, s[0:3], 0 offen offset:148
+; CHECK-NEXT: buffer_load_dword v29, v1, s[0:3], 0 offen offset:144
+; CHECK-NEXT: buffer_load_dword v30, v1, s[0:3], 0 offen offset:140
+; CHECK-NEXT: buffer_load_dword v31, v1, s[0:3], 0 offen offset:136
+; CHECK-NEXT: buffer_load_dword v32, v1, s[0:3], 0 offen offset:132
+; CHECK-NEXT: buffer_load_dword v33, v1, s[0:3], 0 offen offset:128
+; CHECK-NEXT: buffer_load_dword v34, v1, s[0:3], 0 offen offset:124
+; CHECK-NEXT: buffer_load_dword v35, v1, s[0:3], 0 offen offset:120
+; CHECK-NEXT: buffer_load_dword v36, v1, s[0:3], 0 offen offset:116
+; CHECK-NEXT: buffer_load_dword v37, v1, s[0:3], 0 offen offset:112
+; CHECK-NEXT: buffer_load_dword v38, v1, s[0:3], 0 offen offset:108
+; CHECK-NEXT: buffer_load_dword v39, v1, s[0:3], 0 offen offset:104
+; CHECK-NEXT: buffer_load_dword v48, v1, s[0:3], 0 offen offset:100
+; CHECK-NEXT: buffer_load_dword v49, v1, s[0:3], 0 offen offset:96
+; CHECK-NEXT: buffer_load_dword v50, v1, s[0:3], 0 offen offset:92
+; CHECK-NEXT: buffer_load_dword v51, v1, s[0:3], 0 offen offset:88
+; CHECK-NEXT: buffer_load_dword v52, v1, s[0:3], 0 offen offset:84
+; CHECK-NEXT: buffer_load_dword v53, v1, s[0:3], 0 offen offset:80
+; CHECK-NEXT: buffer_load_dword v54, v1, s[0:3], 0 offen offset:76
+; CHECK-NEXT: buffer_load_dword v55, v1, s[0:3], 0 offen offset:72
+; CHECK-NEXT: buffer_load_dword v64, v1, s[0:3], 0 offen offset:68
+; CHECK-NEXT: buffer_load_dword v65, v1, s[0:3], 0 offen offset:64
+; CHECK-NEXT: buffer_load_dword v66, v1, s[0:3], 0 offen offset:60
+; CHECK-NEXT: buffer_load_dword v67, v1, s[0:3], 0 offen offset:56
+; CHECK-NEXT: buffer_load_dword v68, v1, s[0:3], 0 offen offset:52
+; CHECK-NEXT: buffer_load_dword v69, v1, s[0:3], 0 offen offset:48
+; CHECK-NEXT: buffer_load_dword v70, v1, s[0:3], 0 offen offset:44
+; CHECK-NEXT: buffer_load_dword v71, v1, s[0:3], 0 offen offset:40
+; CHECK-NEXT: buffer_load_dword v80, v1, s[0:3], 0 offen offset:36
+; CHECK-NEXT: buffer_load_dword v81, v1, s[0:3], 0 offen offset:32
+; CHECK-NEXT: buffer_load_dword v82, v1, s[0:3], 0 offen offset:28
+; CHECK-NEXT: buffer_load_dword v83, v1, s[0:3], 0 offen offset:24
+; CHECK-NEXT: buffer_load_dword v84, v1, s[0:3], 0 offen offset:20
+; CHECK-NEXT: buffer_load_dword v85, v1, s[0:3], 0 offen offset:16
+; CHECK-NEXT: buffer_load_dword v86, v1, s[0:3], 0 offen offset:12
+; CHECK-NEXT: buffer_load_dword v87, v1, s[0:3], 0 offen offset:8
+; CHECK-NEXT: buffer_load_dword v96, v1, s[0:3], 0 offen offset:4
+; CHECK-NEXT: buffer_load_dword v97, v1, s[0:3], 0 offen
+; CHECK-NEXT: s_add_u32 s4, s4, 1
+; CHECK-NEXT: s_addc_u32 s5, s5, 0
+; CHECK-NEXT: v_add_nc_u32_e32 v1, 0x100, v1
+; CHECK-NEXT: v_cmp_lt_u64_e64 s6, s[4:5], 8
+; CHECK-NEXT: s_waitcnt vmcnt(62)
+; CHECK-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_store_dword v3, v0, s[0:3], 0 offen offset:248
+; CHECK-NEXT: s_waitcnt vmcnt(61)
+; CHECK-NEXT: buffer_store_dword v4, v0, s[0:3], 0 offen offset:244
+; CHECK-NEXT: s_waitcnt vmcnt(60)
+; CHECK-NEXT: buffer_store_dword v5, v0, s[0:3], 0 offen offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(59)
+; CHECK-NEXT: buffer_store_dword v6, v0, s[0:3], 0 offen offset:236
+; CHECK-NEXT: s_waitcnt vmcnt(58)
+; CHECK-NEXT: buffer_store_dword v7, v0, s[0:3], 0 offen offset:232
+; CHECK-NEXT: s_waitcnt vmcnt(57)
+; CHECK-NEXT: buffer_store_dword v8, v0, s[0:3], 0 offen offset:228
+; CHECK-NEXT: s_waitcnt vmcnt(56)
+; CHECK-NEXT: buffer_store_dword v9, v0, s[0:3], 0 offen offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(55)
+; CHECK-NEXT: buffer_store_dword v10, v0, s[0:3], 0 offen offset:220
+; CHECK-NEXT: s_waitcnt vmcnt(54)
+; CHECK-NEXT: buffer_store_dword v11, v0, s[0:3], 0 offen offset:216
+; CHECK-NEXT: s_waitcnt vmcnt(53)
+; CHECK-NEXT: buffer_store_dword v12, v0, s[0:3], 0 offen offset:212
+; CHECK-NEXT: s_waitcnt vmcnt(52)
+; CHECK-NEXT: buffer_store_dword v13, v0, s[0:3], 0 offen offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(51)
+; CHECK-NEXT: buffer_store_dword v14, v0, s[0:3], 0 offen offset:204
+; CHECK-NEXT: s_waitcnt vmcnt(50)
+; CHECK-NEXT: buffer_store_dword v15, v0, s[0:3], 0 offen offset:200
+; CHECK-NEXT: s_waitcnt vmcnt(49)
+; CHECK-NEXT: buffer_store_dword v16, v0, s[0:3], 0 offen offset:196
+; CHECK-NEXT: s_waitcnt vmcnt(48)
+; CHECK-NEXT: buffer_store_dword v17, v0, s[0:3], 0 offen offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(47)
+; CHECK-NEXT: buffer_store_dword v18, v0, s[0:3], 0 offen offset:188
+; CHECK-NEXT: s_waitcnt vmcnt(46)
+; CHECK-NEXT: buffer_store_dword v19, v0, s[0:3], 0 offen offset:184
+; CHECK-NEXT: s_waitcnt vmcnt(45)
+; CHECK-NEXT: buffer_store_dword v20, v0, s[0:3], 0 offen offset:180
+; CHECK-NEXT: s_waitcnt vmcnt(44)
+; CHECK-NEXT: buffer_store_dword v21, v0, s[0:3], 0 offen offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(43)
+; CHECK-NEXT: buffer_store_dword v22, v0, s[0:3], 0 offen offset:172
+; CHECK-NEXT: s_waitcnt vmcnt(42)
+; CHECK-NEXT: buffer_store_dword v23, v0, s[0:3], 0 offen offset:168
+; CHECK-NEXT: s_waitcnt vmcnt(41)
+; CHECK-NEXT: buffer_store_dword v24, v0, s[0:3], 0 offen offset:164
+; CHECK-NEXT: s_waitcnt vmcnt(40)
+; CHECK-NEXT: buffer_store_dword v25, v0, s[0:3], 0 offen offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(39)
+; CHECK-NEXT: buffer_store_dword v26, v0, s[0:3], 0 offen offset:156
+; CHECK-NEXT: s_waitcnt vmcnt(38)
+; CHECK-NEXT: buffer_store_dword v27, v0, s[0:3], 0 offen offset:152
+; CHECK-NEXT: s_waitcnt vmcnt(37)
+; CHECK-NEXT: buffer_store_dword v28, v0, s[0:3], 0 offen offset:148
+; CHECK-NEXT: s_waitcnt vmcnt(36)
+; CHECK-NEXT: buffer_store_dword v29, v0, s[0:3], 0 offen offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(35)
+; CHECK-NEXT: buffer_store_dword v30, v0, s[0:3], 0 offen offset:140
+; CHECK-NEXT: s_waitcnt vmcnt(34)
+; CHECK-NEXT: buffer_store_dword v31, v0, s[0:3], 0 offen offset:136
+; CHECK-NEXT: s_waitcnt vmcnt(33)
+; CHECK-NEXT: buffer_store_dword v32, v0, s[0:3], 0 offen offset:132
+; CHECK-NEXT: s_waitcnt vmcnt(32)
+; CHECK-NEXT: buffer_store_dword v33, v0, s[0:3], 0 offen offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(31)
+; CHECK-NEXT: buffer_store_dword v34, v0, s[0:3], 0 offen offset:124
+; CHECK-NEXT: s_waitcnt vmcnt(30)
+; CHECK-NEXT: buffer_store_dword v35, v0, s[0:3], 0 offen offset:120
+; CHECK-NEXT: s_waitcnt vmcnt(29)
+; CHECK-NEXT: buffer_store_dword v36, v0, s[0:3], 0 offen offset:116
+; CHECK-NEXT: s_waitcnt vmcnt(28)
+; CHECK-NEXT: buffer_store_dword v37, v0, s[0:3], 0 offen offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(27)
+; CHECK-NEXT: buffer_store_dword v38, v0, s[0:3], 0 offen offset:108
+; CHECK-NEXT: s_waitcnt vmcnt(26)
+; CHECK-NEXT: buffer_store_dword v39, v0, s[0:3], 0 offen offset:104
+; CHECK-NEXT: s_waitcnt vmcnt(25)
+; CHECK-NEXT: buffer_store_dword v48, v0, s[0:3], 0 offen offset:100
+; CHECK-NEXT: s_waitcnt vmcnt(24)
+; CHECK-NEXT: buffer_store_dword v49, v0, s[0:3], 0 offen offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(23)
+; CHECK-NEXT: buffer_store_dword v50, v0, s[0:3], 0 offen offset:92
+; CHECK-NEXT: s_waitcnt vmcnt(22)
+; CHECK-NEXT: buffer_store_dword v51, v0, s[0:3], 0 offen offset:88
+; CHECK-NEXT: s_waitcnt vmcnt(21)
+; CHECK-NEXT: buffer_store_dword v52, v0, s[0:3], 0 offen offset:84
+; CHECK-NEXT: s_waitcnt vmcnt(20)
+; CHECK-NEXT: buffer_store_dword v53, v0, s[0:3], 0 offen offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(19)
+; CHECK-NEXT: buffer_store_dword v54, v0, s[0:3], 0 offen offset:76
+; CHECK-NEXT: s_waitcnt vmcnt(18)
+; CHECK-NEXT: buffer_store_dword v55, v0, s[0:3], 0 offen offset:72
+; CHECK-NEXT: s_waitcnt vmcnt(17)
+; CHECK-NEXT: buffer_store_dword v64, v0, s[0:3], 0 offen offset:68
+; CHECK-NEXT: s_waitcnt vmcnt(16)
+; CHECK-NEXT: buffer_store_dword v65, v0, s[0:3], 0 offen offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: buffer_store_dword v66, v0, s[0:3], 0 offen offset:60
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: buffer_store_dword v67, v0, s[0:3], 0 offen offset:56
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: buffer_store_dword v68, v0, s[0:3], 0 offen offset:52
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: buffer_store_dword v69, v0, s[0:3], 0 offen offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: buffer_store_dword v70, v0, s[0:3], 0 offen offset:44
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: buffer_store_dword v71, v0, s[0:3], 0 offen offset:40
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: buffer_store_dword v80, v0, s[0:3], 0 offen offset:36
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: buffer_store_dword v81, v0, s[0:3], 0 offen offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: buffer_store_dword v82, v0, s[0:3], 0 offen offset:28
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: buffer_store_dword v83, v0, s[0:3], 0 offen offset:24
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: buffer_store_dword v84, v0, s[0:3], 0 offen offset:20
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: buffer_store_dword v85, v0, s[0:3], 0 offen offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: buffer_store_dword v86, v0, s[0:3], 0 offen offset:12
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: buffer_store_dword v87, v0, s[0:3], 0 offen offset:8
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: buffer_store_dword v96, v0, s[0:3], 0 offen offset:4
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: buffer_store_dword v97, v0, s[0:3], 0 offen
+; CHECK-NEXT: v_add_nc_u32_e32 v0, 0x100, v0
+; CHECK-NEXT: s_and_b32 vcc_lo, exec_lo, s6
+; CHECK-NEXT: s_cbranch_vccnz .LBB3_1
+; CHECK-NEXT: ; %bb.2: ; %memcpy-split
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memcpy_p5_p5_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0
+; ALIGNED-NEXT: .LBB3_1: ; %load-store-loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: buffer_load_ubyte v2, v1, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: buffer_load_ubyte v3, v1, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: buffer_load_ubyte v4, v1, s[0:3], 0 offen
+; ALIGNED-NEXT: buffer_load_ubyte v5, v1, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: buffer_load_ubyte v6, v1, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: buffer_load_ubyte v7, v1, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: buffer_load_ubyte v8, v1, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: buffer_load_ubyte v9, v1, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: buffer_load_ubyte v10, v1, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: buffer_load_ubyte v11, v1, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: buffer_load_ubyte v12, v1, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: buffer_load_ubyte v13, v1, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: buffer_load_ubyte v14, v1, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: buffer_load_ubyte v15, v1, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: buffer_load_ubyte v16, v1, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: buffer_load_ubyte v17, v1, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: s_add_u32 s4, s4, 1
+; ALIGNED-NEXT: s_addc_u32 s5, s5, 0
+; ALIGNED-NEXT: v_add_nc_u32_e32 v1, 16, v1
+; ALIGNED-NEXT: v_cmp_gt_u64_e64 s6, 0x80, s[4:5]
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: buffer_store_byte v2, v0, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: buffer_store_byte v3, v0, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: buffer_store_byte v4, v0, s[0:3], 0 offen
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: buffer_store_byte v5, v0, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: buffer_store_byte v6, v0, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: buffer_store_byte v7, v0, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: buffer_store_byte v8, v0, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: buffer_store_byte v9, v0, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: buffer_store_byte v10, v0, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: buffer_store_byte v11, v0, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: buffer_store_byte v12, v0, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: buffer_store_byte v13, v0, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: buffer_store_byte v14, v0, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: buffer_store_byte v15, v0, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: buffer_store_byte v16, v0, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: buffer_store_byte v17, v0, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: v_add_nc_u32_e32 v0, 16, v0
+; ALIGNED-NEXT: s_and_b32 vcc_lo, exec_lo, s6
+; ALIGNED-NEXT: s_cbranch_vccnz .LBB3_1
+; ALIGNED-NEXT: ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memcpy.p5.p5.i64(ptr addrspace(5) noundef nonnull align 1 %dst, ptr addrspace(5) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memcpy_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5) align 1 readonly %src) {
+; CHECK-LABEL: memcpy_p0_p5_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b64 s[6:7], 0
+; CHECK-NEXT: .LBB4_1: ; %load-store-loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0x3e
+; CHECK-NEXT: buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT: buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT: buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT: buffer_load_dword v7, v2, s[0:3], 0 offen offset:32
+; CHECK-NEXT: buffer_load_dword v8, v2, s[0:3], 0 offen offset:36
+; CHECK-NEXT: buffer_load_dword v9, v2, s[0:3], 0 offen offset:40
+; CHECK-NEXT: buffer_load_dword v10, v2, s[0:3], 0 offen offset:44
+; CHECK-NEXT: buffer_load_dword v11, v2, s[0:3], 0 offen offset:48
+; CHECK-NEXT: buffer_load_dword v12, v2, s[0:3], 0 offen offset:52
+; CHECK-NEXT: buffer_load_dword v13, v2, s[0:3], 0 offen offset:56
+; CHECK-NEXT: buffer_load_dword v14, v2, s[0:3], 0 offen offset:60
+; CHECK-NEXT: buffer_load_dword v18, v2, s[0:3], 0 offen offset:124
+; CHECK-NEXT: buffer_load_dword v17, v2, s[0:3], 0 offen offset:120
+; CHECK-NEXT: buffer_load_dword v16, v2, s[0:3], 0 offen offset:116
+; CHECK-NEXT: buffer_load_dword v15, v2, s[0:3], 0 offen offset:112
+; CHECK-NEXT: buffer_load_dword v22, v2, s[0:3], 0 offen offset:108
+; CHECK-NEXT: buffer_load_dword v21, v2, s[0:3], 0 offen offset:104
+; CHECK-NEXT: buffer_load_dword v20, v2, s[0:3], 0 offen offset:100
+; CHECK-NEXT: buffer_load_dword v19, v2, s[0:3], 0 offen offset:96
+; CHECK-NEXT: buffer_load_dword v26, v2, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_load_dword v25, v2, s[0:3], 0 offen offset:248
+; CHECK-NEXT: buffer_load_dword v24, v2, s[0:3], 0 offen offset:244
+; CHECK-NEXT: buffer_load_dword v23, v2, s[0:3], 0 offen offset:240
+; CHECK-NEXT: buffer_load_dword v30, v2, s[0:3], 0 offen offset:236
+; CHECK-NEXT: buffer_load_dword v29, v2, s[0:3], 0 offen offset:232
+; CHECK-NEXT: buffer_load_dword v28, v2, s[0:3], 0 offen offset:228
+; CHECK-NEXT: buffer_load_dword v27, v2, s[0:3], 0 offen offset:224
+; CHECK-NEXT: buffer_load_dword v34, v2, s[0:3], 0 offen offset:220
+; CHECK-NEXT: buffer_load_dword v33, v2, s[0:3], 0 offen offset:216
+; CHECK-NEXT: buffer_load_dword v32, v2, s[0:3], 0 offen offset:212
+; CHECK-NEXT: buffer_load_dword v31, v2, s[0:3], 0 offen offset:208
+; CHECK-NEXT: buffer_load_dword v38, v2, s[0:3], 0 offen offset:204
+; CHECK-NEXT: buffer_load_dword v37, v2, s[0:3], 0 offen offset:200
+; CHECK-NEXT: buffer_load_dword v36, v2, s[0:3], 0 offen offset:196
+; CHECK-NEXT: buffer_load_dword v35, v2, s[0:3], 0 offen offset:192
+; CHECK-NEXT: buffer_load_dword v51, v2, s[0:3], 0 offen offset:188
+; CHECK-NEXT: buffer_load_dword v50, v2, s[0:3], 0 offen offset:184
+; CHECK-NEXT: buffer_load_dword v49, v2, s[0:3], 0 offen offset:180
+; CHECK-NEXT: buffer_load_dword v48, v2, s[0:3], 0 offen offset:176
+; CHECK-NEXT: buffer_load_dword v55, v2, s[0:3], 0 offen offset:172
+; CHECK-NEXT: buffer_load_dword v54, v2, s[0:3], 0 offen offset:168
+; CHECK-NEXT: buffer_load_dword v53, v2, s[0:3], 0 offen offset:164
+; CHECK-NEXT: buffer_load_dword v52, v2, s[0:3], 0 offen offset:160
+; CHECK-NEXT: buffer_load_dword v67, v2, s[0:3], 0 offen offset:156
+; CHECK-NEXT: buffer_load_dword v66, v2, s[0:3], 0 offen offset:152
+; CHECK-NEXT: buffer_load_dword v65, v2, s[0:3], 0 offen offset:148
+; CHECK-NEXT: buffer_load_dword v64, v2, s[0:3], 0 offen offset:144
+; CHECK-NEXT: buffer_load_dword v71, v2, s[0:3], 0 offen offset:140
+; CHECK-NEXT: buffer_load_dword v70, v2, s[0:3], 0 offen offset:136
+; CHECK-NEXT: buffer_load_dword v69, v2, s[0:3], 0 offen offset:132
+; CHECK-NEXT: buffer_load_dword v68, v2, s[0:3], 0 offen offset:128
+; CHECK-NEXT: buffer_load_dword v83, v2, s[0:3], 0 offen offset:92
+; CHECK-NEXT: buffer_load_dword v82, v2, s[0:3], 0 offen offset:88
+; CHECK-NEXT: buffer_load_dword v81, v2, s[0:3], 0 offen offset:84
+; CHECK-NEXT: buffer_load_dword v80, v2, s[0:3], 0 offen offset:80
+; CHECK-NEXT: buffer_load_dword v87, v2, s[0:3], 0 offen offset:76
+; CHECK-NEXT: buffer_load_dword v86, v2, s[0:3], 0 offen offset:72
+; CHECK-NEXT: buffer_load_dword v85, v2, s[0:3], 0 offen offset:68
+; CHECK-NEXT: buffer_load_dword v84, v2, s[0:3], 0 offen offset:64
+; CHECK-NEXT: buffer_load_dword v96, v2, s[0:3], 0 offen
+; CHECK-NEXT: buffer_load_dword v97, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT: buffer_load_dword v98, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT: buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT: buffer_load_dword v99, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT: s_add_u32 s6, s6, 1
+; CHECK-NEXT: s_addc_u32 s7, s7, 0
+; CHECK-NEXT: v_add_nc_u32_e32 v2, 0x100, v2
+; CHECK-NEXT: v_cmp_lt_u64_e64 s4, s[6:7], 8
+; CHECK-NEXT: s_waitcnt vmcnt(41)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[23:26] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(37)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[27:30] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(33)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[31:34] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(29)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[35:38] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(25)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[48:51] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(21)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[52:55] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(17)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[64:67] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[68:71] offset:128
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[15:18] offset:112
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[19:22] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[80:83] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[84:87] offset:64
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[11:14] offset:48
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[7:10] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[3:6] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: flat_store_dwordx4 v[0:1], v[96:99]
+; CHECK-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; CHECK-NEXT: v_add_co_u32 v0, s4, 0x100, v0
+; CHECK-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; CHECK-NEXT: s_cbranch_vccnz .LBB4_1
+; CHECK-NEXT: ; %bb.2: ; %memcpy-split
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memcpy_p0_p5_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b64 s[6:7], 0
+; ALIGNED-NEXT: .LBB4_1: ; %load-store-loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: buffer_load_ubyte v3, v2, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: buffer_load_ubyte v4, v2, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: buffer_load_ubyte v5, v2, s[0:3], 0 offen
+; ALIGNED-NEXT: buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: buffer_load_ubyte v7, v2, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: buffer_load_ubyte v9, v2, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: buffer_load_ubyte v12, v2, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: buffer_load_ubyte v13, v2, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: buffer_load_ubyte v14, v2, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: buffer_load_ubyte v15, v2, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: buffer_load_ubyte v16, v2, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: buffer_load_ubyte v17, v2, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: buffer_load_ubyte v18, v2, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: s_add_u32 s6, s6, 1
+; ALIGNED-NEXT: s_addc_u32 s7, s7, 0
+; ALIGNED-NEXT: v_add_nc_u32_e32 v2, 16, v2
+; ALIGNED-NEXT: v_cmp_gt_u64_e64 s4, 0x80, s[6:7]
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v3 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v4 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v5
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v6 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v7 offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v8 offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v9 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v10 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v11 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v12 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v13 offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v14 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v15 offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v16 offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v17 offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: flat_store_byte v[0:1], v18 offset:13
+; ALIGNED-NEXT: s_and_b32 vcc_lo, exec_lo, s4
+; ALIGNED-NEXT: v_add_co_u32 v0, s4, v0, 16
+; ALIGNED-NEXT: v_add_co_ci_u32_e64 v1, s4, 0, v1, s4
+; ALIGNED-NEXT: s_cbranch_vccnz .LBB4_1
+; ALIGNED-NEXT: ; %bb.2: ; %memcpy-split
+; ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memcpy.p0.p5.i64(ptr addrspace(0) noundef nonnull align 1 %dst, ptr addrspace(5) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+
+; memmove for address spaces 0, 1, 4, 5
+
+define void @memmove_p0_p0_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(0) align 1 readonly %src) {
+; CHECK-LABEL: memmove_p0_p0_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b32 s4, exec_lo
+; CHECK-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
+; CHECK-NEXT: s_xor_b32 s6, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execz .LBB5_3
+; CHECK-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; CHECK-NEXT: s_mov_b64 s[4:5], 0
+; CHECK-NEXT: .LBB5_2: ; %memmove_fwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: v_add_co_u32 v96, vcc_lo, v2, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v97, vcc_lo, s5, v3, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: flat_load_dwordx4 v[4:7], v[96:97] offset:224
+; CHECK-NEXT: flat_load_dwordx4 v[8:11], v[96:97] offset:240
+; CHECK-NEXT: flat_load_dwordx4 v[12:15], v[96:97] offset:192
+; CHECK-NEXT: flat_load_dwordx4 v[16:19], v[96:97] offset:208
+; CHECK-NEXT: flat_load_dwordx4 v[20:23], v[96:97] offset:160
+; CHECK-NEXT: flat_load_dwordx4 v[24:27], v[96:97] offset:176
+; CHECK-NEXT: flat_load_dwordx4 v[28:31], v[96:97] offset:128
+; CHECK-NEXT: flat_load_dwordx4 v[32:35], v[96:97] offset:144
+; CHECK-NEXT: flat_load_dwordx4 v[36:39], v[96:97] offset:96
+; CHECK-NEXT: flat_load_dwordx4 v[48:51], v[96:97] offset:112
+; CHECK-NEXT: flat_load_dwordx4 v[52:55], v[96:97] offset:64
+; CHECK-NEXT: flat_load_dwordx4 v[64:67], v[96:97] offset:80
+; CHECK-NEXT: flat_load_dwordx4 v[68:71], v[96:97] offset:32
+; CHECK-NEXT: flat_load_dwordx4 v[80:83], v[96:97] offset:48
+; CHECK-NEXT: flat_load_dwordx4 v[84:87], v[96:97]
+; CHECK-NEXT: flat_load_dwordx4 v[96:99], v[96:97] offset:16
+; CHECK-NEXT: s_add_u32 s4, s4, 0x100
+; CHECK-NEXT: s_addc_u32 s5, s5, 0
+; CHECK-NEXT: s_waitcnt vmcnt(15) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[4:7] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(14) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[8:11] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(13) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[12:15] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(12) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[16:19] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(11) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[20:23] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(10) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[24:27] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(9) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[28:31] offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(8) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[32:35] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(7) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[36:39] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(6) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[48:51] offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(5) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[52:55] offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(4) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[64:67] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(3) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[68:71] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(2) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[80:83] offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(1) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[84:87]
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[96:99] offset:16
+; CHECK-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; CHECK-NEXT: s_cbranch_scc1 .LBB5_2
+; CHECK-NEXT: .LBB5_3: ; %Flow9
+; CHECK-NEXT: s_andn2_saveexec_b32 s8, s6
+; CHECK-NEXT: s_cbranch_execz .LBB5_6
+; CHECK-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; CHECK-NEXT: s_movk_i32 s6, 0xff00
+; CHECK-NEXT: s_mov_b64 s[4:5], 0x700
+; CHECK-NEXT: s_mov_b32 s7, -1
+; CHECK-NEXT: .LBB5_5: ; %memmove_bwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: v_add_co_u32 v96, vcc_lo, v2, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v97, vcc_lo, s5, v3, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: flat_load_dwordx4 v[4:7], v[96:97] offset:224
+; CHECK-NEXT: flat_load_dwordx4 v[8:11], v[96:97] offset:240
+; CHECK-NEXT: flat_load_dwordx4 v[12:15], v[96:97] offset:192
+; CHECK-NEXT: flat_load_dwordx4 v[16:19], v[96:97] offset:208
+; CHECK-NEXT: flat_load_dwordx4 v[20:23], v[96:97] offset:160
+; CHECK-NEXT: flat_load_dwordx4 v[24:27], v[96:97] offset:176
+; CHECK-NEXT: flat_load_dwordx4 v[28:31], v[96:97] offset:128
+; CHECK-NEXT: flat_load_dwordx4 v[32:35], v[96:97] offset:144
+; CHECK-NEXT: flat_load_dwordx4 v[36:39], v[96:97] offset:96
+; CHECK-NEXT: flat_load_dwordx4 v[48:51], v[96:97] offset:112
+; CHECK-NEXT: flat_load_dwordx4 v[52:55], v[96:97] offset:64
+; CHECK-NEXT: flat_load_dwordx4 v[64:67], v[96:97] offset:80
+; CHECK-NEXT: flat_load_dwordx4 v[68:71], v[96:97] offset:32
+; CHECK-NEXT: flat_load_dwordx4 v[80:83], v[96:97] offset:48
+; CHECK-NEXT: flat_load_dwordx4 v[84:87], v[96:97]
+; CHECK-NEXT: flat_load_dwordx4 v[96:99], v[96:97] offset:16
+; CHECK-NEXT: s_add_u32 s4, s4, 0xffffff00
+; CHECK-NEXT: s_addc_u32 s5, s5, -1
+; CHECK-NEXT: s_waitcnt vmcnt(15) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[4:7] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(14) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[8:11] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(13) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[12:15] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(12) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[16:19] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(11) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[20:23] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(10) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[24:27] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(9) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[28:31] offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(8) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[32:35] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(7) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[36:39] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(6) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[48:51] offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(5) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[52:55] offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(4) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[64:67] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(3) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[68:71] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(2) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[80:83] offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(1) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[84:87]
+; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[96:99] offset:16
+; CHECK-NEXT: s_cmp_eq_u64 s[4:5], s[6:7]
+; CHECK-NEXT: s_cbranch_scc0 .LBB5_5
+; CHECK-NEXT: .LBB5_6: ; %Flow10
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memmove_p0_p0_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b32 s4, exec_lo
+; ALIGNED-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
+; ALIGNED-NEXT: s_xor_b32 s6, exec_lo, s4
+; ALIGNED-NEXT: s_cbranch_execz .LBB5_3
+; ALIGNED-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0
+; ALIGNED-NEXT: .LBB5_2: ; %memmove_fwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v2, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v3, vcc_lo
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: flat_load_ubyte v6, v[4:5] offset:2
+; ALIGNED-NEXT: flat_load_ubyte v7, v[4:5] offset:3
+; ALIGNED-NEXT: flat_load_ubyte v8, v[4:5]
+; ALIGNED-NEXT: flat_load_ubyte v9, v[4:5] offset:1
+; ALIGNED-NEXT: flat_load_ubyte v10, v[4:5] offset:6
+; ALIGNED-NEXT: flat_load_ubyte v11, v[4:5] offset:7
+; ALIGNED-NEXT: flat_load_ubyte v12, v[4:5] offset:4
+; ALIGNED-NEXT: flat_load_ubyte v13, v[4:5] offset:5
+; ALIGNED-NEXT: flat_load_ubyte v14, v[4:5] offset:10
+; ALIGNED-NEXT: flat_load_ubyte v15, v[4:5] offset:11
+; ALIGNED-NEXT: flat_load_ubyte v16, v[4:5] offset:8
+; ALIGNED-NEXT: flat_load_ubyte v17, v[4:5] offset:9
+; ALIGNED-NEXT: flat_load_ubyte v18, v[4:5] offset:14
+; ALIGNED-NEXT: flat_load_ubyte v19, v[4:5] offset:15
+; ALIGNED-NEXT: flat_load_ubyte v20, v[4:5] offset:12
+; ALIGNED-NEXT: flat_load_ubyte v21, v[4:5] offset:13
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: s_add_u32 s4, s4, 16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, 0
+; ALIGNED-NEXT: s_waitcnt vmcnt(15) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v6 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v7 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v8
+; ALIGNED-NEXT: s_waitcnt vmcnt(12) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v9 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v10 offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v11 offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v12 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v13 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v14 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v15 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v16 offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v17 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v18 offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v19 offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v20 offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v21 offset:13
+; ALIGNED-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; ALIGNED-NEXT: s_cbranch_scc1 .LBB5_2
+; ALIGNED-NEXT: .LBB5_3: ; %Flow9
+; ALIGNED-NEXT: s_andn2_saveexec_b32 s6, s6
+; ALIGNED-NEXT: s_cbranch_execz .LBB5_6
+; ALIGNED-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0x7f0
+; ALIGNED-NEXT: .LBB5_5: ; %memmove_bwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v2, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v3, vcc_lo
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: flat_load_ubyte v6, v[4:5] offset:2
+; ALIGNED-NEXT: flat_load_ubyte v7, v[4:5] offset:3
+; ALIGNED-NEXT: flat_load_ubyte v8, v[4:5]
+; ALIGNED-NEXT: flat_load_ubyte v9, v[4:5] offset:1
+; ALIGNED-NEXT: flat_load_ubyte v10, v[4:5] offset:6
+; ALIGNED-NEXT: flat_load_ubyte v11, v[4:5] offset:7
+; ALIGNED-NEXT: flat_load_ubyte v12, v[4:5] offset:4
+; ALIGNED-NEXT: flat_load_ubyte v13, v[4:5] offset:5
+; ALIGNED-NEXT: flat_load_ubyte v14, v[4:5] offset:10
+; ALIGNED-NEXT: flat_load_ubyte v15, v[4:5] offset:11
+; ALIGNED-NEXT: flat_load_ubyte v16, v[4:5] offset:8
+; ALIGNED-NEXT: flat_load_ubyte v17, v[4:5] offset:9
+; ALIGNED-NEXT: flat_load_ubyte v18, v[4:5] offset:14
+; ALIGNED-NEXT: flat_load_ubyte v19, v[4:5] offset:15
+; ALIGNED-NEXT: flat_load_ubyte v20, v[4:5] offset:12
+; ALIGNED-NEXT: flat_load_ubyte v21, v[4:5] offset:13
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: s_add_u32 s4, s4, -16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, -1
+; ALIGNED-NEXT: s_waitcnt vmcnt(15) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v6 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v7 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v8
+; ALIGNED-NEXT: s_waitcnt vmcnt(12) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v9 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v10 offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v11 offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v12 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v13 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v14 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v15 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v16 offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v17 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v18 offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v19 offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v20 offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) lgkmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v21 offset:13
+; ALIGNED-NEXT: s_cmp_eq_u64 s[4:5], -16
+; ALIGNED-NEXT: s_cbranch_scc0 .LBB5_5
+; ALIGNED-NEXT: .LBB5_6: ; %Flow10
+; ALIGNED-NEXT: s_or_b32 exec_lo, exec_lo, s6
+; ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memmove.p0.p0.i64(ptr addrspace(0) noundef nonnull align 1 %dst, ptr addrspace(0) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memmove_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align 1 readonly %src) {
+; CHECK-LABEL: memmove_p1_p1_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b32 s4, exec_lo
+; CHECK-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
+; CHECK-NEXT: s_xor_b32 s6, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execz .LBB6_3
+; CHECK-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; CHECK-NEXT: s_mov_b64 s[4:5], 0
+; CHECK-NEXT: .LBB6_2: ; %memmove_fwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: v_add_co_u32 v96, vcc_lo, v2, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v97, vcc_lo, s5, v3, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: global_load_dwordx4 v[4:7], v[96:97], off offset:224
+; CHECK-NEXT: global_load_dwordx4 v[8:11], v[96:97], off offset:240
+; CHECK-NEXT: global_load_dwordx4 v[12:15], v[96:97], off offset:192
+; CHECK-NEXT: global_load_dwordx4 v[16:19], v[96:97], off offset:208
+; CHECK-NEXT: global_load_dwordx4 v[20:23], v[96:97], off offset:160
+; CHECK-NEXT: global_load_dwordx4 v[24:27], v[96:97], off offset:176
+; CHECK-NEXT: global_load_dwordx4 v[28:31], v[96:97], off offset:128
+; CHECK-NEXT: global_load_dwordx4 v[32:35], v[96:97], off offset:144
+; CHECK-NEXT: global_load_dwordx4 v[36:39], v[96:97], off offset:96
+; CHECK-NEXT: global_load_dwordx4 v[48:51], v[96:97], off offset:112
+; CHECK-NEXT: global_load_dwordx4 v[52:55], v[96:97], off offset:64
+; CHECK-NEXT: global_load_dwordx4 v[64:67], v[96:97], off offset:80
+; CHECK-NEXT: global_load_dwordx4 v[68:71], v[96:97], off offset:32
+; CHECK-NEXT: global_load_dwordx4 v[80:83], v[96:97], off offset:48
+; CHECK-NEXT: global_load_dwordx4 v[84:87], v[96:97], off
+; CHECK-NEXT: global_load_dwordx4 v[96:99], v[96:97], off offset:16
+; CHECK-NEXT: s_add_u32 s4, s4, 0x100
+; CHECK-NEXT: s_addc_u32 s5, s5, 0
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[4:7], off offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[8:11], off offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[12:15], off offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[16:19], off offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[20:23], off offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[24:27], off offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[28:31], off offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[32:35], off offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[36:39], off offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[48:51], off offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[52:55], off offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[64:67], off offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[68:71], off offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[80:83], off offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[84:87], off
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[96:99], off offset:16
+; CHECK-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; CHECK-NEXT: s_cbranch_scc1 .LBB6_2
+; CHECK-NEXT: .LBB6_3: ; %Flow13
+; CHECK-NEXT: s_andn2_saveexec_b32 s8, s6
+; CHECK-NEXT: s_cbranch_execz .LBB6_6
+; CHECK-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; CHECK-NEXT: s_movk_i32 s6, 0xff00
+; CHECK-NEXT: s_mov_b64 s[4:5], 0x700
+; CHECK-NEXT: s_mov_b32 s7, -1
+; CHECK-NEXT: .LBB6_5: ; %memmove_bwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: v_add_co_u32 v96, vcc_lo, v2, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v97, vcc_lo, s5, v3, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: global_load_dwordx4 v[4:7], v[96:97], off offset:224
+; CHECK-NEXT: global_load_dwordx4 v[8:11], v[96:97], off offset:240
+; CHECK-NEXT: global_load_dwordx4 v[12:15], v[96:97], off offset:192
+; CHECK-NEXT: global_load_dwordx4 v[16:19], v[96:97], off offset:208
+; CHECK-NEXT: global_load_dwordx4 v[20:23], v[96:97], off offset:160
+; CHECK-NEXT: global_load_dwordx4 v[24:27], v[96:97], off offset:176
+; CHECK-NEXT: global_load_dwordx4 v[28:31], v[96:97], off offset:128
+; CHECK-NEXT: global_load_dwordx4 v[32:35], v[96:97], off offset:144
+; CHECK-NEXT: global_load_dwordx4 v[36:39], v[96:97], off offset:96
+; CHECK-NEXT: global_load_dwordx4 v[48:51], v[96:97], off offset:112
+; CHECK-NEXT: global_load_dwordx4 v[52:55], v[96:97], off offset:64
+; CHECK-NEXT: global_load_dwordx4 v[64:67], v[96:97], off offset:80
+; CHECK-NEXT: global_load_dwordx4 v[68:71], v[96:97], off offset:32
+; CHECK-NEXT: global_load_dwordx4 v[80:83], v[96:97], off offset:48
+; CHECK-NEXT: global_load_dwordx4 v[84:87], v[96:97], off
+; CHECK-NEXT: global_load_dwordx4 v[96:99], v[96:97], off offset:16
+; CHECK-NEXT: s_add_u32 s4, s4, 0xffffff00
+; CHECK-NEXT: s_addc_u32 s5, s5, -1
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[4:7], off offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[8:11], off offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[12:15], off offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[16:19], off offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[20:23], off offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[24:27], off offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[28:31], off offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[32:35], off offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[36:39], off offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[48:51], off offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[52:55], off offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[64:67], off offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[68:71], off offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[80:83], off offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[84:87], off
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: global_store_dwordx4 v[100:101], v[96:99], off offset:16
+; CHECK-NEXT: s_cmp_eq_u64 s[4:5], s[6:7]
+; CHECK-NEXT: s_cbranch_scc0 .LBB6_5
+; CHECK-NEXT: .LBB6_6: ; %Flow14
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memmove_p1_p1_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b32 s4, exec_lo
+; ALIGNED-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
+; ALIGNED-NEXT: s_xor_b32 s6, exec_lo, s4
+; ALIGNED-NEXT: s_cbranch_execz .LBB6_3
+; ALIGNED-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0
+; ALIGNED-NEXT: .LBB6_2: ; %memmove_fwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v2, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v3, vcc_lo
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: global_load_ubyte v6, v[4:5], off offset:2
+; ALIGNED-NEXT: global_load_ubyte v7, v[4:5], off offset:3
+; ALIGNED-NEXT: global_load_ubyte v8, v[4:5], off
+; ALIGNED-NEXT: global_load_ubyte v9, v[4:5], off offset:1
+; ALIGNED-NEXT: global_load_ubyte v10, v[4:5], off offset:6
+; ALIGNED-NEXT: global_load_ubyte v11, v[4:5], off offset:7
+; ALIGNED-NEXT: global_load_ubyte v12, v[4:5], off offset:4
+; ALIGNED-NEXT: global_load_ubyte v13, v[4:5], off offset:5
+; ALIGNED-NEXT: global_load_ubyte v14, v[4:5], off offset:10
+; ALIGNED-NEXT: global_load_ubyte v15, v[4:5], off offset:11
+; ALIGNED-NEXT: global_load_ubyte v16, v[4:5], off offset:8
+; ALIGNED-NEXT: global_load_ubyte v17, v[4:5], off offset:9
+; ALIGNED-NEXT: global_load_ubyte v18, v[4:5], off offset:14
+; ALIGNED-NEXT: global_load_ubyte v19, v[4:5], off offset:15
+; ALIGNED-NEXT: global_load_ubyte v20, v[4:5], off offset:12
+; ALIGNED-NEXT: global_load_ubyte v21, v[4:5], off offset:13
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: s_add_u32 s4, s4, 16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, 0
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: global_store_byte v[4:5], v6, off offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: global_store_byte v[4:5], v7, off offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: global_store_byte v[4:5], v8, off
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: global_store_byte v[4:5], v9, off offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: global_store_byte v[4:5], v10, off offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: global_store_byte v[4:5], v11, off offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: global_store_byte v[4:5], v12, off offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: global_store_byte v[4:5], v13, off offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: global_store_byte v[4:5], v14, off offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: global_store_byte v[4:5], v15, off offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: global_store_byte v[4:5], v16, off offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: global_store_byte v[4:5], v17, off offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: global_store_byte v[4:5], v18, off offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: global_store_byte v[4:5], v19, off offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: global_store_byte v[4:5], v20, off offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: global_store_byte v[4:5], v21, off offset:13
+; ALIGNED-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; ALIGNED-NEXT: s_cbranch_scc1 .LBB6_2
+; ALIGNED-NEXT: .LBB6_3: ; %Flow13
+; ALIGNED-NEXT: s_andn2_saveexec_b32 s6, s6
+; ALIGNED-NEXT: s_cbranch_execz .LBB6_6
+; ALIGNED-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0x7f0
+; ALIGNED-NEXT: .LBB6_5: ; %memmove_bwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v2, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v3, vcc_lo
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: global_load_ubyte v6, v[4:5], off offset:2
+; ALIGNED-NEXT: global_load_ubyte v7, v[4:5], off offset:3
+; ALIGNED-NEXT: global_load_ubyte v8, v[4:5], off
+; ALIGNED-NEXT: global_load_ubyte v9, v[4:5], off offset:1
+; ALIGNED-NEXT: global_load_ubyte v10, v[4:5], off offset:6
+; ALIGNED-NEXT: global_load_ubyte v11, v[4:5], off offset:7
+; ALIGNED-NEXT: global_load_ubyte v12, v[4:5], off offset:4
+; ALIGNED-NEXT: global_load_ubyte v13, v[4:5], off offset:5
+; ALIGNED-NEXT: global_load_ubyte v14, v[4:5], off offset:10
+; ALIGNED-NEXT: global_load_ubyte v15, v[4:5], off offset:11
+; ALIGNED-NEXT: global_load_ubyte v16, v[4:5], off offset:8
+; ALIGNED-NEXT: global_load_ubyte v17, v[4:5], off offset:9
+; ALIGNED-NEXT: global_load_ubyte v18, v[4:5], off offset:14
+; ALIGNED-NEXT: global_load_ubyte v19, v[4:5], off offset:15
+; ALIGNED-NEXT: global_load_ubyte v20, v[4:5], off offset:12
+; ALIGNED-NEXT: global_load_ubyte v21, v[4:5], off offset:13
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: s_add_u32 s4, s4, -16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, -1
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: global_store_byte v[4:5], v6, off offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: global_store_byte v[4:5], v7, off offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: global_store_byte v[4:5], v8, off
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: global_store_byte v[4:5], v9, off offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: global_store_byte v[4:5], v10, off offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: global_store_byte v[4:5], v11, off offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: global_store_byte v[4:5], v12, off offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: global_store_byte v[4:5], v13, off offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: global_store_byte v[4:5], v14, off offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: global_store_byte v[4:5], v15, off offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: global_store_byte v[4:5], v16, off offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: global_store_byte v[4:5], v17, off offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: global_store_byte v[4:5], v18, off offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: global_store_byte v[4:5], v19, off offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: global_store_byte v[4:5], v20, off offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: global_store_byte v[4:5], v21, off offset:13
+; ALIGNED-NEXT: s_cmp_eq_u64 s[4:5], -16
+; ALIGNED-NEXT: s_cbranch_scc0 .LBB6_5
+; ALIGNED-NEXT: .LBB6_6: ; %Flow14
+; ALIGNED-NEXT: s_or_b32 exec_lo, exec_lo, s6
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memmove.p1.p1.i64(ptr addrspace(1) noundef nonnull align 1 %dst, ptr addrspace(1) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memmove_p0_p4_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(4) align 1 readonly %src) {
+; CHECK-LABEL: memmove_p0_p4_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b32 s4, exec_lo
+; CHECK-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
+; CHECK-NEXT: s_xor_b32 s6, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execz .LBB7_3
+; CHECK-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; CHECK-NEXT: s_mov_b64 s[4:5], 0
+; CHECK-NEXT: .LBB7_2: ; %memmove_fwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: v_add_co_u32 v96, vcc_lo, v2, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v97, vcc_lo, s5, v3, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: global_load_dwordx4 v[4:7], v[96:97], off offset:240
+; CHECK-NEXT: global_load_dwordx4 v[8:11], v[96:97], off offset:224
+; CHECK-NEXT: global_load_dwordx4 v[12:15], v[96:97], off offset:208
+; CHECK-NEXT: global_load_dwordx4 v[16:19], v[96:97], off offset:192
+; CHECK-NEXT: global_load_dwordx4 v[20:23], v[96:97], off offset:176
+; CHECK-NEXT: global_load_dwordx4 v[24:27], v[96:97], off offset:160
+; CHECK-NEXT: global_load_dwordx4 v[28:31], v[96:97], off offset:144
+; CHECK-NEXT: global_load_dwordx4 v[32:35], v[96:97], off offset:128
+; CHECK-NEXT: global_load_dwordx4 v[36:39], v[96:97], off offset:112
+; CHECK-NEXT: global_load_dwordx4 v[48:51], v[96:97], off offset:96
+; CHECK-NEXT: global_load_dwordx4 v[52:55], v[96:97], off offset:80
+; CHECK-NEXT: global_load_dwordx4 v[64:67], v[96:97], off offset:64
+; CHECK-NEXT: global_load_dwordx4 v[68:71], v[96:97], off offset:48
+; CHECK-NEXT: global_load_dwordx4 v[80:83], v[96:97], off offset:32
+; CHECK-NEXT: global_load_dwordx4 v[84:87], v[96:97], off offset:16
+; CHECK-NEXT: global_load_dwordx4 v[96:99], v[96:97], off
+; CHECK-NEXT: s_add_u32 s4, s4, 0x100
+; CHECK-NEXT: s_addc_u32 s5, s5, 0
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[4:7] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[8:11] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[12:15] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[16:19] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[20:23] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[24:27] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[28:31] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[32:35] offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[36:39] offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[48:51] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[52:55] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[64:67] offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[68:71] offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[80:83] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[84:87] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[96:99]
+; CHECK-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; CHECK-NEXT: s_cbranch_scc1 .LBB7_2
+; CHECK-NEXT: .LBB7_3: ; %Flow9
+; CHECK-NEXT: s_andn2_saveexec_b32 s8, s6
+; CHECK-NEXT: s_cbranch_execz .LBB7_6
+; CHECK-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; CHECK-NEXT: s_movk_i32 s6, 0xff00
+; CHECK-NEXT: s_mov_b64 s[4:5], 0x700
+; CHECK-NEXT: s_mov_b32 s7, -1
+; CHECK-NEXT: .LBB7_5: ; %memmove_bwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: v_add_co_u32 v96, vcc_lo, v2, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v97, vcc_lo, s5, v3, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: s_clause 0xf
+; CHECK-NEXT: global_load_dwordx4 v[4:7], v[96:97], off offset:240
+; CHECK-NEXT: global_load_dwordx4 v[8:11], v[96:97], off offset:224
+; CHECK-NEXT: global_load_dwordx4 v[12:15], v[96:97], off offset:208
+; CHECK-NEXT: global_load_dwordx4 v[16:19], v[96:97], off offset:192
+; CHECK-NEXT: global_load_dwordx4 v[20:23], v[96:97], off offset:176
+; CHECK-NEXT: global_load_dwordx4 v[24:27], v[96:97], off offset:160
+; CHECK-NEXT: global_load_dwordx4 v[28:31], v[96:97], off offset:144
+; CHECK-NEXT: global_load_dwordx4 v[32:35], v[96:97], off offset:128
+; CHECK-NEXT: global_load_dwordx4 v[36:39], v[96:97], off offset:112
+; CHECK-NEXT: global_load_dwordx4 v[48:51], v[96:97], off offset:96
+; CHECK-NEXT: global_load_dwordx4 v[52:55], v[96:97], off offset:80
+; CHECK-NEXT: global_load_dwordx4 v[64:67], v[96:97], off offset:64
+; CHECK-NEXT: global_load_dwordx4 v[68:71], v[96:97], off offset:48
+; CHECK-NEXT: global_load_dwordx4 v[80:83], v[96:97], off offset:32
+; CHECK-NEXT: global_load_dwordx4 v[84:87], v[96:97], off offset:16
+; CHECK-NEXT: global_load_dwordx4 v[96:99], v[96:97], off
+; CHECK-NEXT: s_add_u32 s4, s4, 0xffffff00
+; CHECK-NEXT: s_addc_u32 s5, s5, -1
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[4:7] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[8:11] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[12:15] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[16:19] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[20:23] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[24:27] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[28:31] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[32:35] offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[36:39] offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[48:51] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[52:55] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[64:67] offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[68:71] offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[80:83] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[84:87] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[96:99]
+; CHECK-NEXT: s_cmp_eq_u64 s[4:5], s[6:7]
+; CHECK-NEXT: s_cbranch_scc0 .LBB7_5
+; CHECK-NEXT: .LBB7_6: ; %Flow10
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memmove_p0_p4_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b32 s4, exec_lo
+; ALIGNED-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
+; ALIGNED-NEXT: s_xor_b32 s6, exec_lo, s4
+; ALIGNED-NEXT: s_cbranch_execz .LBB7_3
+; ALIGNED-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0
+; ALIGNED-NEXT: .LBB7_2: ; %memmove_fwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v2, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v3, vcc_lo
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: global_load_ubyte v6, v[4:5], off offset:5
+; ALIGNED-NEXT: global_load_ubyte v7, v[4:5], off offset:6
+; ALIGNED-NEXT: global_load_ubyte v8, v[4:5], off offset:7
+; ALIGNED-NEXT: global_load_ubyte v9, v[4:5], off offset:3
+; ALIGNED-NEXT: global_load_ubyte v10, v[4:5], off offset:2
+; ALIGNED-NEXT: global_load_ubyte v11, v[4:5], off offset:1
+; ALIGNED-NEXT: global_load_ubyte v12, v[4:5], off
+; ALIGNED-NEXT: global_load_ubyte v13, v[4:5], off offset:4
+; ALIGNED-NEXT: global_load_ubyte v14, v[4:5], off offset:13
+; ALIGNED-NEXT: global_load_ubyte v15, v[4:5], off offset:14
+; ALIGNED-NEXT: global_load_ubyte v16, v[4:5], off offset:15
+; ALIGNED-NEXT: global_load_ubyte v17, v[4:5], off offset:11
+; ALIGNED-NEXT: global_load_ubyte v18, v[4:5], off offset:10
+; ALIGNED-NEXT: global_load_ubyte v19, v[4:5], off offset:9
+; ALIGNED-NEXT: global_load_ubyte v20, v[4:5], off offset:8
+; ALIGNED-NEXT: global_load_ubyte v21, v[4:5], off offset:12
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: s_add_u32 s4, s4, 16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, 0
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v9 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v10 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v11 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v12
+; ALIGNED-NEXT: flat_store_byte v[4:5], v8 offset:7
+; ALIGNED-NEXT: flat_store_byte v[4:5], v7 offset:6
+; ALIGNED-NEXT: flat_store_byte v[4:5], v6 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v13 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v17 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v18 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v19 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v20 offset:8
+; ALIGNED-NEXT: flat_store_byte v[4:5], v16 offset:15
+; ALIGNED-NEXT: flat_store_byte v[4:5], v15 offset:14
+; ALIGNED-NEXT: flat_store_byte v[4:5], v14 offset:13
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v21 offset:12
+; ALIGNED-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; ALIGNED-NEXT: s_cbranch_scc1 .LBB7_2
+; ALIGNED-NEXT: .LBB7_3: ; %Flow9
+; ALIGNED-NEXT: s_andn2_saveexec_b32 s6, s6
+; ALIGNED-NEXT: s_cbranch_execz .LBB7_6
+; ALIGNED-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0x7f0
+; ALIGNED-NEXT: .LBB7_5: ; %memmove_bwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v2, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v3, vcc_lo
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: global_load_ubyte v6, v[4:5], off offset:5
+; ALIGNED-NEXT: global_load_ubyte v7, v[4:5], off offset:6
+; ALIGNED-NEXT: global_load_ubyte v8, v[4:5], off offset:7
+; ALIGNED-NEXT: global_load_ubyte v9, v[4:5], off offset:3
+; ALIGNED-NEXT: global_load_ubyte v10, v[4:5], off offset:2
+; ALIGNED-NEXT: global_load_ubyte v11, v[4:5], off offset:1
+; ALIGNED-NEXT: global_load_ubyte v12, v[4:5], off
+; ALIGNED-NEXT: global_load_ubyte v13, v[4:5], off offset:4
+; ALIGNED-NEXT: global_load_ubyte v14, v[4:5], off offset:13
+; ALIGNED-NEXT: global_load_ubyte v15, v[4:5], off offset:14
+; ALIGNED-NEXT: global_load_ubyte v16, v[4:5], off offset:15
+; ALIGNED-NEXT: global_load_ubyte v17, v[4:5], off offset:11
+; ALIGNED-NEXT: global_load_ubyte v18, v[4:5], off offset:10
+; ALIGNED-NEXT: global_load_ubyte v19, v[4:5], off offset:9
+; ALIGNED-NEXT: global_load_ubyte v20, v[4:5], off offset:8
+; ALIGNED-NEXT: global_load_ubyte v21, v[4:5], off offset:12
+; ALIGNED-NEXT: v_add_co_u32 v4, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: s_add_u32 s4, s4, -16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, -1
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v9 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v10 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v11 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v12
+; ALIGNED-NEXT: flat_store_byte v[4:5], v8 offset:7
+; ALIGNED-NEXT: flat_store_byte v[4:5], v7 offset:6
+; ALIGNED-NEXT: flat_store_byte v[4:5], v6 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v13 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v17 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v18 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v19 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v20 offset:8
+; ALIGNED-NEXT: flat_store_byte v[4:5], v16 offset:15
+; ALIGNED-NEXT: flat_store_byte v[4:5], v15 offset:14
+; ALIGNED-NEXT: flat_store_byte v[4:5], v14 offset:13
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: flat_store_byte v[4:5], v21 offset:12
+; ALIGNED-NEXT: s_cmp_eq_u64 s[4:5], -16
+; ALIGNED-NEXT: s_cbranch_scc0 .LBB7_5
+; ALIGNED-NEXT: .LBB7_6: ; %Flow10
+; ALIGNED-NEXT: s_or_b32 exec_lo, exec_lo, s6
+; ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memmove.p0.p4.i64(ptr addrspace(0) noundef nonnull align 1 %dst, ptr addrspace(4) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memmove_p5_p5_sz2048(ptr addrspace(5) align 1 %dst, ptr addrspace(5) align 1 readonly %src) {
+; CHECK-LABEL: memmove_p5_p5_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: s_mov_b32 s4, exec_lo
+; CHECK-NEXT: v_cmpx_ge_u32_e64 v1, v0
+; CHECK-NEXT: s_xor_b32 s6, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execz .LBB8_3
+; CHECK-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; CHECK-NEXT: s_mov_b64 s[4:5], 8
+; CHECK-NEXT: .LBB8_2: ; %memmove_fwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0x3e
+; CHECK-NEXT: buffer_load_dword v2, v1, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_load_dword v3, v1, s[0:3], 0 offen offset:248
+; CHECK-NEXT: buffer_load_dword v4, v1, s[0:3], 0 offen offset:244
+; CHECK-NEXT: buffer_load_dword v5, v1, s[0:3], 0 offen offset:240
+; CHECK-NEXT: buffer_load_dword v6, v1, s[0:3], 0 offen offset:236
+; CHECK-NEXT: buffer_load_dword v7, v1, s[0:3], 0 offen offset:232
+; CHECK-NEXT: buffer_load_dword v8, v1, s[0:3], 0 offen offset:228
+; CHECK-NEXT: buffer_load_dword v9, v1, s[0:3], 0 offen offset:224
+; CHECK-NEXT: buffer_load_dword v10, v1, s[0:3], 0 offen offset:220
+; CHECK-NEXT: buffer_load_dword v11, v1, s[0:3], 0 offen offset:216
+; CHECK-NEXT: buffer_load_dword v12, v1, s[0:3], 0 offen offset:212
+; CHECK-NEXT: buffer_load_dword v13, v1, s[0:3], 0 offen offset:208
+; CHECK-NEXT: buffer_load_dword v14, v1, s[0:3], 0 offen offset:204
+; CHECK-NEXT: buffer_load_dword v15, v1, s[0:3], 0 offen offset:200
+; CHECK-NEXT: buffer_load_dword v16, v1, s[0:3], 0 offen offset:196
+; CHECK-NEXT: buffer_load_dword v17, v1, s[0:3], 0 offen offset:192
+; CHECK-NEXT: buffer_load_dword v18, v1, s[0:3], 0 offen offset:188
+; CHECK-NEXT: buffer_load_dword v19, v1, s[0:3], 0 offen offset:184
+; CHECK-NEXT: buffer_load_dword v20, v1, s[0:3], 0 offen offset:180
+; CHECK-NEXT: buffer_load_dword v21, v1, s[0:3], 0 offen offset:176
+; CHECK-NEXT: buffer_load_dword v22, v1, s[0:3], 0 offen offset:172
+; CHECK-NEXT: buffer_load_dword v23, v1, s[0:3], 0 offen offset:168
+; CHECK-NEXT: buffer_load_dword v24, v1, s[0:3], 0 offen offset:164
+; CHECK-NEXT: buffer_load_dword v25, v1, s[0:3], 0 offen offset:160
+; CHECK-NEXT: buffer_load_dword v26, v1, s[0:3], 0 offen offset:156
+; CHECK-NEXT: buffer_load_dword v27, v1, s[0:3], 0 offen offset:152
+; CHECK-NEXT: buffer_load_dword v28, v1, s[0:3], 0 offen offset:148
+; CHECK-NEXT: buffer_load_dword v29, v1, s[0:3], 0 offen offset:144
+; CHECK-NEXT: buffer_load_dword v30, v1, s[0:3], 0 offen offset:140
+; CHECK-NEXT: buffer_load_dword v31, v1, s[0:3], 0 offen offset:136
+; CHECK-NEXT: buffer_load_dword v32, v1, s[0:3], 0 offen offset:132
+; CHECK-NEXT: buffer_load_dword v33, v1, s[0:3], 0 offen offset:128
+; CHECK-NEXT: buffer_load_dword v34, v1, s[0:3], 0 offen offset:124
+; CHECK-NEXT: buffer_load_dword v35, v1, s[0:3], 0 offen offset:120
+; CHECK-NEXT: buffer_load_dword v36, v1, s[0:3], 0 offen offset:116
+; CHECK-NEXT: buffer_load_dword v37, v1, s[0:3], 0 offen offset:112
+; CHECK-NEXT: buffer_load_dword v38, v1, s[0:3], 0 offen offset:108
+; CHECK-NEXT: buffer_load_dword v39, v1, s[0:3], 0 offen offset:104
+; CHECK-NEXT: buffer_load_dword v48, v1, s[0:3], 0 offen offset:100
+; CHECK-NEXT: buffer_load_dword v49, v1, s[0:3], 0 offen offset:96
+; CHECK-NEXT: buffer_load_dword v50, v1, s[0:3], 0 offen offset:92
+; CHECK-NEXT: buffer_load_dword v51, v1, s[0:3], 0 offen offset:88
+; CHECK-NEXT: buffer_load_dword v52, v1, s[0:3], 0 offen offset:84
+; CHECK-NEXT: buffer_load_dword v53, v1, s[0:3], 0 offen offset:80
+; CHECK-NEXT: buffer_load_dword v54, v1, s[0:3], 0 offen offset:76
+; CHECK-NEXT: buffer_load_dword v55, v1, s[0:3], 0 offen offset:72
+; CHECK-NEXT: buffer_load_dword v64, v1, s[0:3], 0 offen offset:68
+; CHECK-NEXT: buffer_load_dword v65, v1, s[0:3], 0 offen offset:64
+; CHECK-NEXT: buffer_load_dword v66, v1, s[0:3], 0 offen offset:60
+; CHECK-NEXT: buffer_load_dword v67, v1, s[0:3], 0 offen offset:56
+; CHECK-NEXT: buffer_load_dword v68, v1, s[0:3], 0 offen offset:52
+; CHECK-NEXT: buffer_load_dword v69, v1, s[0:3], 0 offen offset:48
+; CHECK-NEXT: buffer_load_dword v70, v1, s[0:3], 0 offen offset:44
+; CHECK-NEXT: buffer_load_dword v71, v1, s[0:3], 0 offen offset:40
+; CHECK-NEXT: buffer_load_dword v80, v1, s[0:3], 0 offen offset:36
+; CHECK-NEXT: buffer_load_dword v81, v1, s[0:3], 0 offen offset:32
+; CHECK-NEXT: buffer_load_dword v82, v1, s[0:3], 0 offen offset:28
+; CHECK-NEXT: buffer_load_dword v83, v1, s[0:3], 0 offen offset:24
+; CHECK-NEXT: buffer_load_dword v84, v1, s[0:3], 0 offen offset:20
+; CHECK-NEXT: buffer_load_dword v85, v1, s[0:3], 0 offen offset:16
+; CHECK-NEXT: buffer_load_dword v86, v1, s[0:3], 0 offen offset:12
+; CHECK-NEXT: buffer_load_dword v87, v1, s[0:3], 0 offen offset:8
+; CHECK-NEXT: buffer_load_dword v96, v1, s[0:3], 0 offen offset:4
+; CHECK-NEXT: buffer_load_dword v97, v1, s[0:3], 0 offen
+; CHECK-NEXT: v_add_nc_u32_e32 v1, 0x100, v1
+; CHECK-NEXT: s_add_u32 s4, s4, -1
+; CHECK-NEXT: s_addc_u32 s5, s5, -1
+; CHECK-NEXT: s_waitcnt vmcnt(62)
+; CHECK-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_store_dword v3, v0, s[0:3], 0 offen offset:248
+; CHECK-NEXT: s_waitcnt vmcnt(61)
+; CHECK-NEXT: buffer_store_dword v4, v0, s[0:3], 0 offen offset:244
+; CHECK-NEXT: s_waitcnt vmcnt(60)
+; CHECK-NEXT: buffer_store_dword v5, v0, s[0:3], 0 offen offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(59)
+; CHECK-NEXT: buffer_store_dword v6, v0, s[0:3], 0 offen offset:236
+; CHECK-NEXT: s_waitcnt vmcnt(58)
+; CHECK-NEXT: buffer_store_dword v7, v0, s[0:3], 0 offen offset:232
+; CHECK-NEXT: s_waitcnt vmcnt(57)
+; CHECK-NEXT: buffer_store_dword v8, v0, s[0:3], 0 offen offset:228
+; CHECK-NEXT: s_waitcnt vmcnt(56)
+; CHECK-NEXT: buffer_store_dword v9, v0, s[0:3], 0 offen offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(55)
+; CHECK-NEXT: buffer_store_dword v10, v0, s[0:3], 0 offen offset:220
+; CHECK-NEXT: s_waitcnt vmcnt(54)
+; CHECK-NEXT: buffer_store_dword v11, v0, s[0:3], 0 offen offset:216
+; CHECK-NEXT: s_waitcnt vmcnt(53)
+; CHECK-NEXT: buffer_store_dword v12, v0, s[0:3], 0 offen offset:212
+; CHECK-NEXT: s_waitcnt vmcnt(52)
+; CHECK-NEXT: buffer_store_dword v13, v0, s[0:3], 0 offen offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(51)
+; CHECK-NEXT: buffer_store_dword v14, v0, s[0:3], 0 offen offset:204
+; CHECK-NEXT: s_waitcnt vmcnt(50)
+; CHECK-NEXT: buffer_store_dword v15, v0, s[0:3], 0 offen offset:200
+; CHECK-NEXT: s_waitcnt vmcnt(49)
+; CHECK-NEXT: buffer_store_dword v16, v0, s[0:3], 0 offen offset:196
+; CHECK-NEXT: s_waitcnt vmcnt(48)
+; CHECK-NEXT: buffer_store_dword v17, v0, s[0:3], 0 offen offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(47)
+; CHECK-NEXT: buffer_store_dword v18, v0, s[0:3], 0 offen offset:188
+; CHECK-NEXT: s_waitcnt vmcnt(46)
+; CHECK-NEXT: buffer_store_dword v19, v0, s[0:3], 0 offen offset:184
+; CHECK-NEXT: s_waitcnt vmcnt(45)
+; CHECK-NEXT: buffer_store_dword v20, v0, s[0:3], 0 offen offset:180
+; CHECK-NEXT: s_waitcnt vmcnt(44)
+; CHECK-NEXT: buffer_store_dword v21, v0, s[0:3], 0 offen offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(43)
+; CHECK-NEXT: buffer_store_dword v22, v0, s[0:3], 0 offen offset:172
+; CHECK-NEXT: s_waitcnt vmcnt(42)
+; CHECK-NEXT: buffer_store_dword v23, v0, s[0:3], 0 offen offset:168
+; CHECK-NEXT: s_waitcnt vmcnt(41)
+; CHECK-NEXT: buffer_store_dword v24, v0, s[0:3], 0 offen offset:164
+; CHECK-NEXT: s_waitcnt vmcnt(40)
+; CHECK-NEXT: buffer_store_dword v25, v0, s[0:3], 0 offen offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(39)
+; CHECK-NEXT: buffer_store_dword v26, v0, s[0:3], 0 offen offset:156
+; CHECK-NEXT: s_waitcnt vmcnt(38)
+; CHECK-NEXT: buffer_store_dword v27, v0, s[0:3], 0 offen offset:152
+; CHECK-NEXT: s_waitcnt vmcnt(37)
+; CHECK-NEXT: buffer_store_dword v28, v0, s[0:3], 0 offen offset:148
+; CHECK-NEXT: s_waitcnt vmcnt(36)
+; CHECK-NEXT: buffer_store_dword v29, v0, s[0:3], 0 offen offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(35)
+; CHECK-NEXT: buffer_store_dword v30, v0, s[0:3], 0 offen offset:140
+; CHECK-NEXT: s_waitcnt vmcnt(34)
+; CHECK-NEXT: buffer_store_dword v31, v0, s[0:3], 0 offen offset:136
+; CHECK-NEXT: s_waitcnt vmcnt(33)
+; CHECK-NEXT: buffer_store_dword v32, v0, s[0:3], 0 offen offset:132
+; CHECK-NEXT: s_waitcnt vmcnt(32)
+; CHECK-NEXT: buffer_store_dword v33, v0, s[0:3], 0 offen offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(31)
+; CHECK-NEXT: buffer_store_dword v34, v0, s[0:3], 0 offen offset:124
+; CHECK-NEXT: s_waitcnt vmcnt(30)
+; CHECK-NEXT: buffer_store_dword v35, v0, s[0:3], 0 offen offset:120
+; CHECK-NEXT: s_waitcnt vmcnt(29)
+; CHECK-NEXT: buffer_store_dword v36, v0, s[0:3], 0 offen offset:116
+; CHECK-NEXT: s_waitcnt vmcnt(28)
+; CHECK-NEXT: buffer_store_dword v37, v0, s[0:3], 0 offen offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(27)
+; CHECK-NEXT: buffer_store_dword v38, v0, s[0:3], 0 offen offset:108
+; CHECK-NEXT: s_waitcnt vmcnt(26)
+; CHECK-NEXT: buffer_store_dword v39, v0, s[0:3], 0 offen offset:104
+; CHECK-NEXT: s_waitcnt vmcnt(25)
+; CHECK-NEXT: buffer_store_dword v48, v0, s[0:3], 0 offen offset:100
+; CHECK-NEXT: s_waitcnt vmcnt(24)
+; CHECK-NEXT: buffer_store_dword v49, v0, s[0:3], 0 offen offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(23)
+; CHECK-NEXT: buffer_store_dword v50, v0, s[0:3], 0 offen offset:92
+; CHECK-NEXT: s_waitcnt vmcnt(22)
+; CHECK-NEXT: buffer_store_dword v51, v0, s[0:3], 0 offen offset:88
+; CHECK-NEXT: s_waitcnt vmcnt(21)
+; CHECK-NEXT: buffer_store_dword v52, v0, s[0:3], 0 offen offset:84
+; CHECK-NEXT: s_waitcnt vmcnt(20)
+; CHECK-NEXT: buffer_store_dword v53, v0, s[0:3], 0 offen offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(19)
+; CHECK-NEXT: buffer_store_dword v54, v0, s[0:3], 0 offen offset:76
+; CHECK-NEXT: s_waitcnt vmcnt(18)
+; CHECK-NEXT: buffer_store_dword v55, v0, s[0:3], 0 offen offset:72
+; CHECK-NEXT: s_waitcnt vmcnt(17)
+; CHECK-NEXT: buffer_store_dword v64, v0, s[0:3], 0 offen offset:68
+; CHECK-NEXT: s_waitcnt vmcnt(16)
+; CHECK-NEXT: buffer_store_dword v65, v0, s[0:3], 0 offen offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: buffer_store_dword v66, v0, s[0:3], 0 offen offset:60
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: buffer_store_dword v67, v0, s[0:3], 0 offen offset:56
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: buffer_store_dword v68, v0, s[0:3], 0 offen offset:52
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: buffer_store_dword v69, v0, s[0:3], 0 offen offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: buffer_store_dword v70, v0, s[0:3], 0 offen offset:44
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: buffer_store_dword v71, v0, s[0:3], 0 offen offset:40
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: buffer_store_dword v80, v0, s[0:3], 0 offen offset:36
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: buffer_store_dword v81, v0, s[0:3], 0 offen offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: buffer_store_dword v82, v0, s[0:3], 0 offen offset:28
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: buffer_store_dword v83, v0, s[0:3], 0 offen offset:24
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: buffer_store_dword v84, v0, s[0:3], 0 offen offset:20
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: buffer_store_dword v85, v0, s[0:3], 0 offen offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: buffer_store_dword v86, v0, s[0:3], 0 offen offset:12
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: buffer_store_dword v87, v0, s[0:3], 0 offen offset:8
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: buffer_store_dword v96, v0, s[0:3], 0 offen offset:4
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: buffer_store_dword v97, v0, s[0:3], 0 offen
+; CHECK-NEXT: v_add_nc_u32_e32 v0, 0x100, v0
+; CHECK-NEXT: s_cmp_lg_u64 s[4:5], 0
+; CHECK-NEXT: s_cbranch_scc1 .LBB8_2
+; CHECK-NEXT: .LBB8_3: ; %Flow18
+; CHECK-NEXT: s_andn2_saveexec_b32 s6, s6
+; CHECK-NEXT: s_cbranch_execz .LBB8_6
+; CHECK-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; CHECK-NEXT: v_add_nc_u32_e32 v0, 0x700, v0
+; CHECK-NEXT: v_add_nc_u32_e32 v1, 0x700, v1
+; CHECK-NEXT: s_mov_b64 s[4:5], -8
+; CHECK-NEXT: .LBB8_5: ; %memmove_bwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0x3e
+; CHECK-NEXT: buffer_load_dword v2, v1, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_load_dword v3, v1, s[0:3], 0 offen offset:248
+; CHECK-NEXT: buffer_load_dword v4, v1, s[0:3], 0 offen offset:244
+; CHECK-NEXT: buffer_load_dword v5, v1, s[0:3], 0 offen offset:240
+; CHECK-NEXT: buffer_load_dword v6, v1, s[0:3], 0 offen offset:236
+; CHECK-NEXT: buffer_load_dword v7, v1, s[0:3], 0 offen offset:232
+; CHECK-NEXT: buffer_load_dword v8, v1, s[0:3], 0 offen offset:228
+; CHECK-NEXT: buffer_load_dword v9, v1, s[0:3], 0 offen offset:224
+; CHECK-NEXT: buffer_load_dword v10, v1, s[0:3], 0 offen offset:220
+; CHECK-NEXT: buffer_load_dword v11, v1, s[0:3], 0 offen offset:216
+; CHECK-NEXT: buffer_load_dword v12, v1, s[0:3], 0 offen offset:212
+; CHECK-NEXT: buffer_load_dword v13, v1, s[0:3], 0 offen offset:208
+; CHECK-NEXT: buffer_load_dword v14, v1, s[0:3], 0 offen offset:204
+; CHECK-NEXT: buffer_load_dword v15, v1, s[0:3], 0 offen offset:200
+; CHECK-NEXT: buffer_load_dword v16, v1, s[0:3], 0 offen offset:196
+; CHECK-NEXT: buffer_load_dword v17, v1, s[0:3], 0 offen offset:192
+; CHECK-NEXT: buffer_load_dword v18, v1, s[0:3], 0 offen offset:188
+; CHECK-NEXT: buffer_load_dword v19, v1, s[0:3], 0 offen offset:184
+; CHECK-NEXT: buffer_load_dword v20, v1, s[0:3], 0 offen offset:180
+; CHECK-NEXT: buffer_load_dword v21, v1, s[0:3], 0 offen offset:176
+; CHECK-NEXT: buffer_load_dword v22, v1, s[0:3], 0 offen offset:172
+; CHECK-NEXT: buffer_load_dword v23, v1, s[0:3], 0 offen offset:168
+; CHECK-NEXT: buffer_load_dword v24, v1, s[0:3], 0 offen offset:164
+; CHECK-NEXT: buffer_load_dword v25, v1, s[0:3], 0 offen offset:160
+; CHECK-NEXT: buffer_load_dword v26, v1, s[0:3], 0 offen offset:156
+; CHECK-NEXT: buffer_load_dword v27, v1, s[0:3], 0 offen offset:152
+; CHECK-NEXT: buffer_load_dword v28, v1, s[0:3], 0 offen offset:148
+; CHECK-NEXT: buffer_load_dword v29, v1, s[0:3], 0 offen offset:144
+; CHECK-NEXT: buffer_load_dword v30, v1, s[0:3], 0 offen offset:140
+; CHECK-NEXT: buffer_load_dword v31, v1, s[0:3], 0 offen offset:136
+; CHECK-NEXT: buffer_load_dword v32, v1, s[0:3], 0 offen offset:132
+; CHECK-NEXT: buffer_load_dword v33, v1, s[0:3], 0 offen offset:128
+; CHECK-NEXT: buffer_load_dword v34, v1, s[0:3], 0 offen offset:124
+; CHECK-NEXT: buffer_load_dword v35, v1, s[0:3], 0 offen offset:120
+; CHECK-NEXT: buffer_load_dword v36, v1, s[0:3], 0 offen offset:116
+; CHECK-NEXT: buffer_load_dword v37, v1, s[0:3], 0 offen offset:112
+; CHECK-NEXT: buffer_load_dword v38, v1, s[0:3], 0 offen offset:108
+; CHECK-NEXT: buffer_load_dword v39, v1, s[0:3], 0 offen offset:104
+; CHECK-NEXT: buffer_load_dword v48, v1, s[0:3], 0 offen offset:100
+; CHECK-NEXT: buffer_load_dword v49, v1, s[0:3], 0 offen offset:96
+; CHECK-NEXT: buffer_load_dword v50, v1, s[0:3], 0 offen offset:92
+; CHECK-NEXT: buffer_load_dword v51, v1, s[0:3], 0 offen offset:88
+; CHECK-NEXT: buffer_load_dword v52, v1, s[0:3], 0 offen offset:84
+; CHECK-NEXT: buffer_load_dword v53, v1, s[0:3], 0 offen offset:80
+; CHECK-NEXT: buffer_load_dword v54, v1, s[0:3], 0 offen offset:76
+; CHECK-NEXT: buffer_load_dword v55, v1, s[0:3], 0 offen offset:72
+; CHECK-NEXT: buffer_load_dword v64, v1, s[0:3], 0 offen offset:68
+; CHECK-NEXT: buffer_load_dword v65, v1, s[0:3], 0 offen offset:64
+; CHECK-NEXT: buffer_load_dword v66, v1, s[0:3], 0 offen offset:60
+; CHECK-NEXT: buffer_load_dword v67, v1, s[0:3], 0 offen offset:56
+; CHECK-NEXT: buffer_load_dword v68, v1, s[0:3], 0 offen offset:52
+; CHECK-NEXT: buffer_load_dword v69, v1, s[0:3], 0 offen offset:48
+; CHECK-NEXT: buffer_load_dword v70, v1, s[0:3], 0 offen offset:44
+; CHECK-NEXT: buffer_load_dword v71, v1, s[0:3], 0 offen offset:40
+; CHECK-NEXT: buffer_load_dword v80, v1, s[0:3], 0 offen offset:36
+; CHECK-NEXT: buffer_load_dword v81, v1, s[0:3], 0 offen offset:32
+; CHECK-NEXT: buffer_load_dword v82, v1, s[0:3], 0 offen offset:28
+; CHECK-NEXT: buffer_load_dword v83, v1, s[0:3], 0 offen offset:24
+; CHECK-NEXT: buffer_load_dword v84, v1, s[0:3], 0 offen offset:20
+; CHECK-NEXT: buffer_load_dword v85, v1, s[0:3], 0 offen offset:16
+; CHECK-NEXT: buffer_load_dword v86, v1, s[0:3], 0 offen offset:12
+; CHECK-NEXT: buffer_load_dword v87, v1, s[0:3], 0 offen offset:8
+; CHECK-NEXT: buffer_load_dword v96, v1, s[0:3], 0 offen offset:4
+; CHECK-NEXT: buffer_load_dword v97, v1, s[0:3], 0 offen
+; CHECK-NEXT: v_add_nc_u32_e32 v1, 0xffffff00, v1
+; CHECK-NEXT: s_add_u32 s4, s4, 1
+; CHECK-NEXT: s_addc_u32 s5, s5, 0
+; CHECK-NEXT: s_waitcnt vmcnt(62)
+; CHECK-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_store_dword v3, v0, s[0:3], 0 offen offset:248
+; CHECK-NEXT: s_waitcnt vmcnt(61)
+; CHECK-NEXT: buffer_store_dword v4, v0, s[0:3], 0 offen offset:244
+; CHECK-NEXT: s_waitcnt vmcnt(60)
+; CHECK-NEXT: buffer_store_dword v5, v0, s[0:3], 0 offen offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(59)
+; CHECK-NEXT: buffer_store_dword v6, v0, s[0:3], 0 offen offset:236
+; CHECK-NEXT: s_waitcnt vmcnt(58)
+; CHECK-NEXT: buffer_store_dword v7, v0, s[0:3], 0 offen offset:232
+; CHECK-NEXT: s_waitcnt vmcnt(57)
+; CHECK-NEXT: buffer_store_dword v8, v0, s[0:3], 0 offen offset:228
+; CHECK-NEXT: s_waitcnt vmcnt(56)
+; CHECK-NEXT: buffer_store_dword v9, v0, s[0:3], 0 offen offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(55)
+; CHECK-NEXT: buffer_store_dword v10, v0, s[0:3], 0 offen offset:220
+; CHECK-NEXT: s_waitcnt vmcnt(54)
+; CHECK-NEXT: buffer_store_dword v11, v0, s[0:3], 0 offen offset:216
+; CHECK-NEXT: s_waitcnt vmcnt(53)
+; CHECK-NEXT: buffer_store_dword v12, v0, s[0:3], 0 offen offset:212
+; CHECK-NEXT: s_waitcnt vmcnt(52)
+; CHECK-NEXT: buffer_store_dword v13, v0, s[0:3], 0 offen offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(51)
+; CHECK-NEXT: buffer_store_dword v14, v0, s[0:3], 0 offen offset:204
+; CHECK-NEXT: s_waitcnt vmcnt(50)
+; CHECK-NEXT: buffer_store_dword v15, v0, s[0:3], 0 offen offset:200
+; CHECK-NEXT: s_waitcnt vmcnt(49)
+; CHECK-NEXT: buffer_store_dword v16, v0, s[0:3], 0 offen offset:196
+; CHECK-NEXT: s_waitcnt vmcnt(48)
+; CHECK-NEXT: buffer_store_dword v17, v0, s[0:3], 0 offen offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(47)
+; CHECK-NEXT: buffer_store_dword v18, v0, s[0:3], 0 offen offset:188
+; CHECK-NEXT: s_waitcnt vmcnt(46)
+; CHECK-NEXT: buffer_store_dword v19, v0, s[0:3], 0 offen offset:184
+; CHECK-NEXT: s_waitcnt vmcnt(45)
+; CHECK-NEXT: buffer_store_dword v20, v0, s[0:3], 0 offen offset:180
+; CHECK-NEXT: s_waitcnt vmcnt(44)
+; CHECK-NEXT: buffer_store_dword v21, v0, s[0:3], 0 offen offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(43)
+; CHECK-NEXT: buffer_store_dword v22, v0, s[0:3], 0 offen offset:172
+; CHECK-NEXT: s_waitcnt vmcnt(42)
+; CHECK-NEXT: buffer_store_dword v23, v0, s[0:3], 0 offen offset:168
+; CHECK-NEXT: s_waitcnt vmcnt(41)
+; CHECK-NEXT: buffer_store_dword v24, v0, s[0:3], 0 offen offset:164
+; CHECK-NEXT: s_waitcnt vmcnt(40)
+; CHECK-NEXT: buffer_store_dword v25, v0, s[0:3], 0 offen offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(39)
+; CHECK-NEXT: buffer_store_dword v26, v0, s[0:3], 0 offen offset:156
+; CHECK-NEXT: s_waitcnt vmcnt(38)
+; CHECK-NEXT: buffer_store_dword v27, v0, s[0:3], 0 offen offset:152
+; CHECK-NEXT: s_waitcnt vmcnt(37)
+; CHECK-NEXT: buffer_store_dword v28, v0, s[0:3], 0 offen offset:148
+; CHECK-NEXT: s_waitcnt vmcnt(36)
+; CHECK-NEXT: buffer_store_dword v29, v0, s[0:3], 0 offen offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(35)
+; CHECK-NEXT: buffer_store_dword v30, v0, s[0:3], 0 offen offset:140
+; CHECK-NEXT: s_waitcnt vmcnt(34)
+; CHECK-NEXT: buffer_store_dword v31, v0, s[0:3], 0 offen offset:136
+; CHECK-NEXT: s_waitcnt vmcnt(33)
+; CHECK-NEXT: buffer_store_dword v32, v0, s[0:3], 0 offen offset:132
+; CHECK-NEXT: s_waitcnt vmcnt(32)
+; CHECK-NEXT: buffer_store_dword v33, v0, s[0:3], 0 offen offset:128
+; CHECK-NEXT: s_waitcnt vmcnt(31)
+; CHECK-NEXT: buffer_store_dword v34, v0, s[0:3], 0 offen offset:124
+; CHECK-NEXT: s_waitcnt vmcnt(30)
+; CHECK-NEXT: buffer_store_dword v35, v0, s[0:3], 0 offen offset:120
+; CHECK-NEXT: s_waitcnt vmcnt(29)
+; CHECK-NEXT: buffer_store_dword v36, v0, s[0:3], 0 offen offset:116
+; CHECK-NEXT: s_waitcnt vmcnt(28)
+; CHECK-NEXT: buffer_store_dword v37, v0, s[0:3], 0 offen offset:112
+; CHECK-NEXT: s_waitcnt vmcnt(27)
+; CHECK-NEXT: buffer_store_dword v38, v0, s[0:3], 0 offen offset:108
+; CHECK-NEXT: s_waitcnt vmcnt(26)
+; CHECK-NEXT: buffer_store_dword v39, v0, s[0:3], 0 offen offset:104
+; CHECK-NEXT: s_waitcnt vmcnt(25)
+; CHECK-NEXT: buffer_store_dword v48, v0, s[0:3], 0 offen offset:100
+; CHECK-NEXT: s_waitcnt vmcnt(24)
+; CHECK-NEXT: buffer_store_dword v49, v0, s[0:3], 0 offen offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(23)
+; CHECK-NEXT: buffer_store_dword v50, v0, s[0:3], 0 offen offset:92
+; CHECK-NEXT: s_waitcnt vmcnt(22)
+; CHECK-NEXT: buffer_store_dword v51, v0, s[0:3], 0 offen offset:88
+; CHECK-NEXT: s_waitcnt vmcnt(21)
+; CHECK-NEXT: buffer_store_dword v52, v0, s[0:3], 0 offen offset:84
+; CHECK-NEXT: s_waitcnt vmcnt(20)
+; CHECK-NEXT: buffer_store_dword v53, v0, s[0:3], 0 offen offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(19)
+; CHECK-NEXT: buffer_store_dword v54, v0, s[0:3], 0 offen offset:76
+; CHECK-NEXT: s_waitcnt vmcnt(18)
+; CHECK-NEXT: buffer_store_dword v55, v0, s[0:3], 0 offen offset:72
+; CHECK-NEXT: s_waitcnt vmcnt(17)
+; CHECK-NEXT: buffer_store_dword v64, v0, s[0:3], 0 offen offset:68
+; CHECK-NEXT: s_waitcnt vmcnt(16)
+; CHECK-NEXT: buffer_store_dword v65, v0, s[0:3], 0 offen offset:64
+; CHECK-NEXT: s_waitcnt vmcnt(15)
+; CHECK-NEXT: buffer_store_dword v66, v0, s[0:3], 0 offen offset:60
+; CHECK-NEXT: s_waitcnt vmcnt(14)
+; CHECK-NEXT: buffer_store_dword v67, v0, s[0:3], 0 offen offset:56
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: buffer_store_dword v68, v0, s[0:3], 0 offen offset:52
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: buffer_store_dword v69, v0, s[0:3], 0 offen offset:48
+; CHECK-NEXT: s_waitcnt vmcnt(11)
+; CHECK-NEXT: buffer_store_dword v70, v0, s[0:3], 0 offen offset:44
+; CHECK-NEXT: s_waitcnt vmcnt(10)
+; CHECK-NEXT: buffer_store_dword v71, v0, s[0:3], 0 offen offset:40
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: buffer_store_dword v80, v0, s[0:3], 0 offen offset:36
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: buffer_store_dword v81, v0, s[0:3], 0 offen offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(7)
+; CHECK-NEXT: buffer_store_dword v82, v0, s[0:3], 0 offen offset:28
+; CHECK-NEXT: s_waitcnt vmcnt(6)
+; CHECK-NEXT: buffer_store_dword v83, v0, s[0:3], 0 offen offset:24
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: buffer_store_dword v84, v0, s[0:3], 0 offen offset:20
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: buffer_store_dword v85, v0, s[0:3], 0 offen offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(3)
+; CHECK-NEXT: buffer_store_dword v86, v0, s[0:3], 0 offen offset:12
+; CHECK-NEXT: s_waitcnt vmcnt(2)
+; CHECK-NEXT: buffer_store_dword v87, v0, s[0:3], 0 offen offset:8
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: buffer_store_dword v96, v0, s[0:3], 0 offen offset:4
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: buffer_store_dword v97, v0, s[0:3], 0 offen
+; CHECK-NEXT: v_add_nc_u32_e32 v0, 0xffffff00, v0
+; CHECK-NEXT: s_cmp_eq_u64 s[4:5], 0
+; CHECK-NEXT: s_cbranch_scc0 .LBB8_5
+; CHECK-NEXT: .LBB8_6: ; %Flow19
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s6
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memmove_p5_p5_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: s_mov_b32 s4, exec_lo
+; ALIGNED-NEXT: v_cmpx_ge_u32_e64 v1, v0
+; ALIGNED-NEXT: s_xor_b32 s6, exec_lo, s4
+; ALIGNED-NEXT: s_cbranch_execz .LBB8_3
+; ALIGNED-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0x80
+; ALIGNED-NEXT: .LBB8_2: ; %memmove_fwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: buffer_load_ubyte v2, v1, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: buffer_load_ubyte v3, v1, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: buffer_load_ubyte v4, v1, s[0:3], 0 offen
+; ALIGNED-NEXT: buffer_load_ubyte v5, v1, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: buffer_load_ubyte v6, v1, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: buffer_load_ubyte v7, v1, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: buffer_load_ubyte v8, v1, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: buffer_load_ubyte v9, v1, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: buffer_load_ubyte v10, v1, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: buffer_load_ubyte v11, v1, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: buffer_load_ubyte v12, v1, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: buffer_load_ubyte v13, v1, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: buffer_load_ubyte v14, v1, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: buffer_load_ubyte v15, v1, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: buffer_load_ubyte v16, v1, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: buffer_load_ubyte v17, v1, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: v_add_nc_u32_e32 v1, 16, v1
+; ALIGNED-NEXT: s_add_u32 s4, s4, -1
+; ALIGNED-NEXT: s_addc_u32 s5, s5, -1
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: buffer_store_byte v2, v0, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: buffer_store_byte v3, v0, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: buffer_store_byte v4, v0, s[0:3], 0 offen
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: buffer_store_byte v5, v0, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: buffer_store_byte v6, v0, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: buffer_store_byte v7, v0, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: buffer_store_byte v8, v0, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: buffer_store_byte v9, v0, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: buffer_store_byte v10, v0, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: buffer_store_byte v11, v0, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: buffer_store_byte v12, v0, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: buffer_store_byte v13, v0, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: buffer_store_byte v14, v0, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: buffer_store_byte v15, v0, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: buffer_store_byte v16, v0, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: buffer_store_byte v17, v0, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: v_add_nc_u32_e32 v0, 16, v0
+; ALIGNED-NEXT: s_cmp_lg_u64 s[4:5], 0
+; ALIGNED-NEXT: s_cbranch_scc1 .LBB8_2
+; ALIGNED-NEXT: .LBB8_3: ; %Flow18
+; ALIGNED-NEXT: s_andn2_saveexec_b32 s6, s6
+; ALIGNED-NEXT: s_cbranch_execz .LBB8_6
+; ALIGNED-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; ALIGNED-NEXT: v_add_nc_u32_e32 v0, 0x7f0, v0
+; ALIGNED-NEXT: v_add_nc_u32_e32 v1, 0x7f0, v1
+; ALIGNED-NEXT: s_movk_i32 s4, 0xff80
+; ALIGNED-NEXT: s_mov_b32 s5, -1
+; ALIGNED-NEXT: .LBB8_5: ; %memmove_bwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: buffer_load_ubyte v2, v1, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: buffer_load_ubyte v3, v1, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: buffer_load_ubyte v4, v1, s[0:3], 0 offen
+; ALIGNED-NEXT: buffer_load_ubyte v5, v1, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: buffer_load_ubyte v6, v1, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: buffer_load_ubyte v7, v1, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: buffer_load_ubyte v8, v1, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: buffer_load_ubyte v9, v1, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: buffer_load_ubyte v10, v1, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: buffer_load_ubyte v11, v1, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: buffer_load_ubyte v12, v1, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: buffer_load_ubyte v13, v1, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: buffer_load_ubyte v14, v1, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: buffer_load_ubyte v15, v1, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: buffer_load_ubyte v16, v1, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: buffer_load_ubyte v17, v1, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: v_add_nc_u32_e32 v1, -16, v1
+; ALIGNED-NEXT: s_add_u32 s4, s4, 1
+; ALIGNED-NEXT: s_addc_u32 s5, s5, 0
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: buffer_store_byte v2, v0, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: buffer_store_byte v3, v0, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: buffer_store_byte v4, v0, s[0:3], 0 offen
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: buffer_store_byte v5, v0, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: buffer_store_byte v6, v0, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: buffer_store_byte v7, v0, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: buffer_store_byte v8, v0, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: buffer_store_byte v9, v0, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: buffer_store_byte v10, v0, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: buffer_store_byte v11, v0, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: buffer_store_byte v12, v0, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: buffer_store_byte v13, v0, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: buffer_store_byte v14, v0, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: buffer_store_byte v15, v0, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: buffer_store_byte v16, v0, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: buffer_store_byte v17, v0, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: v_add_nc_u32_e32 v0, -16, v0
+; ALIGNED-NEXT: s_cmp_eq_u64 s[4:5], 0
+; ALIGNED-NEXT: s_cbranch_scc0 .LBB8_5
+; ALIGNED-NEXT: .LBB8_6: ; %Flow19
+; ALIGNED-NEXT: s_or_b32 exec_lo, exec_lo, s6
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memmove.p5.p5.i64(ptr addrspace(5) noundef nonnull align 1 %dst, ptr addrspace(5) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+define void @memmove_p0_p5_sz2048(ptr addrspace(0) align 1 %dst, ptr addrspace(5) align 1 readonly %src) {
+; CHECK-LABEL: memmove_p0_p5_sz2048:
+; CHECK: ; %bb.0: ; %entry
+; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT: v_cmp_ne_u32_e32 vcc_lo, -1, v2
+; CHECK-NEXT: s_mov_b64 s[4:5], src_private_base
+; CHECK-NEXT: s_mov_b32 s4, exec_lo
+; CHECK-NEXT: v_cndmask_b32_e64 v4, 0, s5, vcc_lo
+; CHECK-NEXT: v_cndmask_b32_e32 v3, 0, v2, vcc_lo
+; CHECK-NEXT: v_cmpx_ge_u64_e64 v[3:4], v[0:1]
+; CHECK-NEXT: s_xor_b32 s6, exec_lo, s4
+; CHECK-NEXT: s_cbranch_execz .LBB9_3
+; CHECK-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; CHECK-NEXT: s_mov_b64 s[4:5], 0
+; CHECK-NEXT: .LBB9_2: ; %memmove_fwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0x3e
+; CHECK-NEXT: buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT: buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT: buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT: buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT: buffer_load_dword v7, v2, s[0:3], 0 offen offset:32
+; CHECK-NEXT: buffer_load_dword v8, v2, s[0:3], 0 offen offset:36
+; CHECK-NEXT: buffer_load_dword v9, v2, s[0:3], 0 offen offset:40
+; CHECK-NEXT: buffer_load_dword v10, v2, s[0:3], 0 offen offset:44
+; CHECK-NEXT: buffer_load_dword v11, v2, s[0:3], 0 offen offset:48
+; CHECK-NEXT: buffer_load_dword v12, v2, s[0:3], 0 offen offset:52
+; CHECK-NEXT: buffer_load_dword v13, v2, s[0:3], 0 offen offset:56
+; CHECK-NEXT: buffer_load_dword v14, v2, s[0:3], 0 offen offset:60
+; CHECK-NEXT: buffer_load_dword v18, v2, s[0:3], 0 offen offset:92
+; CHECK-NEXT: buffer_load_dword v17, v2, s[0:3], 0 offen offset:88
+; CHECK-NEXT: buffer_load_dword v16, v2, s[0:3], 0 offen offset:84
+; CHECK-NEXT: buffer_load_dword v15, v2, s[0:3], 0 offen offset:80
+; CHECK-NEXT: buffer_load_dword v22, v2, s[0:3], 0 offen offset:124
+; CHECK-NEXT: buffer_load_dword v21, v2, s[0:3], 0 offen offset:120
+; CHECK-NEXT: buffer_load_dword v20, v2, s[0:3], 0 offen offset:116
+; CHECK-NEXT: buffer_load_dword v19, v2, s[0:3], 0 offen offset:112
+; CHECK-NEXT: buffer_load_dword v26, v2, s[0:3], 0 offen offset:108
+; CHECK-NEXT: buffer_load_dword v25, v2, s[0:3], 0 offen offset:104
+; CHECK-NEXT: buffer_load_dword v24, v2, s[0:3], 0 offen offset:100
+; CHECK-NEXT: buffer_load_dword v23, v2, s[0:3], 0 offen offset:96
+; CHECK-NEXT: buffer_load_dword v30, v2, s[0:3], 0 offen offset:156
+; CHECK-NEXT: buffer_load_dword v29, v2, s[0:3], 0 offen offset:152
+; CHECK-NEXT: buffer_load_dword v28, v2, s[0:3], 0 offen offset:148
+; CHECK-NEXT: buffer_load_dword v27, v2, s[0:3], 0 offen offset:144
+; CHECK-NEXT: buffer_load_dword v34, v2, s[0:3], 0 offen offset:188
+; CHECK-NEXT: buffer_load_dword v33, v2, s[0:3], 0 offen offset:184
+; CHECK-NEXT: buffer_load_dword v32, v2, s[0:3], 0 offen offset:180
+; CHECK-NEXT: buffer_load_dword v31, v2, s[0:3], 0 offen offset:176
+; CHECK-NEXT: buffer_load_dword v38, v2, s[0:3], 0 offen offset:172
+; CHECK-NEXT: buffer_load_dword v37, v2, s[0:3], 0 offen offset:168
+; CHECK-NEXT: buffer_load_dword v36, v2, s[0:3], 0 offen offset:164
+; CHECK-NEXT: buffer_load_dword v35, v2, s[0:3], 0 offen offset:160
+; CHECK-NEXT: buffer_load_dword v51, v2, s[0:3], 0 offen offset:220
+; CHECK-NEXT: buffer_load_dword v50, v2, s[0:3], 0 offen offset:216
+; CHECK-NEXT: buffer_load_dword v49, v2, s[0:3], 0 offen offset:212
+; CHECK-NEXT: buffer_load_dword v48, v2, s[0:3], 0 offen offset:208
+; CHECK-NEXT: buffer_load_dword v55, v2, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_load_dword v54, v2, s[0:3], 0 offen offset:248
+; CHECK-NEXT: buffer_load_dword v53, v2, s[0:3], 0 offen offset:244
+; CHECK-NEXT: buffer_load_dword v52, v2, s[0:3], 0 offen offset:240
+; CHECK-NEXT: buffer_load_dword v67, v2, s[0:3], 0 offen offset:236
+; CHECK-NEXT: buffer_load_dword v66, v2, s[0:3], 0 offen offset:232
+; CHECK-NEXT: buffer_load_dword v65, v2, s[0:3], 0 offen offset:228
+; CHECK-NEXT: buffer_load_dword v64, v2, s[0:3], 0 offen offset:224
+; CHECK-NEXT: buffer_load_dword v71, v2, s[0:3], 0 offen offset:204
+; CHECK-NEXT: buffer_load_dword v70, v2, s[0:3], 0 offen offset:200
+; CHECK-NEXT: buffer_load_dword v69, v2, s[0:3], 0 offen offset:196
+; CHECK-NEXT: buffer_load_dword v68, v2, s[0:3], 0 offen offset:192
+; CHECK-NEXT: buffer_load_dword v83, v2, s[0:3], 0 offen offset:140
+; CHECK-NEXT: buffer_load_dword v82, v2, s[0:3], 0 offen offset:136
+; CHECK-NEXT: buffer_load_dword v81, v2, s[0:3], 0 offen offset:132
+; CHECK-NEXT: buffer_load_dword v80, v2, s[0:3], 0 offen offset:128
+; CHECK-NEXT: buffer_load_dword v87, v2, s[0:3], 0 offen offset:76
+; CHECK-NEXT: buffer_load_dword v86, v2, s[0:3], 0 offen offset:72
+; CHECK-NEXT: buffer_load_dword v85, v2, s[0:3], 0 offen offset:68
+; CHECK-NEXT: buffer_load_dword v84, v2, s[0:3], 0 offen offset:64
+; CHECK-NEXT: buffer_load_dword v96, v2, s[0:3], 0 offen
+; CHECK-NEXT: buffer_load_dword v97, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT: buffer_load_dword v98, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT: buffer_load_dword v99, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: v_add_nc_u32_e32 v2, 0x100, v2
+; CHECK-NEXT: s_add_u32 s4, s4, 0x100
+; CHECK-NEXT: s_addc_u32 s5, s5, 0
+; CHECK-NEXT: s_waitcnt vmcnt(20)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[52:55] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(16)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[64:67] offset:224
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[48:51] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(12)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[68:71] offset:192
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[31:34] offset:176
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[35:38] offset:160
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[27:30] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(8)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[80:83] offset:128
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[19:22] offset:112
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[23:26] offset:96
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[15:18] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(4)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[84:87] offset:64
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[11:14] offset:48
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[7:10] offset:32
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[3:6] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[96:99]
+; CHECK-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; CHECK-NEXT: s_cbranch_scc1 .LBB9_2
+; CHECK-NEXT: .LBB9_3: ; %Flow13
+; CHECK-NEXT: s_andn2_saveexec_b32 s8, s6
+; CHECK-NEXT: s_cbranch_execz .LBB9_6
+; CHECK-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; CHECK-NEXT: v_add_nc_u32_e32 v2, 0x700, v2
+; CHECK-NEXT: s_movk_i32 s6, 0xff00
+; CHECK-NEXT: s_mov_b64 s[4:5], 0x700
+; CHECK-NEXT: s_mov_b32 s7, -1
+; CHECK-NEXT: .LBB9_5: ; %memmove_bwd_loop
+; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
+; CHECK-NEXT: s_clause 0x3e
+; CHECK-NEXT: buffer_load_dword v4, v2, s[0:3], 0 offen offset:20
+; CHECK-NEXT: buffer_load_dword v5, v2, s[0:3], 0 offen offset:24
+; CHECK-NEXT: buffer_load_dword v6, v2, s[0:3], 0 offen offset:28
+; CHECK-NEXT: buffer_load_dword v7, v2, s[0:3], 0 offen offset:32
+; CHECK-NEXT: buffer_load_dword v8, v2, s[0:3], 0 offen offset:36
+; CHECK-NEXT: buffer_load_dword v9, v2, s[0:3], 0 offen offset:40
+; CHECK-NEXT: buffer_load_dword v10, v2, s[0:3], 0 offen offset:44
+; CHECK-NEXT: buffer_load_dword v11, v2, s[0:3], 0 offen offset:48
+; CHECK-NEXT: buffer_load_dword v12, v2, s[0:3], 0 offen offset:52
+; CHECK-NEXT: buffer_load_dword v13, v2, s[0:3], 0 offen offset:56
+; CHECK-NEXT: buffer_load_dword v14, v2, s[0:3], 0 offen offset:60
+; CHECK-NEXT: buffer_load_dword v18, v2, s[0:3], 0 offen offset:124
+; CHECK-NEXT: buffer_load_dword v17, v2, s[0:3], 0 offen offset:120
+; CHECK-NEXT: buffer_load_dword v16, v2, s[0:3], 0 offen offset:116
+; CHECK-NEXT: buffer_load_dword v15, v2, s[0:3], 0 offen offset:112
+; CHECK-NEXT: buffer_load_dword v22, v2, s[0:3], 0 offen offset:108
+; CHECK-NEXT: buffer_load_dword v21, v2, s[0:3], 0 offen offset:104
+; CHECK-NEXT: buffer_load_dword v20, v2, s[0:3], 0 offen offset:100
+; CHECK-NEXT: buffer_load_dword v19, v2, s[0:3], 0 offen offset:96
+; CHECK-NEXT: buffer_load_dword v26, v2, s[0:3], 0 offen offset:252
+; CHECK-NEXT: buffer_load_dword v25, v2, s[0:3], 0 offen offset:248
+; CHECK-NEXT: buffer_load_dword v24, v2, s[0:3], 0 offen offset:244
+; CHECK-NEXT: buffer_load_dword v23, v2, s[0:3], 0 offen offset:240
+; CHECK-NEXT: buffer_load_dword v30, v2, s[0:3], 0 offen offset:236
+; CHECK-NEXT: buffer_load_dword v29, v2, s[0:3], 0 offen offset:232
+; CHECK-NEXT: buffer_load_dword v28, v2, s[0:3], 0 offen offset:228
+; CHECK-NEXT: buffer_load_dword v27, v2, s[0:3], 0 offen offset:224
+; CHECK-NEXT: buffer_load_dword v34, v2, s[0:3], 0 offen offset:220
+; CHECK-NEXT: buffer_load_dword v33, v2, s[0:3], 0 offen offset:216
+; CHECK-NEXT: buffer_load_dword v32, v2, s[0:3], 0 offen offset:212
+; CHECK-NEXT: buffer_load_dword v31, v2, s[0:3], 0 offen offset:208
+; CHECK-NEXT: buffer_load_dword v38, v2, s[0:3], 0 offen offset:204
+; CHECK-NEXT: buffer_load_dword v37, v2, s[0:3], 0 offen offset:200
+; CHECK-NEXT: buffer_load_dword v36, v2, s[0:3], 0 offen offset:196
+; CHECK-NEXT: buffer_load_dword v35, v2, s[0:3], 0 offen offset:192
+; CHECK-NEXT: buffer_load_dword v51, v2, s[0:3], 0 offen offset:188
+; CHECK-NEXT: buffer_load_dword v50, v2, s[0:3], 0 offen offset:184
+; CHECK-NEXT: buffer_load_dword v49, v2, s[0:3], 0 offen offset:180
+; CHECK-NEXT: buffer_load_dword v48, v2, s[0:3], 0 offen offset:176
+; CHECK-NEXT: buffer_load_dword v55, v2, s[0:3], 0 offen offset:172
+; CHECK-NEXT: buffer_load_dword v54, v2, s[0:3], 0 offen offset:168
+; CHECK-NEXT: buffer_load_dword v53, v2, s[0:3], 0 offen offset:164
+; CHECK-NEXT: buffer_load_dword v52, v2, s[0:3], 0 offen offset:160
+; CHECK-NEXT: buffer_load_dword v67, v2, s[0:3], 0 offen offset:156
+; CHECK-NEXT: buffer_load_dword v66, v2, s[0:3], 0 offen offset:152
+; CHECK-NEXT: buffer_load_dword v65, v2, s[0:3], 0 offen offset:148
+; CHECK-NEXT: buffer_load_dword v64, v2, s[0:3], 0 offen offset:144
+; CHECK-NEXT: buffer_load_dword v71, v2, s[0:3], 0 offen offset:140
+; CHECK-NEXT: buffer_load_dword v70, v2, s[0:3], 0 offen offset:136
+; CHECK-NEXT: buffer_load_dword v69, v2, s[0:3], 0 offen offset:132
+; CHECK-NEXT: buffer_load_dword v68, v2, s[0:3], 0 offen offset:128
+; CHECK-NEXT: buffer_load_dword v83, v2, s[0:3], 0 offen offset:92
+; CHECK-NEXT: buffer_load_dword v82, v2, s[0:3], 0 offen offset:88
+; CHECK-NEXT: buffer_load_dword v81, v2, s[0:3], 0 offen offset:84
+; CHECK-NEXT: buffer_load_dword v80, v2, s[0:3], 0 offen offset:80
+; CHECK-NEXT: buffer_load_dword v87, v2, s[0:3], 0 offen offset:76
+; CHECK-NEXT: buffer_load_dword v86, v2, s[0:3], 0 offen offset:72
+; CHECK-NEXT: buffer_load_dword v85, v2, s[0:3], 0 offen offset:68
+; CHECK-NEXT: buffer_load_dword v84, v2, s[0:3], 0 offen offset:64
+; CHECK-NEXT: buffer_load_dword v96, v2, s[0:3], 0 offen
+; CHECK-NEXT: buffer_load_dword v97, v2, s[0:3], 0 offen offset:4
+; CHECK-NEXT: buffer_load_dword v98, v2, s[0:3], 0 offen offset:8
+; CHECK-NEXT: buffer_load_dword v3, v2, s[0:3], 0 offen offset:16
+; CHECK-NEXT: buffer_load_dword v99, v2, s[0:3], 0 offen offset:12
+; CHECK-NEXT: v_add_co_u32 v100, vcc_lo, v0, s4
+; CHECK-NEXT: v_add_co_ci_u32_e32 v101, vcc_lo, s5, v1, vcc_lo
+; CHECK-NEXT: v_add_nc_u32_e32 v2, 0xffffff00, v2
+; CHECK-NEXT: s_add_u32 s4, s4, 0xffffff00
+; CHECK-NEXT: s_addc_u32 s5, s5, -1
+; CHECK-NEXT: s_waitcnt vmcnt(41)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[23:26] offset:240
+; CHECK-NEXT: s_waitcnt vmcnt(37)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[27:30] offset:224
+; CHECK-NEXT: s_waitcnt vmcnt(33)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[31:34] offset:208
+; CHECK-NEXT: s_waitcnt vmcnt(29)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[35:38] offset:192
+; CHECK-NEXT: s_waitcnt vmcnt(25)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[48:51] offset:176
+; CHECK-NEXT: s_waitcnt vmcnt(21)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[52:55] offset:160
+; CHECK-NEXT: s_waitcnt vmcnt(17)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[64:67] offset:144
+; CHECK-NEXT: s_waitcnt vmcnt(13)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[68:71] offset:128
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[15:18] offset:112
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[19:22] offset:96
+; CHECK-NEXT: s_waitcnt vmcnt(9)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[80:83] offset:80
+; CHECK-NEXT: s_waitcnt vmcnt(5)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[84:87] offset:64
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[11:14] offset:48
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[7:10] offset:32
+; CHECK-NEXT: s_waitcnt vmcnt(1)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[3:6] offset:16
+; CHECK-NEXT: s_waitcnt vmcnt(0)
+; CHECK-NEXT: flat_store_dwordx4 v[100:101], v[96:99]
+; CHECK-NEXT: s_cmp_eq_u64 s[4:5], s[6:7]
+; CHECK-NEXT: s_cbranch_scc0 .LBB9_5
+; CHECK-NEXT: .LBB9_6: ; %Flow14
+; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
+; CHECK-NEXT: s_waitcnt lgkmcnt(0)
+; CHECK-NEXT: s_setpc_b64 s[30:31]
+;
+; ALIGNED-LABEL: memmove_p0_p5_sz2048:
+; ALIGNED: ; %bb.0: ; %entry
+; ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; ALIGNED-NEXT: v_cmp_ne_u32_e32 vcc_lo, -1, v2
+; ALIGNED-NEXT: s_mov_b64 s[4:5], src_private_base
+; ALIGNED-NEXT: s_mov_b32 s4, exec_lo
+; ALIGNED-NEXT: v_cndmask_b32_e64 v4, 0, s5, vcc_lo
+; ALIGNED-NEXT: v_cndmask_b32_e32 v3, 0, v2, vcc_lo
+; ALIGNED-NEXT: v_cmpx_ge_u64_e64 v[3:4], v[0:1]
+; ALIGNED-NEXT: s_xor_b32 s6, exec_lo, s4
+; ALIGNED-NEXT: s_cbranch_execz .LBB9_3
+; ALIGNED-NEXT: ; %bb.1: ; %memmove_fwd_loop.preheader
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0
+; ALIGNED-NEXT: .LBB9_2: ; %memmove_fwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: buffer_load_ubyte v7, v2, s[0:3], 0 offen
+; ALIGNED-NEXT: buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: buffer_load_ubyte v9, v2, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: buffer_load_ubyte v12, v2, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: buffer_load_ubyte v13, v2, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: buffer_load_ubyte v14, v2, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: buffer_load_ubyte v15, v2, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: buffer_load_ubyte v16, v2, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: buffer_load_ubyte v17, v2, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: buffer_load_ubyte v18, v2, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: buffer_load_ubyte v19, v2, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: buffer_load_ubyte v20, v2, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: v_add_co_u32 v3, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v4, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: v_add_nc_u32_e32 v2, 16, v2
+; ALIGNED-NEXT: s_add_u32 s4, s4, 16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, 0
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v5 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v6 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v7
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v8 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v9 offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v10 offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v11 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v12 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v13 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v14 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v15 offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v16 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v17 offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v18 offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v19 offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v20 offset:13
+; ALIGNED-NEXT: s_cmp_lg_u64 s[4:5], 0x800
+; ALIGNED-NEXT: s_cbranch_scc1 .LBB9_2
+; ALIGNED-NEXT: .LBB9_3: ; %Flow13
+; ALIGNED-NEXT: s_andn2_saveexec_b32 s6, s6
+; ALIGNED-NEXT: s_cbranch_execz .LBB9_6
+; ALIGNED-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
+; ALIGNED-NEXT: v_add_nc_u32_e32 v2, 0x7f0, v2
+; ALIGNED-NEXT: s_mov_b64 s[4:5], 0x7f0
+; ALIGNED-NEXT: .LBB9_5: ; %memmove_bwd_loop
+; ALIGNED-NEXT: ; =>This Inner Loop Header: Depth=1
+; ALIGNED-NEXT: s_clause 0xf
+; ALIGNED-NEXT: buffer_load_ubyte v5, v2, s[0:3], 0 offen offset:2
+; ALIGNED-NEXT: buffer_load_ubyte v6, v2, s[0:3], 0 offen offset:3
+; ALIGNED-NEXT: buffer_load_ubyte v7, v2, s[0:3], 0 offen
+; ALIGNED-NEXT: buffer_load_ubyte v8, v2, s[0:3], 0 offen offset:1
+; ALIGNED-NEXT: buffer_load_ubyte v9, v2, s[0:3], 0 offen offset:6
+; ALIGNED-NEXT: buffer_load_ubyte v10, v2, s[0:3], 0 offen offset:7
+; ALIGNED-NEXT: buffer_load_ubyte v11, v2, s[0:3], 0 offen offset:4
+; ALIGNED-NEXT: buffer_load_ubyte v12, v2, s[0:3], 0 offen offset:5
+; ALIGNED-NEXT: buffer_load_ubyte v13, v2, s[0:3], 0 offen offset:10
+; ALIGNED-NEXT: buffer_load_ubyte v14, v2, s[0:3], 0 offen offset:11
+; ALIGNED-NEXT: buffer_load_ubyte v15, v2, s[0:3], 0 offen offset:8
+; ALIGNED-NEXT: buffer_load_ubyte v16, v2, s[0:3], 0 offen offset:9
+; ALIGNED-NEXT: buffer_load_ubyte v17, v2, s[0:3], 0 offen offset:14
+; ALIGNED-NEXT: buffer_load_ubyte v18, v2, s[0:3], 0 offen offset:15
+; ALIGNED-NEXT: buffer_load_ubyte v19, v2, s[0:3], 0 offen offset:12
+; ALIGNED-NEXT: buffer_load_ubyte v20, v2, s[0:3], 0 offen offset:13
+; ALIGNED-NEXT: v_add_co_u32 v3, vcc_lo, v0, s4
+; ALIGNED-NEXT: v_add_co_ci_u32_e32 v4, vcc_lo, s5, v1, vcc_lo
+; ALIGNED-NEXT: v_add_nc_u32_e32 v2, -16, v2
+; ALIGNED-NEXT: s_add_u32 s4, s4, -16
+; ALIGNED-NEXT: s_addc_u32 s5, s5, -1
+; ALIGNED-NEXT: s_waitcnt vmcnt(15)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v5 offset:2
+; ALIGNED-NEXT: s_waitcnt vmcnt(14)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v6 offset:3
+; ALIGNED-NEXT: s_waitcnt vmcnt(13)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v7
+; ALIGNED-NEXT: s_waitcnt vmcnt(12)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v8 offset:1
+; ALIGNED-NEXT: s_waitcnt vmcnt(11)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v9 offset:6
+; ALIGNED-NEXT: s_waitcnt vmcnt(10)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v10 offset:7
+; ALIGNED-NEXT: s_waitcnt vmcnt(9)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v11 offset:4
+; ALIGNED-NEXT: s_waitcnt vmcnt(8)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v12 offset:5
+; ALIGNED-NEXT: s_waitcnt vmcnt(7)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v13 offset:10
+; ALIGNED-NEXT: s_waitcnt vmcnt(6)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v14 offset:11
+; ALIGNED-NEXT: s_waitcnt vmcnt(5)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v15 offset:8
+; ALIGNED-NEXT: s_waitcnt vmcnt(4)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v16 offset:9
+; ALIGNED-NEXT: s_waitcnt vmcnt(3)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v17 offset:14
+; ALIGNED-NEXT: s_waitcnt vmcnt(2)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v18 offset:15
+; ALIGNED-NEXT: s_waitcnt vmcnt(1)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v19 offset:12
+; ALIGNED-NEXT: s_waitcnt vmcnt(0)
+; ALIGNED-NEXT: flat_store_byte v[3:4], v20 offset:13
+; ALIGNED-NEXT: s_cmp_eq_u64 s[4:5], -16
+; ALIGNED-NEXT: s_cbranch_scc0 .LBB9_5
+; ALIGNED-NEXT: .LBB9_6: ; %Flow14
+; ALIGNED-NEXT: s_or_b32 exec_lo, exec_lo, s6
+; ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
+; ALIGNED-NEXT: s_setpc_b64 s[30:31]
+entry:
+ tail call void @llvm.memmove.p0.p5.i64(ptr addrspace(0) noundef nonnull align 1 %dst, ptr addrspace(5) noundef nonnull align 1 %src, i64 2048, i1 false)
+ ret void
+}
+
+
+declare void @llvm.memcpy.p0.p0.i64(ptr addrspace(0) noalias nocapture writeonly, ptr addrspace(0) noalias nocapture readonly, i64, i1 immarg) #2
+declare void @llvm.memcpy.p1.p1.i64(ptr addrspace(1) noalias nocapture writeonly, ptr addrspace(1) noalias nocapture readonly, i64, i1 immarg) #2
+declare void @llvm.memcpy.p0.p4.i64(ptr addrspace(0) noalias nocapture writeonly, ptr addrspace(4) noalias nocapture readonly, i64, i1 immarg) #2
+declare void @llvm.memcpy.p5.p5.i64(ptr addrspace(5) noalias nocapture writeonly, ptr addrspace(5) noalias nocapture readonly, i64, i1 immarg) #2
+
+declare void @llvm.memcpy.p0.p5.i64(ptr addrspace(0) noalias nocapture writeonly, ptr addrspace(5) noalias nocapture readonly, i64, i1 immarg) #2
+
+declare void @llvm.memmove.p0.p0.i64(ptr addrspace(0) nocapture writeonly, ptr addrspace(0) nocapture readonly, i64, i1 immarg) #2
+declare void @llvm.memmove.p1.p1.i64(ptr addrspace(1) nocapture writeonly, ptr addrspace(1) nocapture readonly, i64, i1 immarg) #2
+declare void @llvm.memmove.p0.p4.i64(ptr addrspace(0) nocapture writeonly, ptr addrspace(4) nocapture readonly, i64, i1 immarg) #2
+declare void @llvm.memmove.p5.p5.i64(ptr addrspace(5) nocapture writeonly, ptr addrspace(5) nocapture readonly, i64, i1 immarg) #2
+
+declare void @llvm.memmove.p0.p5.i64(ptr addrspace(0) nocapture writeonly, ptr addrspace(5) nocapture readonly, i64, i1 immarg) #2
+
+attributes #2 = { nocallback nofree nounwind willreturn memory(argmem: readwrite) }