[llvm] sve gather scatter offset sinking (PR #66932)

via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 20 10:41:15 PDT 2023


llvmbot wrote:



@llvm/pr-subscribers-backend-aarch64

Changes:

- [flang] Add comdats to functions with linkonce linkage (#66516)
- [clang][TSA] Thread safety cleanup functions
- [SPIRV] Test basic float and int types (#66282)
- [mlgo] Fix tests post PR #66334
- [libunwind][AIX] Fix up TOC register if unw_getcontext is called from a different module (#66549)
- [RISCV] Recognize veyron-v1 processor in clang driver. (#66703)
- [RISCV] Add a combine to form masked.store from unit strided store
- [SROA] Remove unnecessary IsStorePastEnd handling (NFCI)
- In ExprRequirement building, treat OverloadExpr as dependent (#66683)
- [mlir][SCF] `ForOp`: Remove `getIterArgNumberForOpOperand` (#66629)
- [mlir][Interfaces] `LoopLikeOpInterface`: Support ops with multiple regions (#66754)
- [DAGCombiner] Combine vp.strided.load with unit stride to vp.load (#66766)
- [DAGCombiner] Combine vp.strided.store with unit stride to vp.store (#66774)
- [TwoAddressInstruction] Use isPlainlyKilled in processTiedPairs (#65976)
- [RISCV] Fix bad isel predicate handling for Ztso. (#66739)
- [libc][math] Extract non-MPFR math tests into libc-math-smoke-tests.
- [lit] Drop "Script:", make -v and -a imply -vv
- [lit] Improve test output from lit's internal shell
- [lit] Echo full RUN lines in case of external shells (#66408)
- [RISCV] Add a pass to rewrite rd to x0 for non-computational instrs whose return values are unused
- [mlir][spirv][gpu] Convert remaining wmma ops to KHR coop matrix (#66455)
- [mlir][sparse] More allocate -> empty tensor migration (#66720)
- [gn build] Port 93fde2ea1b2c
- [RISCV] Add more instructions for the short forward branch optimization. (#66789)
- [SSP] Accessing __stack_chk_guard when using LTO (#66535)
- [RISCV] Expand test coverage for widening gather and strided load idioms
- [lldb][NFCI] Remove unneeded ConstString from intel-pt plugin (#66721)
- [lldb][NFCI] Remove unneccessary allocation in ScriptInterpreterPythonImpl::GetSyntheticTypeName (#66724)
- [Profile] Delete coverage-debug-info-correlate.cpp test on mac as debug info correlation not working on mac for unkown reasons.
- [lldb] Fix build after d5a62b78b8ae
- [flang][hlfir] Fixed assignment/finalization order for user-defined assignments. (#66736)
- [RISCV] Require alignment when forming gather with larger element type
- Addressed review comments to use ThreadSafe instead of !ThreadSafe
- [flang] Follow memory source through more operations (#66713)
- [X86] Use RIP-relative addressing for data under large data threshold for medium code model
- Fix a bug with cancelling "attach -w" after you have run a process previously (#65822)
- Let the c(xx)_status pages reflect that clang 17 is released
- Revert "[flang][hlfir] Fixed assignment/finalization order for user-defined assignments. (#66736)"
- Revert "Revert "[flang][hlfir] Fixed assignment/finalization order for user-defined assignments. (#66736)""
- [ORC] Add writePointers to ExecutorProcessControl's MemoryAccess
- [Coverage] Skip visiting ctor member initializers with invalid source locations.
- [SLP]Fix PR66795: Check correct deps for vectorized inst with multiple vectorized node uses.
- [github] Make branch workflow more robust (#66781)
- [flang] Correct handling of assumed-rank allocatables in ALLOCATE (#66718)
- [BOLT][runtime] Test for outline-atomics support
- [mlir][spirv] Add conversions for Arith's `maxnumf` and `minnumf` (#66696)
- [libc++][NFC] Clean up std::__call_once
- [libc][cmake] Tidy compiler includes (#66783)
- [OpenMP][Docs][NFC] Update documentation
- [RISCV] Match strided load via DAG combine (#66800)
- [llvm-nm] Add --line-numbers flag
- Revert "[libc][cmake] Tidy compiler includes (#66783)" (#66822)
- [-Wunsafe-bugger-usage] Clean tests: remove nondeterministic ordering
- [mlir][sparse][gpu] free all buffers allocated for spGEMM (#66813)
- [llvm][docs] Update active CoC Commitee members (#66814)
- Explicitly set triple on line-numbers.test
- [AsmPrint] Dump raw frequencies in `-mbb-profile-dump` (#66818)
- [Clang] Static member initializers are not immediate escalating context. (#66021)
- [mlir][spirv] Suffix NV cooperative matrix props with `_nv` (#66820)
- [mlir][spirv] Define KHR cooperative matrix properties (#66823)
- [lit] Fix a test fail under windows
- [InstrProf][compiler-rt] Enable MC/DC Support in LLVM Source-based Code Coverage (1/3)
- [AMDGPU] Use inreg for hint to preload kernel arguments
- [EarlyCSE] Compare GEP instructions based on offset (#65875)
- [libc++] Fix __threading_support when used with C11 threading (#66780)
- [clang] Improve CI output when trailing whitespace is found (#66649)
- [libc] Fix printf config not working (#66834)
- [lit] Apply aa71680f2948's fix to an additional test
- [AMDGPU] Add ASM and MC updates for preloading kernargs
- [bazel] Port c649f29c24c9fc1502d8d53e0c96c3d24b31de1a (llvm-nm --line-numbers)
- Fix test added in D150987 to account for different path separators which was causing the test to fail on Windows.
- [SimplifyCFG] Pre-commit test for extending HoistThenElseCodeToIf.
- [SimplifyCFG] Hoist common instructions on Switch.
- [IR] Add "Large Data Threshold" module metadata (#66797)
- A test was changing directory and then incorrectly restoring the directory to the "testdir" which is the build directory for that test, not the original source directory. That caused subsequent tests to fail.
- [mlir][sparse] unifies sparse_tensor.sort_coo/sort into one operation. (#66722)
- [Docs] Fix table after previous document update
- [Sparc] Remove LEA instructions (NFCI) (#65850)
- [lldb][NFCI] Remove unused struct ConstString::StringIsEqual
- [builtins][NFC] Avoid using CRT_LDBL_128BIT in tests (#66832)
- [RISCV] Prefer Zcmp push/pop instead of save-restore calls. (#66046)
- [DependencyScanningFilesystem] Make sure the local/shared cache filename lookups use only absolute paths (#66122)
- [NFC][hwasan] Make ShowHeapOrGlobalCandidate a method (#66682)
- [NFC][hwasan] Find overflow candidate early (#66682)
- [NFC][hwasan] Clang-format c557621176f5f38b5757a325cc72be0a11a91c78
- [NFC][hwasan] Extract a few BaseReport::Copy methods (#66682)
- [NFC][hwasan] Extract announce_by_id (#66682)
- [NFC][hwasan] Collect heap allocations early (#66682)
- [libc++] Warn if an unsupported compiler is used
- [ELF][test] Improve tests about non-SHF_ALLOC sections relocated by non-ABS relocations
- [ELF] Remove a R_ARM_PCA special case from relocateNonAlloc
- [clang][dataflow] Reorder checks to protect against a null pointer dereference. (#66764)
- [MC,X86] Property report error for modifiers with incorrect size
- [RISCV] Install sifive_vector.h to riscv-resource-headers (#66330)
- [InferAlignment] Create tests for InferAlignment pass
- [InferAlignment] Implement InferAlignmentPass
- [InstCombine] Use a cl::opt to control calls to getOrEnforceKnownAlignment in LoadInst and StoreInst
- [InferAlignment] Enable InferAlignment pass by default
- [ELF][test] Improve -r tests for local symbols
- [mlir][IR] Trigger `notifyOperationRemoved` callback for nested ops (#66771)
- [Workflow] Add new code format helper. (#66684)
- [gn build] Port 0f152a55d3e4
- [RISCV] Fix bugs about register list of Zcmp push/pop. (#66073)
- [AMDGPU] Run twoaddr tests with -early-live-intervals (#66775)
- [TableGen][GlobalISel] Use `GIM_SwitchOpcode` in Combiners (#66864)
- [NFC][InferAlignment] Swap extern declaration and definition of EnableInferAlignmentPass
- [flang] Prevent IR name clashes between BIND(C) and external procedures (#66777)
- Revert "[Workflow] Add new code format helper. (#66684)"
- [lldb][Docs] Fix typo in style docs
- [clang-format][NFC] Clean up signatures of some parser functions (#66569)
- Revert "Fix a bug with cancelling "attach -w" after you have run a process previously (#65822)"
- [OpenMP][VE] Limit the number of threads to create (#66729)
- [SimpleLoopUnswitch] Fix reversed branch during condition injection
- [mlir][vector] Make ReorderElementwiseOpsOnBroadcast support vector.splat (#66596)
- [lldb][AArch64] Add SME's streaming vector control register
- [reland][libc][cmake] Tidy compiler includes (#66783) (#66878)
- [GuardUtils] Revert llvm::isWidenableBranch change (#66411)
- [LLVM] convergence verifier should visit all instructions (#66200)
- [lldb][API] Remove debug print in TestRunLocker.py
- [clang] [C23] Fix crash with _BitInt running clang-tidy (#65889)
- [Flang][OpenMP] Move FIR lowering tests to a separate directory (#66779)
- [RISCV] Add missing V extensions for zvk-invalid-features.c (#66875)
- [mlir][gpu][bufferization] Implement BufferDeallocationOpInterface for gpu.terminator (#66880)
- [analyzer] Fix crash analyzing _BitInt() in evalIntegralCast (#66782)
- [IR] Fix a memory leak if Function::dropAllReferences() is followed by setHungoffOperand
- [X86] vector-interleaved tests - add AVX512-SLOW/AVX512-FAST common prefixes to reduce duplication
- [X86] combineINSERT_SUBVECTOR - attempt to combine concatenated shuffles
- [X86] Add test cases for gnux32 large constants Issue #55061
- [NFC][Clang] Address reviews about overrideFunctionFeaturesWithTargetFeatures (#65938)
- [analyzer] Fix StackAddrEscapeChecker crash on temporary object fields (#66493)
- [VE] Add unittest for intrinsics (#66730)
- [NFC][AMDGPU] Perform a single lookup in map in SIInsertWaitcnts::isPreheaderToFlush
- [NFC][AMDGPU] Remove redundant hasSideEffects=1
- [SROA] Don't shrink volatile load past end
- [mlir][bufferization][scf] Implement BufferDeallocationOpInterface for scf.reduce.return (#66886)
- [RISCV] Add tests where bin ops of splats could be scalarized. NFC (#65747)
- [clang][Interp][NFC] Small code refactoring
- [Docs] Update ExceptionHandling example (NFC)
- [mlir][bufferization][NFC] Move memref specific implementation of AllocationOpInterface to memref dialect directory (#66637)
- [X86] Align other variants to use void * as 512 variants. (#66310)
- [X86] Fix an assembler bug of CMPCCXADD. (#66748)
- [clang][dataflow] Identify post-visit state changes in the HTML logger. (#66746)
- [MLIR][Presburger] Template Matrix to allow MPInt and Fraction; use IntMatrix for integer matrices (#66897)
- [SPIR-V] Fix 64-bit integer literal printing (#66686)
- [libc++] Simplify how the global stream tests are written (#66842)
- [AArch64][SME] Enable TPIDR2 lazy-save for za_preserved
- [X86] X86DAGToDAGISel::matchIndexRecursively - replace hard coded recursion limit with SelectionDAG::MaxRecursionDepth. NFCI.
- [libc++] Sort available features before printing them
- [mlir][VectorOps] Extend vector.constant_mask to support 'all true' scalable dims (#66638)
- Warn on align directive with non-zero fill value in virtual sections (#66792)
- [VE] Add TargetParser to CMakeLists.txt for VE unittest
- [lldb-vscode] Use auto summaries whenever variables don't have a summary (#66551)
- Revert "[clang] Don't inherit dllimport/dllexport to exclude_from_explicit_instantiation members during explicit instantiation (#65961)"
- [AMDGPU] Convert tests rotr.ll and rotl.ll to be auto-generated (#66828)
- [NFC] Fix spelling 'constanst' -> 'constants'
- [mlir][Vector] Add fastmath flags to vector.reduction (#66905)
- [lldb][AArch64] Invalidate cached VG value before reconfiguring SVE registers
- [gn] Add dummy build file for VETests
- [SPIRV] Fix OpConstant float and double printing
- [flang][hlfir] Fixed cleanup code placement indeterminism in OrderedAssignments. (#66811)
- [AMDGPU] Regenerate always-uniform.ll
- [X86] Regenerate pr39098.ll
- [ELF][test] Add a test to demonstrate #66836
- [NFC][AsmPrinter] Refactor FrameIndexExprs as a std::set (#66433)
- [ELF] Postpone "unable to move location counter backward" error (#66854)
- [clang][CodeGen] The `eh_typeid_for` intrinsic needs special care too (#65699)
- [AArch64][GlobalISel] Adopt dup(load) -> LD1R patterns from SelectionDAG
- Cleanup fallback NOT checks
- [AArch64] Add some tests for setcc known bits fold. NFC
- [SelectionDAG] [NFC] Add pre-commit test for PR66701. (#66796)
- [Driver] Some improvements for path handling on NetBSD (#66863)
- [mlir][sparse] remove most bufferization.alloc_tensor ops from sparse (#66847)
- [mlir] Bazel fixes for 1b8b55644313216e6b0fa233bbd8b01fee23f99f (#66929)
- [mlir] introduce transform.loop.forall_to_for (#65474)
- [mlir] regenerate linalg named ops yaml (#65475)
- [SLP]Fix a crash when trying to find operand with re-vectorized main instruction.
- [libc][Obvious] Fix incorrect RPC opcode for `clearerr`
- [SVE][CodeGenPrepare] Sink address calculations that match SVE gather/scatter addressing modes.


---
Full diff: https://github.com/llvm/llvm-project/pull/66932.diff


2 Files Affected:

- (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+35) 
- (added) llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll (+231) 


``````````diff
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index ad01a206c93fb39..f80ce9239458730 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -14380,6 +14380,31 @@ static bool areOperandsOfVmullHighP64(Value *Op1, Value *Op2) {
   return isOperandOfVmullHighP64(Op1) && isOperandOfVmullHighP64(Op2);
 }
 
+static bool shouldSinkVectorOfPtrs(Value *Ptrs, SmallVectorImpl<Use *> &Ops) {
+  // Restrict ourselves to the form CodeGenPrepare typically constructs.
+  auto *GEP = dyn_cast<GetElementPtrInst>(Ptrs);
+  if (!GEP || GEP->getNumOperands() != 2)
+    return false;
+
+  Value *Base = GEP->getOperand(0);
+  Value *Offsets = GEP->getOperand(1);
+
+  // We only care about scalar_base+vector_offsets.
+  if (Base->getType()->isVectorTy() || !Offsets->getType()->isVectorTy())
+    return false;
+
+  // Sink extends that would allow us to use 32-bit offset vectors.
+  if (isa<SExtInst>(Offsets) || isa<ZExtInst>(Offsets)) {
+    auto *OffsetsInst = cast<Instruction>(Offsets);
+    if (OffsetsInst->getType()->getScalarSizeInBits() > 32 &&
+        OffsetsInst->getOperand(0)->getType()->getScalarSizeInBits() <= 32)
+      Ops.push_back(&GEP->getOperandUse(1));
+  }
+
+  // Sink the GEP.
+  return true;
+}
+
 /// Check if sinking \p I's operands to I's basic block is profitable, because
 /// the operands can be folded into a target instruction, e.g.
 /// shufflevectors extracts and/or sext/zext can be folded into (u,s)subl(2).
@@ -14481,6 +14506,16 @@ bool AArch64TargetLowering::shouldSinkOperands(
       Ops.push_back(&II->getArgOperandUse(0));
       Ops.push_back(&II->getArgOperandUse(1));
       return true;
+    case Intrinsic::masked_gather:
+      if (!shouldSinkVectorOfPtrs(II->getArgOperand(0), Ops))
+        return false;
+      Ops.push_back(&II->getArgOperandUse(0));
+      return true;
+    case Intrinsic::masked_scatter:
+      if (!shouldSinkVectorOfPtrs(II->getArgOperand(1), Ops))
+        return false;
+      Ops.push_back(&II->getArgOperandUse(1));
+      return true;
     default:
       return false;
     }
diff --git a/llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll b/llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll
new file mode 100644
index 000000000000000..73322836d1b84a7
--- /dev/null
+++ b/llvm/test/Transforms/CodeGenPrepare/AArch64/sink-gather-scatter-addressing.ll
@@ -0,0 +1,231 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
+; RUN: opt -S --codegenprepare < %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; Sink the GEP to make use of scalar+vector addressing modes.
+define <vscale x 4 x float> @gather_offsets_sink_gep(ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offsets_sink_gep(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i32> [[INDICES]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP0]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i32> %indices
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Sink sext to make use of scalar+sxtw(vector) addressing modes.
+define <vscale x 4 x float> @gather_offsets_sink_sext(ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offsets_sink_sext(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = sext <vscale x 4 x i32> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[PTRS:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[PTRS]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i32> %indices to <vscale x 4 x i64>
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; As above but ensure both the GEP and sext are sunk.
+define <vscale x 4 x float> @gather_offsets_sink_sext_get(ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offsets_sink_sext_get(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = sext <vscale x 4 x i32> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP1]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i32> %indices to <vscale x 4 x i64>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Don't sink GEPs that cannot benefit from SVE's scalar+vector addressing modes.
+define <vscale x 4 x float> @gather_no_scalar_base(<vscale x 4 x ptr> %bases, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_no_scalar_base(
+; CHECK-SAME: <vscale x 4 x ptr> [[BASES:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[PTRS:%.*]] = getelementptr float, <vscale x 4 x ptr> [[BASES]], <vscale x 4 x i32> [[INDICES]]
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[PTRS]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %ptrs = getelementptr float, <vscale x 4 x ptr> %bases, <vscale x 4 x i32> %indices
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Don't sink extends whose result type is already favourable for SVE's sxtw/uxtw addressing modes.
+; NOTE: We still want to sink the GEP.
+define <vscale x 4 x float> @gather_offset_type_too_small(ptr %base, <vscale x 4 x i8> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offset_type_too_small(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i8> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[INDICES_SEXT:%.*]] = sext <vscale x 4 x i8> [[INDICES]] to <vscale x 4 x i32>
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i32> [[INDICES_SEXT]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP0]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i8> %indices to <vscale x 4 x i32>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i32> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Don't sink extends that cannot benefit from SVE's sxtw/uxtw addressing modes.
+; NOTE: We still want to sink the GEP.
+define <vscale x 4 x float> @gather_offset_type_too_big(ptr %base, <vscale x 4 x i48> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offset_type_too_big(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i48> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    [[INDICES_SEXT:%.*]] = sext <vscale x 4 x i48> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[INDICES_SEXT]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[TMP0]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i48> %indices to <vscale x 4 x i64>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Sink zext to make use of scalar+uxtw(vector) addressing modes.
+; TODO: There's an argument here to split the extend into i8->i32 and i32->i64,
+; which would be especially useful if the i8s are the result of a load because
+; it would maintain the use of sign-extending loads.
+define <vscale x 4 x float> @gather_offset_sink_zext(ptr %base, <vscale x 4 x i8> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define <vscale x 4 x float> @gather_offset_sink_zext(
+; CHECK-SAME: ptr [[BASE:%.*]], <vscale x 4 x i8> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = zext <vscale x 4 x i8> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[PTRS:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    [[LOAD:%.*]] = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32.nxv4p0(<vscale x 4 x ptr> [[PTRS]], i32 4, <vscale x 4 x i1> [[MASK]], <vscale x 4 x float> poison)
+; CHECK-NEXT:    ret <vscale x 4 x float> [[LOAD]]
+; CHECK:       exit:
+; CHECK-NEXT:    ret <vscale x 4 x float> zeroinitializer
+;
+entry:
+  %indices.zext = zext <vscale x 4 x i8> %indices to <vscale x 4 x i64>
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.zext
+  %load = tail call <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x float> poison)
+  br label %exit
+
+exit:
+  %ret = phi <vscale x 4 x float> [ zeroinitializer, %entry ], [ %load, %cond.block ]
+  ret <vscale x 4 x float> %ret
+}
+
+; Ensure we support scatters as well as gathers.
+define void @scatter_offsets_sink_sext_get(<vscale x 4 x float> %data, ptr %base, <vscale x 4 x i32> %indices, <vscale x 4 x i1> %mask, i1 %cond) {
+; CHECK-LABEL: define void @scatter_offsets_sink_sext_get(
+; CHECK-SAME: <vscale x 4 x float> [[DATA:%.*]], ptr [[BASE:%.*]], <vscale x 4 x i32> [[INDICES:%.*]], <vscale x 4 x i1> [[MASK:%.*]], i1 [[COND:%.*]]) {
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:    br i1 [[COND]], label [[COND_BLOCK:%.*]], label [[EXIT:%.*]]
+; CHECK:       cond.block:
+; CHECK-NEXT:    [[TMP0:%.*]] = sext <vscale x 4 x i32> [[INDICES]] to <vscale x 4 x i64>
+; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr float, ptr [[BASE]], <vscale x 4 x i64> [[TMP0]]
+; CHECK-NEXT:    tail call void @llvm.masked.scatter.nxv4f32.nxv4p0(<vscale x 4 x float> [[DATA]], <vscale x 4 x ptr> [[TMP1]], i32 4, <vscale x 4 x i1> [[MASK]])
+; CHECK-NEXT:    ret void
+; CHECK:       exit:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %indices.sext = sext <vscale x 4 x i32> %indices to <vscale x 4 x i64>
+  %ptrs = getelementptr float, ptr %base, <vscale x 4 x i64> %indices.sext
+  br i1 %cond, label %cond.block, label %exit
+
+cond.block:
+  tail call void @llvm.masked.scatter.nxv4f32(<vscale x 4 x float> %data, <vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask)
+  br label %exit
+
+exit:
+  ret void
+}
+
+declare <vscale x 4 x float> @llvm.masked.gather.nxv4f32(<vscale x 4 x ptr>, i32, <vscale x 4 x i1>, <vscale x 4 x float>)
+declare void @llvm.masked.scatter.nxv4f32(<vscale x 4 x float>, <vscale x 4 x ptr>, i32, <vscale x 4 x i1>)

``````````
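
For readers unfamiliar with the hook: the patch only *nominates* uses, and the sinking itself is performed generically by CodeGenPrepare (see `CodeGenPrepare::tryToSinkFreeOperands`). Below is a minimal hand-written sketch of that consumer side, not the in-tree code, which additionally sorts the uses, enforces block-local-use constraints, and erases originals that become dead. It shows the contract the new `masked_gather`/`masked_scatter` cases rely on: return true and push exactly the uses whose defining instructions are worth duplicating next to the memory operation.

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Sketch only: clone each nominated operand next to its (possibly already
// sunk) user so that instruction selection, which works one basic block at
// a time, can fold the GEP and any extend feeding it into the
// gather/scatter addressing mode.
static void sinkNominatedOperands(Instruction *I, const TargetLowering &TLI) {
  SmallVector<Use *, 4> OpsToSink;
  if (!TLI.shouldSinkOperands(I, OpsToSink))
    return;

  // Original instruction -> sunk clone, so the use of an extend whose user
  // (the GEP) was itself sunk is rewired on the clone, not the original.
  DenseMap<Instruction *, Instruction *> Sunk;

  // Process uses outermost-first: for the code above, the GEP (pushed last,
  // by the intrinsic cases) before the extend (pushed first, by
  // shouldSinkVectorOfPtrs).
  for (Use *U : reverse(OpsToSink)) {
    auto *OpInst = cast<Instruction>(U->get());
    auto *UserInst = cast<Instruction>(U->getUser());
    Instruction *SunkUser = Sunk.lookup(UserInst);

    // Place the clone immediately before whichever instruction will use it.
    Instruction *Clone = OpInst->clone();
    Clone->insertBefore(SunkUser ? SunkUser : I);
    Sunk[OpInst] = Clone;

    if (SunkUser)
      SunkUser->setOperand(U->getOperandNo(), Clone);
    else
      U->set(Clone); // e.g. the pointer operand of the masked intrinsic
    // Originals are left behind in this sketch; if they end up dead they
    // are cleaned up by later DCE (the real pass erases them eagerly).
  }
}
```

For `gather_offsets_sink_sext_get` above, this reproduces the CHECK lines: the sext clone lands in `cond.block`, the GEP clone follows it, and ISel can then select an SVE gather with a scalar base and sign-extended 32-bit vector offsets instead of materialising 64-bit pointers in the entry block.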



https://github.com/llvm/llvm-project/pull/66932


More information about the llvm-commits mailing list