[llvm] [SROA] Avoid redundant `.oldload` generation when `memset` fully covers a partition (PR #179643)

via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 4 04:44:02 PST 2026


llvmbot wrote:


<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-llvm-transforms

Author: None (int-zjt)

<details>
<summary>Changes</summary>

In our internal (ByteDance) builds we frequently hit very large `DeadPhiWeb`s that cause serious compile-time slowdowns, especially in auto-generated code where a single file can take 20+ minutes to compile. There have been previous attempts to shrink `DeadPhiWeb`s in `InstCombine` (e.g. llvm/llvm-project#108876 and llvm/llvm-project#158057), but in our workload we still see a lot of time spent later in the pipeline (notably in `JumpThreading` and `CorrelatedValuePropagation`).

After digging into our cases, we found that a large share of the `DeadPhiWeb`s comes from SROA rewriting `memset`s. We often end up with patterns like:
```
%.sroa.xxx.oldload = load <ty>, ptr %.sroa.xxx
%unused = ptrtoint ptr %.sroa.xxx.oldload to i64   ; or a bitcast-like use
store <ty> <new_value>, ptr %.sroa.xxx
```
Even if `%unused` is cleaned up by later DCE-style passes, this load/store shape can still make `PromoteMem2Reg` conservatively treat many blocks as live-in when computing iterated dominance frontiers (IDF). With cyclic CFGs this easily creates large, sticky dead phi webs, and the rest of the pipeline pays for it.

The core issue is that `visitMemSetInst` used the slice's original offsets (`BeginOffset`/`EndOffset`) when deciding whether it needs to merge with an `.oldload` to preserve bytes not written by the `memset`. First, the original condition contained a typo (`EndOffset != NewAllocaBeginOffset` instead of `EndOffset != NewAllocaEndOffset`), which made the check effectively always true and forced the merge path in most cases. Second, even with the typo fixed, comparing the original slice range against the partition bounds is still too strict: when the `memset` contains the partition (e.g. a large `memset` over the whole alloca while the partition is just a subrange), the rewrite would still be misclassified as requiring an `.oldload`. Both issues lead to many redundant loads and downstream dead phi webs.

This change switches the check to use the already-computed intersection offsets (`NewBeginOffset`/`NewEndOffset`) against the partition bounds, so an `.oldload` is generated only when the `memset` actually writes just part of the partition:
```diff
- if (IntTy && (BeginOffset != NewAllocaBeginOffset ||
-               EndOffset   != NewAllocaEndOffset)) {
+ if (IntTy && (NewBeginOffset != NewAllocaBeginOffset ||
+               NewEndOffset   != NewAllocaEndOffset)) {
    ; emit oldload + insertInteger merge
  }
```
In our workload this eliminates many pointless `.oldload`s and shrinks the dead phi webs seen after `mem2reg`, improving compile time without changing semantics (partial overwrites still merge; full overwrites don't).

---
Full diff: https://github.com/llvm/llvm-project/pull/179643.diff


3 Files Affected:

- (modified) llvm/lib/Transforms/Scalar/SROA.cpp (+2-2) 
- (modified) llvm/test/Transforms/SROA/basictest.ll (+7-9) 
- (modified) llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll (-2) 


``````````diff
diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index 83eabdae3db7f..56f2ed167a50b 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -3643,8 +3643,8 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
       uint64_t Size = NewEndOffset - NewBeginOffset;
       V = getIntegerSplat(II.getValue(), Size);
 
-      if (IntTy && (BeginOffset != NewAllocaBeginOffset ||
-                    EndOffset != NewAllocaBeginOffset)) {
+      if (IntTy && (NewBeginOffset != NewAllocaBeginOffset ||
+                    NewEndOffset != NewAllocaEndOffset)) {
         Value *Old = IRB.CreateAlignedLoad(NewAllocaTy, &NewAI,
                                            NewAI.getAlign(), "oldload");
         Old = convertValue(DL, IRB, Old, IntTy);
diff --git a/llvm/test/Transforms/SROA/basictest.ll b/llvm/test/Transforms/SROA/basictest.ll
index 15803f7b5a25b..1d2a00e7f2380 100644
--- a/llvm/test/Transforms/SROA/basictest.ll
+++ b/llvm/test/Transforms/SROA/basictest.ll
@@ -529,7 +529,6 @@ entry:
 define ptr @test10() {
 ; CHECK-LABEL: @test10(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[TMP0:%.*]] = ptrtoint ptr null to i64
 ; CHECK-NEXT:    ret ptr null
 ;
 entry:
@@ -1083,20 +1082,19 @@ define void @PR14059.1(ptr %d) {
 ; CHECK-NEXT:    [[X_SROA_0_I_2_INSERT_MASK:%.*]] = and i64 [[TMP2]], -281474976645121
 ; CHECK-NEXT:    [[X_SROA_0_I_2_INSERT_INSERT:%.*]] = or i64 [[X_SROA_0_I_2_INSERT_MASK]], 0
 ; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i64 [[X_SROA_0_I_2_INSERT_INSERT]] to double
-; CHECK-NEXT:    [[TMP4:%.*]] = bitcast double [[TMP3]] to i64
 ; CHECK-NEXT:    [[X_SROA_0_I_4_COPYLOAD:%.*]] = load i32, ptr [[D:%.*]], align 1
-; CHECK-NEXT:    [[TMP5:%.*]] = bitcast double 0.000000e+00 to i64
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast double 0.000000e+00 to i64
 ; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_EXT:%.*]] = zext i32 [[X_SROA_0_I_4_COPYLOAD]] to i64
 ; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_SHIFT:%.*]] = shl i64 [[X_SROA_0_I_4_INSERT_EXT]], 32
-; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_MASK3:%.*]] = and i64 [[TMP5]], 4294967295
+; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_MASK3:%.*]] = and i64 [[TMP4]], 4294967295
 ; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_INSERT4:%.*]] = or i64 [[X_SROA_0_I_4_INSERT_MASK3]], [[X_SROA_0_I_4_INSERT_SHIFT]]
-; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT4]] to double
-; CHECK-NEXT:    [[TMP7:%.*]] = bitcast double [[TMP6]] to i64
-; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_MASK:%.*]] = and i64 [[TMP7]], 4294967295
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT4]] to double
+; CHECK-NEXT:    [[TMP6:%.*]] = bitcast double [[TMP5]] to i64
+; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_MASK:%.*]] = and i64 [[TMP6]], 4294967295
 ; CHECK-NEXT:    [[X_SROA_0_I_4_INSERT_INSERT:%.*]] = or i64 [[X_SROA_0_I_4_INSERT_MASK]], 4607182418800017408
-; CHECK-NEXT:    [[TMP8:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT]] to double
+; CHECK-NEXT:    [[TMP7:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT]] to double
 ; CHECK-NEXT:    [[ACCUM_REAL_I:%.*]] = load double, ptr [[D]], align 8
-; CHECK-NEXT:    [[ADD_R_I:%.*]] = fadd double [[ACCUM_REAL_I]], [[TMP8]]
+; CHECK-NEXT:    [[ADD_R_I:%.*]] = fadd double [[ACCUM_REAL_I]], [[TMP7]]
 ; CHECK-NEXT:    store double [[ADD_R_I]], ptr [[D]], align 8
 ; CHECK-NEXT:    ret void
 ;
diff --git a/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll b/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll
index 72014912edd20..197c8e6908ae3 100644
--- a/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll
+++ b/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll
@@ -249,8 +249,6 @@ define amdgpu_kernel void @test_half_array() #0 {
 ; CHECK-NEXT:    [[B_BLOCKWISE_COPY_SROA_4:%.*]] = alloca float, align 4
 ; CHECK-NEXT:    call void @llvm.memset.p0.i32(ptr align 16 [[B_BLOCKWISE_COPY_SROA_0]], i8 0, i32 4, i1 false)
 ; CHECK-NEXT:    call void @llvm.memset.p0.i32(ptr align 4 [[B_BLOCKWISE_COPY_SROA_4]], i8 0, i32 4, i1 false)
-; CHECK-NEXT:    [[TMP0:%.*]] = bitcast float undef to i32
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast float undef to i32
 ; CHECK-NEXT:    [[DATA:%.*]] = load [4 x float], ptr undef, align 4
 ; CHECK-NEXT:    [[DATA_FCA_0_EXTRACT:%.*]] = extractvalue [4 x float] [[DATA]], 0
 ; CHECK-NEXT:    store float [[DATA_FCA_0_EXTRACT]], ptr [[B_BLOCKWISE_COPY_SROA_0]], align 16

``````````

</details>


https://github.com/llvm/llvm-project/pull/179643

