[llvm] [SROA] Avoid redundant `.oldload` generation when `memset` fully covers a partition (PR #179643)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 4 03:29:03 PST 2026
https://github.com/int-zjt created https://github.com/llvm/llvm-project/pull/179643
##### Background: Compile-time issues caused by DeadPhiWeb
In our internal build environment we see severe compile-time regressions caused by large `DeadPhiWeb`s, especially in some auto-generated codebases where compilation time can exceed 20 minutes.
Previous upstream attempts tried to mitigate `DeadPhiWeb` primarily in `InstCombine`:
- https://github.com/llvm/llvm-project/pull/108876
- https://github.com/llvm/llvm-project/pull/158057/changes
These patches help in some cases, but we still observe significant compile time spent in passes such as JumpThreading and CorrelatedValuePropagation, indicating that the root cause is often introduced earlier in the pipeline.
##### Root cause in our workload: redundant oldload after SROA memset rewriting
We found that in our workload, a major source of `DeadPhiWeb` originates from SROA rewriting of `memset`. After SROA, many `memset` operations are lowered into patterns resembling:
```
%.sroa.xxx.oldload = load <ty>, ptr %.sroa.xxx
%unused = ptrtoint ptr %.sroa.xxx.oldload to i64 ; or a bitcast-like instruction
store <ty> <new_value>, ptr %.sroa.xxx
```
Although `%unused` (and sometimes the load itself) may later be removed by DCE-like passes, it can still cause significant overhead in PromoteMem2Reg. During IDF (iterated dominance frontier) computation, a value with a load-before-store pattern (e.g. `%.sroa.xxx`) tends to mark many predecessor blocks as `LiveIn`. If those blocks form cycles, this can result in persistent and expensive-to-process `DeadPhiWeb`s.
##### Why this happens: incorrect full-coverage check for memset vs. partition
In `SROA::AllocaSliceRewriter::visitMemSetInst`, SROA needs to decide whether it is performing:
- a partial `memset` of the new alloca partition (requires merging with the old value via `.oldload` to preserve bytes not written), or
- a full overwrite of the partition (no need for `.oldload`).
The correct criterion should be based on the intersection range of the slice and the partition:
- Partition range: `[P_begin, P_end)`
- Intersected range: `[NewBeginOffset, NewEndOffset)`
A merge is only required when:
`[NewBeginOffset, NewEndOffset) != [P_begin, P_end)`
However, the previous implementation compared the original slice offsets (`BeginOffset`/`EndOffset`) against the partition bounds, and due to a typo even compared `EndOffset` against `NewAllocaBeginOffset`, making the condition effectively always true. As a result, the merge path was taken even when the `memset` fully covered the partition, unnecessarily introducing `.oldload` instructions and their downstream fallout.
##### This change
This patch refines the condition to compare the intersected range (`NewBeginOffset`/`NewEndOffset`) against the partition bounds (`NewAllocaBeginOffset`/`NewAllocaEndOffset`), and only emits `.oldload` when the `memset` does not fully cover the partition.
```
- if (IntTy && (BeginOffset != NewAllocaBeginOffset ||
-               EndOffset != NewAllocaBeginOffset)) {
+ if (IntTy && (NewBeginOffset != NewAllocaBeginOffset ||
+               NewEndOffset != NewAllocaEndOffset)) {
    // emit oldload + insertInteger merge
  }
```
##### Impact
- Reduces redundant loads generated by SROA in common “large memset + partitioned alloca” patterns.
- Mitigates formation of `DeadPhiWeb` introduced before `mem2reg`, reducing downstream compile-time in passes like JumpThreading / CVP in our workload.
- Expected to be a compile-time optimization with no semantic change: `.oldload` is still emitted for true partial overwrites.
From fffd3855a59204de816d63fff213d2baaaa33ab0 Mon Sep 17 00:00:00 2001
From: "zhangjiatong.0" <zhangjiatong.0 at bytedance.com>
Date: Tue, 3 Feb 2026 19:38:51 +0800
Subject: [PATCH] [SROA] fix always-true condition in visitMemSetInst
---
llvm/lib/Transforms/Scalar/SROA.cpp | 4 ++--
llvm/test/Transforms/SROA/basictest.ll | 16 +++++++---------
.../SROA/sroa-common-type-fail-promotion.ll | 2 --
3 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index 83eabdae3db7f..56f2ed167a50b 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -3643,8 +3643,8 @@ class AllocaSliceRewriter : public InstVisitor<AllocaSliceRewriter, bool> {
uint64_t Size = NewEndOffset - NewBeginOffset;
V = getIntegerSplat(II.getValue(), Size);
- if (IntTy && (BeginOffset != NewAllocaBeginOffset ||
- EndOffset != NewAllocaBeginOffset)) {
+ if (IntTy && (NewBeginOffset != NewAllocaBeginOffset ||
+ NewEndOffset != NewAllocaEndOffset)) {
Value *Old = IRB.CreateAlignedLoad(NewAllocaTy, &NewAI,
NewAI.getAlign(), "oldload");
Old = convertValue(DL, IRB, Old, IntTy);
diff --git a/llvm/test/Transforms/SROA/basictest.ll b/llvm/test/Transforms/SROA/basictest.ll
index 15803f7b5a25b..1d2a00e7f2380 100644
--- a/llvm/test/Transforms/SROA/basictest.ll
+++ b/llvm/test/Transforms/SROA/basictest.ll
@@ -529,7 +529,6 @@ entry:
define ptr @test10() {
; CHECK-LABEL: @test10(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[TMP0:%.*]] = ptrtoint ptr null to i64
; CHECK-NEXT: ret ptr null
;
entry:
@@ -1083,20 +1082,19 @@ define void @PR14059.1(ptr %d) {
; CHECK-NEXT: [[X_SROA_0_I_2_INSERT_MASK:%.*]] = and i64 [[TMP2]], -281474976645121
; CHECK-NEXT: [[X_SROA_0_I_2_INSERT_INSERT:%.*]] = or i64 [[X_SROA_0_I_2_INSERT_MASK]], 0
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i64 [[X_SROA_0_I_2_INSERT_INSERT]] to double
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast double [[TMP3]] to i64
; CHECK-NEXT: [[X_SROA_0_I_4_COPYLOAD:%.*]] = load i32, ptr [[D:%.*]], align 1
-; CHECK-NEXT: [[TMP5:%.*]] = bitcast double 0.000000e+00 to i64
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast double 0.000000e+00 to i64
; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_EXT:%.*]] = zext i32 [[X_SROA_0_I_4_COPYLOAD]] to i64
; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_SHIFT:%.*]] = shl i64 [[X_SROA_0_I_4_INSERT_EXT]], 32
-; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_MASK3:%.*]] = and i64 [[TMP5]], 4294967295
+; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_MASK3:%.*]] = and i64 [[TMP4]], 4294967295
; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_INSERT4:%.*]] = or i64 [[X_SROA_0_I_4_INSERT_MASK3]], [[X_SROA_0_I_4_INSERT_SHIFT]]
-; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT4]] to double
-; CHECK-NEXT: [[TMP7:%.*]] = bitcast double [[TMP6]] to i64
-; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_MASK:%.*]] = and i64 [[TMP7]], 4294967295
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT4]] to double
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast double [[TMP5]] to i64
+; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_MASK:%.*]] = and i64 [[TMP6]], 4294967295
; CHECK-NEXT: [[X_SROA_0_I_4_INSERT_INSERT:%.*]] = or i64 [[X_SROA_0_I_4_INSERT_MASK]], 4607182418800017408
-; CHECK-NEXT: [[TMP8:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT]] to double
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast i64 [[X_SROA_0_I_4_INSERT_INSERT]] to double
; CHECK-NEXT: [[ACCUM_REAL_I:%.*]] = load double, ptr [[D]], align 8
-; CHECK-NEXT: [[ADD_R_I:%.*]] = fadd double [[ACCUM_REAL_I]], [[TMP8]]
+; CHECK-NEXT: [[ADD_R_I:%.*]] = fadd double [[ACCUM_REAL_I]], [[TMP7]]
; CHECK-NEXT: store double [[ADD_R_I]], ptr [[D]], align 8
; CHECK-NEXT: ret void
;
diff --git a/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll b/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll
index 72014912edd20..197c8e6908ae3 100644
--- a/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll
+++ b/llvm/test/Transforms/SROA/sroa-common-type-fail-promotion.ll
@@ -249,8 +249,6 @@ define amdgpu_kernel void @test_half_array() #0 {
; CHECK-NEXT: [[B_BLOCKWISE_COPY_SROA_4:%.*]] = alloca float, align 4
; CHECK-NEXT: call void @llvm.memset.p0.i32(ptr align 16 [[B_BLOCKWISE_COPY_SROA_0]], i8 0, i32 4, i1 false)
; CHECK-NEXT: call void @llvm.memset.p0.i32(ptr align 4 [[B_BLOCKWISE_COPY_SROA_4]], i8 0, i32 4, i1 false)
-; CHECK-NEXT: [[TMP0:%.*]] = bitcast float undef to i32
-; CHECK-NEXT: [[TMP1:%.*]] = bitcast float undef to i32
; CHECK-NEXT: [[DATA:%.*]] = load [4 x float], ptr undef, align 4
; CHECK-NEXT: [[DATA_FCA_0_EXTRACT:%.*]] = extractvalue [4 x float] [[DATA]], 0
; CHECK-NEXT: store float [[DATA_FCA_0_EXTRACT]], ptr [[B_BLOCKWISE_COPY_SROA_0]], align 16