[llvm] [LoopReduceMotion] Improve loop by extract reduction instruction (PR #179215)

Fri Mar 6 03:49:59 PST 2026

================
@@ -9,361 +9,325 @@ define dso_local i32 @test(ptr noundef %p1, i32 noundef %s_p1, ptr noundef %p2,
 ; CHECK-O3-LABEL: define dso_local i32 @test(
 ; CHECK-O3-SAME: ptr noundef readonly captures(none) [[P1:%.*]], i32 noundef [[S_P1:%.*]], ptr noundef readonly captures(none) [[P2:%.*]], i32 noundef [[S_P2:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
 ; CHECK-O3-NEXT:  [[ENTRY:.*:]]
-; CHECK-O3-NEXT:    [[IDX_EXT8:%.*]] = sext i32 [[S_P2]] to i64
-; CHECK-O3-NEXT:    [[IDX_EXT:%.*]] = sext i32 [[S_P1]] to i64
 ; CHECK-O3-NEXT:    [[TMP0:%.*]] = load <16 x i8>, ptr [[P1]], align 1, !tbaa [[CHAR_TBAA0:![0-9]+]]
 ; CHECK-O3-NEXT:    [[TMP1:%.*]] = zext <16 x i8> [[TMP0]] to <16 x i16>
 ; CHECK-O3-NEXT:    [[TMP2:%.*]] = load <16 x i8>, ptr [[P2]], align 1, !tbaa [[CHAR_TBAA0]]
 ; CHECK-O3-NEXT:    [[TMP3:%.*]] = zext <16 x i8> [[TMP2]] to <16 x i16>
 ; CHECK-O3-NEXT:    [[TMP4:%.*]] = sub nsw <16 x i16> [[TMP1]], [[TMP3]]
 ; CHECK-O3-NEXT:    [[TMP5:%.*]] = tail call <16 x i16> @llvm.abs.v16i16(<16 x i16> [[TMP4]], i1 false)
 ; CHECK-O3-NEXT:    [[TMP6:%.*]] = zext <16 x i16> [[TMP5]] to <16 x i32>
-; CHECK-O3-NEXT:    [[TMP7:%.*]] = tail call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP6]])
+; CHECK-O3-NEXT:    [[IDX_EXT:%.*]] = sext i32 [[S_P1]] to i64
----------------
davemgreen wrote:

Yeah this might be difficult to get right. I don't think anyone has attempted to accurately cost-model reductions after the loop vectorizer like this before. (I'm not even sure our costs are very accurate).

You might need to just use preferInLoopReduction to rule out the massive and numerous MVE regressions this will cause. getExtendedReductionCost and getMulAccReductionCost should give better costs for extending reductions.

This case is fully unrolled so theoretically, worst case, we can fix it in the backend. I'm not sure how difficult that would be, and with too many transforms it becomes impossible to undo. The IR code doesn't look quite like it should because the abs(sub(zext, zext)) can turn into a zext(uabd) with a smaller size, so the extend is larger. If we need to, we could transform that abs+sub+zexts into zext(sub(max,min)) and have the backend transform it back.
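
As a rough scalar illustration of why that rewrite is value-preserving (this is not code from the PR, and the helper names absDiffWide, absDiffNarrow and formsAgreeEverywhere are made up for the sketch): for unsigned 8-bit inputs, the absolute difference computed in a wider type equals max minus min computed entirely in 8 bits, so the extend can sit outside the subtraction.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Wide form, mirroring abs(sub(zext(a), zext(b))): the difference is
// computed in 16 bits (range -255..255) and then widened to 32 bits.
static std::uint32_t absDiffWide(std::uint8_t a, std::uint8_t b) {
  std::int16_t d = static_cast<std::int16_t>(static_cast<std::int16_t>(a) -
                                             static_cast<std::int16_t>(b));
  return static_cast<std::uint32_t>(std::abs(d));
}

// Narrow form, mirroring zext(sub(max(a, b), min(a, b))): the result of
// max - min is always in 0..255, so it fits in 8 bits before the extend.
static std::uint32_t absDiffNarrow(std::uint8_t a, std::uint8_t b) {
  std::uint8_t d = static_cast<std::uint8_t>(std::max(a, b) - std::min(a, b));
  return static_cast<std::uint32_t>(d);
}

// Exhaustively compare the two forms over all 65536 input pairs.
static bool formsAgreeEverywhere() {
  for (int a = 0; a < 256; ++a) {
    for (int b = 0; b < 256; ++b) {
      const auto ua = static_cast<std::uint8_t>(a);
      const auto ub = static_cast<std::uint8_t>(b);
      if (absDiffWide(ua, ub) != absDiffNarrow(ua, ub))
        return false;
    }
  }
  return true;
}
```

This only checks the arithmetic identity per element; whether the backend can reliably match the narrow form back to a uabd is a separate question that the sketch does not address.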

https://github.com/llvm/llvm-project/pull/179215