[llvm] [SLP]Improved reduction cost/codegen (PR #118293)
Philip Reames via llvm-commits
llvm-commits at lists.llvm.org
Mon Jan 27 08:36:06 PST 2025
================
@@ -45,11 +45,19 @@ define float @test(ptr %x) {
; CHECK-NEXT: [[TMP3:%.*]] = load float, ptr [[ARRAYIDX_28]], align 4
; CHECK-NEXT: [[ARRAYIDX_29:%.*]] = getelementptr inbounds float, ptr [[X]], i64 30
; CHECK-NEXT: [[TMP4:%.*]] = load float, ptr [[ARRAYIDX_29]], align 4
-; CHECK-NEXT: [[TMP5:%.*]] = call fast float @llvm.vector.reduce.fadd.v16f32(float 0.000000e+00, <16 x float> [[TMP0]])
-; CHECK-NEXT: [[TMP6:%.*]] = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> [[TMP1]])
-; CHECK-NEXT: [[OP_RDX:%.*]] = fadd fast float [[TMP5]], [[TMP6]]
-; CHECK-NEXT: [[TMP7:%.*]] = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> [[TMP2]])
-; CHECK-NEXT: [[OP_RDX1:%.*]] = fadd fast float [[OP_RDX]], [[TMP7]]
+; CHECK-NEXT: [[TMP5:%.*]] = call fast <4 x float> @llvm.vector.extract.v4f32.v16f32(<16 x float> [[TMP0]], i64 0)
----------------
preames wrote:
Er, this does not look like an improvement. Assuming zvl128b (the default), these extracts are going to translate to vslidedown.vx at m4 and m2 respectively. 3x m4 + 2x m2 + an m1 reduce is going to be more expensive than one reduction each at m4, m2, and m1 plus the scalar adds. The new lowering here only looks profitable if we know the exact VLEN is 128, not merely that the lower bound on VLEN is 128.
There's an alternate lowering here which would likely work better. If you sum the smaller vectors into the larger vectors (via extract, add, and insert), I think we'll be smart enough to turn that into a VL-predicated vfadd.vv at the smaller LMUL (check!). If so, that would be 1x m1 vfadd.vv + 1x m2 vfadd.vv + 1x m4 reduce.
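For concreteness, a rough sketch of what I mean, using hypothetical %t0/%t1/%t2 arguments standing in for the TMP0/TMP1/TMP2 values from the test above (this is the shape of the idea, not necessarily the exact IR the vectorizer would emit): fold each narrower partial sum into the low lanes of the wider vector via extract/fadd/insert, then do a single reduction at the widest type.

declare <4 x float> @llvm.vector.extract.v4f32.v8f32(<8 x float>, i64)
declare <8 x float> @llvm.vector.extract.v8f32.v16f32(<16 x float>, i64)
declare <8 x float> @llvm.vector.insert.v8f32.v4f32(<8 x float>, <4 x float>, i64)
declare <16 x float> @llvm.vector.insert.v16f32.v8f32(<16 x float>, <8 x float>, i64)
declare float @llvm.vector.reduce.fadd.v16f32(float, <16 x float>)

define float @reduce_alt(<16 x float> %t0, <8 x float> %t1, <4 x float> %t2) {
  ; Fold the v4 partial into the low lanes of the v8 partial.
  %lo4  = call <4 x float> @llvm.vector.extract.v4f32.v8f32(<8 x float> %t1, i64 0)
  %sum4 = fadd fast <4 x float> %lo4, %t2
  %t1a  = call <8 x float> @llvm.vector.insert.v8f32.v4f32(<8 x float> %t1, <4 x float> %sum4, i64 0)
  ; Fold the updated v8 partial into the low lanes of the v16 partial.
  %lo8  = call <8 x float> @llvm.vector.extract.v8f32.v16f32(<16 x float> %t0, i64 0)
  %sum8 = fadd fast <8 x float> %lo8, %t1a
  %t0a  = call <16 x float> @llvm.vector.insert.v16f32.v8f32(<16 x float> %t0, <8 x float> %sum8, i64 0)
  ; Single reduction over the widest vector.
  %r = call fast float @llvm.vector.reduce.fadd.v16f32(float 0.000000e+00, <16 x float> %t0a)
  ret float %r
}

Since the index-0 extracts and inserts are effectively subregister operations, the backend should (I believe) be able to lower the two fadds as vfadd.vv at m1 and m2 with VL set to 4 and 8, leaving just the single m4 reduction.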
https://github.com/llvm/llvm-project/pull/118293