[llvm] [AArch64][CostModel] Consider i32 --> i64 partial reduce cost as Invalid for FixedLength vectors (PR #165226)

Mon Oct 27 03:06:24 PDT 2025

sushgokh wrote:

Let's consider the sequence of events that happened:
```
Timeline   Reached via                          State
T1         Initial point                        (Good code)
T2         PR #158641 by Sander Smalen          (Bad code)
T3         PR #163728 by me (current state)     (Super bad code)
```
**State T1:**
Code vectorized with VF=4. Godbolt link for IR and codegen: https://godbolt.org/z/xrGer5fb1

**State T2:**
Code vectorized with VF=2. Godbolt link for IR and codegen: https://godbolt.org/z/MvrKGeKc1

**State T3:**
Code vectorized with VF=4 using partial reduce intrinsics. Godbolt link for IR and codegen: https://godbolt.org/z/9bqWbds4x

According to the SWOG, SMLAL[BT] has the following characteristics:
1. Latency = 4, throughput = 2.
2. An overhead of 1 cycle when forwarding the result to a chained instruction with the same destination operand.

Compare this with latency = 2 / throughput = 4 for SADDW[2]; this is why the states above are rated good, bad, and super bad.

Now, at the IR level, generating the partial reduce intrinsic would be the right thing to do, except that the codegen is bad.
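For readers who do not want to open the Godbolt links, here is a minimal sketch of the kind of i32 --> i64 reduction under discussion. The actual reproducer is behind the links above and is not reproduced here; the function name `sum_widened` and the exact loop shape are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical example (not the reproducer from the Godbolt links):
// each 32-bit element is sign-extended to 64 bits and added into a
// 64-bit accumulator, i.e. an i32 --> i64 widening add reduction.
int64_t sum_widened(const int32_t *data, std::size_t n) {
  int64_t acc = 0;
  for (std::size_t i = 0; i < n; ++i)
    acc += static_cast<int64_t>(data[i]); // sext i32 -> i64, then add
  return acc;
}
```

Whether a loop like this is vectorized with VF=2, VF=4, or with partial reduce intrinsics depends on the target flags and on the cost model, which is what the three states above differ in.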

**What alternatives do we have to address this issue?**
**Alternative 1**: Comment out or make conditional the line at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp#L1980.
With this, we get back to state T2, but we are still left with code vectorized with VF=2.
I am not sure whether tweaking the cost model would help unroll the code twice so that it matches the performance of T1. The other issue is that once the initial VPlan considers using the partial reduce intrinsic with VF=4, it does not consider a VPlan without the intrinsic (this is based only on the debug logs; I haven't gone into the details).

**Alternative 2**: Mark this partial reduce scenario as invalid, as it was in T1.
This helps us get back to the original performance.
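
To make the idea concrete, here is a rough sketch of what "marking the scenario as invalid" means for a cost query. This is not the LLVM API and not the code in this patch: `VectorKind`, `PartialReduceQuery`, and `partialReduceCost` are hypothetical stand-ins, and `std::nullopt` is used here to model an Invalid cost.

```cpp
#include <optional>

// Hypothetical stand-ins for illustration; these are NOT LLVM types.
enum class VectorKind { FixedLength, Scalable };

struct PartialReduceQuery {
  unsigned InputBits; // element width before extension, e.g. 32
  unsigned AccumBits; // accumulator element width, e.g. 64
  VectorKind Kind;    // fixed-length or scalable vector
};

// Returns std::nullopt to model an "Invalid" cost, which a vectorizer
// would treat as "do not form a partial reduction for this case".
std::optional<unsigned> partialReduceCost(const PartialReduceQuery &Q) {
  if (Q.Kind == VectorKind::FixedLength && Q.InputBits == 32 &&
      Q.AccumBits == 64)
    return std::nullopt; // i32 --> i64 on fixed-length vectors: invalid
  return 1u;             // placeholder cost, illustrative only
}
```

With a check of this shape, only the fixed-length i32 --> i64 case is rejected; other widths and scalable vectors still get a valid cost.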

This patch tries to go with alternative 2.

**What else needs to be addressed?**
1. Standalone codegen test
Instead of generating SMLAL[BT] for the above code (see https://godbolt.org/z/59PxsvGfG), we need to generate SADDW[2] or SADDL[BT6].

**Note**: Due to certain constraints, it is difficult to know all the partial reduce patterns behind all the failing Numba instances, so this patch initially handles only the i32-->i64 pattern.

https://github.com/llvm/llvm-project/pull/165226