[llvm] [VPlan] Consider address computation cost in VPInterleaveRecipe. (PR #148808)
Ricardo Jesus via llvm-commits
llvm-commits at lists.llvm.org
Mon Jul 21 07:32:48 PDT 2025
rj-jesus wrote:
Hi @fhahn, @Mel-Chen, I'm sorry for the late reply.
> As far as I know, SVE only supports interleave factors from 2 to 4, while this case uses a factor of 5. I suspect that's the real cause of the regression.
Perhaps I misunderstood, but I don't think this is SVE-related as the loop is vectorised with fixed-length vectors?
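For reference, a loop of roughly this shape would produce an interleave group with factor 5 and the 10 loads and 5 stores discussed below. This is a hypothetical reconstruction for illustration only (the function name and operation are assumptions; the exact input is in the godbolt link):

```c
/* Hypothetical sketch of a loop with interleave factor 5: each
 * iteration touches five consecutive "fields" of a, b and c, giving
 * 10 loads and 5 stores per scalar iteration once the inner loop is
 * unrolled. Not the actual godbolt input. */
void saxpy5(float *restrict a, const float *restrict b,
            const float *restrict c, long n) {
  for (long i = 0; i < n; ++i)
    for (int j = 0; j < 5; ++j)   /* fully unrolled in practice */
      a[5 * i + j] = b[5 * i + j] + c[5 * i + j];
}
```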
I agree that, in a vacuum, it might not make sense to consider the address computations, as @fhahn mentioned. The problem is that, at least as far as I understand it, this is currently not done consistently when costing the different memory operations. For example, consider the input above (https://godbolt.org/z/j6a5ofvo5). For the scalar loop, we'll have the following costs:
```
LV: Computing best VF using cost kind: Reciprocal Throughput
LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %for.body.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %arrayidx1 = getelementptr inbounds nuw float, ptr %b, i64 %indvars.iv, !dbg !36
LV: Found an estimated cost of 2 for VF 1 For instruction: %1 = load float, ptr %arrayidx1, align 4, !dbg !36, !tbaa !27
...
LV: Scalar loop costs: 51.
```
The cost of every scalar load and store is inflated by 1 (the address-computation cost), and there are 10 loads and 5 stores. This means that 15/51 ≈ 29% of the total estimated cost of the scalar loop is due to address computations. Judging by the code actually generated, this is an overestimate, but more importantly it penalises the scalar version significantly compared to the interleaved version. As @david-arm mentioned, since the other memory recipes already account for address costs (whether or not that is sensible), I thought it would be reasonable to make the interleave recipe do so as well.
> It may be the case that the cost of the interleave group itself may not be accurate and TTI may need to be updated for Grace?
It might be, but as far as I can see, Apple CPUs compute the same costs for the interleave group as the Neoverse CPUs. The difference that lets Apple CPUs prefer the scalar version seems to be that, intentionally or not, they default to `prefersVectorizedAddressing() = false` ([link](https://github.com/llvm/llvm-project/blob/3371b9111f26dc758f68c6691e24200cf86a8b74/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp#L4410)). This increases the cost of the interleaved version by one, which is just enough to make the scalar version preferable.
I can have a look at making this or some other similar change Grace-specific, but it does seem there's a mismatch here in how we consider the address costs in different recipes. What do you think?
https://github.com/llvm/llvm-project/pull/148808