[llvm] [RISCV] Account for factor in interleave memory op costs (PR #111511)

Fri Oct 11 13:00:12 PDT 2024

preames wrote:

I got curious and ran a full set of Factor vs SEW vs LMUL sweeps for the segmented loads on the BP3. You can find all the data here: https://github.com/preames/bp3-microarch/#vlseg_lmul_x_sew_throughput

Overall, my data confirms the snippets that Luke has posted above.  Let me summarize what I think is going on here.

BP3 appears to have two implementations: one used for factors 2, 3, and 4, and the other for factors 5 through 8. The first appears to be a wide load followed by some kind of shuffle operation, whereas the second appears to scale with the number of 128-bit loads required, plus some kind of adjustment term. To be honest, I'm having trouble fitting a good formula for either.

Given these results, and what Craig has noted above about the x280, I think we need to have distinct costing models for different factors. Annoyingly, it looks like the factor threshold between those models may need to differ by processor as well.
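
To make the shape of that proposal concrete, here is a minimal sketch of a factor-dependent cost with a per-processor threshold. Everything in it is an assumption for illustration: `InterleaveTuning`, `numChunks`, and `segmentedLoadCost` are hypothetical names that do not exist in LLVM, and the two formulas are placeholders rather than a fit to the measured data (no good fit is known yet). The only details taken from the measurements are the split at factor 4 and the 128-bit access width.

```cpp
#include <cassert>

// Hypothetical per-processor tuning knobs. None of these names exist in
// LLVM; the values a real target would use are unknown.
struct InterleaveTuning {
  unsigned MaxShuffleFactor;    // largest factor using "wide load + shuffle"
  unsigned ChunkBits;           // width of one memory access (128 on BP3)
  unsigned ShuffleCostPerField; // placeholder weight for the shuffle step
  unsigned CostPerChunk;        // placeholder weight per memory access
  unsigned Adjustment;          // placeholder fixed term for large factors
};

// Number of ChunkBits-wide accesses needed to cover TotalBits (rounds up).
inline unsigned numChunks(unsigned TotalBits, unsigned ChunkBits) {
  assert(ChunkBits != 0 && "chunk width must be non-zero");
  return (TotalBits + ChunkBits - 1) / ChunkBits;
}

// Pick one of two placeholder cost models depending on the factor.
// FieldBits is the number of bits loaded per field (per destination
// register group).
inline unsigned segmentedLoadCost(const InterleaveTuning &T, unsigned Factor,
                                  unsigned FieldBits) {
  unsigned Chunks = numChunks(Factor * FieldBits, T.ChunkBits);
  if (Factor <= T.MaxShuffleFactor) {
    // Small factors: memory accesses plus a shuffle term per field.
    return Chunks + Factor * T.ShuffleCostPerField;
  }
  // Large factors: scale with the number of accesses, plus a fixed term.
  return Chunks * T.CostPerChunk + T.Adjustment;
}
```

A second processor would supply its own `InterleaveTuning` with a different `MaxShuffleFactor`, which is the point of making the threshold a tuning parameter rather than a constant.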

https://github.com/llvm/llvm-project/pull/111511