[PATCH] D142359: [TTI][AArch64] Cost model vector INS instructions

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jan 24 07:41:05 PST 2023


dmgreen added a comment.

There are some comments in https://reviews.llvm.org/D132185 about the last time this came up, although that was more aggressive than this.

A few points:

- I'm not a huge fan of making this cpu specific if we can avoid it. Most of the optimization guides I have looked at, going back to the Cortex-A57, give similar latencies for the relevant instructions, and making things very cpu specific increases the surface area for things to go wrong in ways that are not caught. There's not a big difference between Neoverse cores and Cortex-A cores in this regard, and if the lower cost is better we will likely want the advantage for -mcpu=generic too; the single subtarget knob involved is sketched after this list. I also suspect this is more about how well the SLP vectorizer behaves than about exactly how much the instructions cost on the cpu.
- The default cost type in these functions (especially from vectorization) is TCK_RecipThroughput, not TCK_Latency. It is the relative throughputs that matter in most cases. (There can be times where the critical path is important, so latency can be considered as part of the cost, but in general, with a good out-of-order core or a well-scheduled in-order core, throughput is most pressing.) A caller-side query sketch showing the cost kind follows this list.
- This VectorInsertExtractBaseCost is really controlling the cost of inserting to a vector, extracting from a vector, or shuffling the vector around (either directly or through scalarization overhead). With vector shuffling or fp types this will be either "INS (ASIMD insert, element to element)" or "FMOV (FP move, register)". These two are usually quite efficient. With integer types it can involve a move between a general-purpose register and a vector register, which comes under "FMOV (FP transfer, from gen to vec reg)" and "FMOV (FP transfer, from vec to gen reg)". These two have a throughput of 1 on all the cpus I looked at, which is a lot lower than most other instructions. They often come in clumps too, so only being able to do one per cycle can be a real limiting factor (an intrinsics example follows this list).
- The VectorInsertExtractBaseCost is also a lever on how much shuffling you are willing to let the SLP vectorizer produce. I got a report just yesterday about the SLP vectorizer producing slower code because of all the shuffling it introduced around a loop (which I think may be due to the relative cost of FMA, but it shows that low costs on vector inserts/extracts can cause problems).
- The x264 improvement is probably https://reviews.llvm.org/D133441, where there are two mutual reductions on the same values (an add and an mla). With a lower vector extract cost it can vectorize one of the reductions with a lot of extracts, but then vectorize all the extracts with the other reduction in a second step. It doesn't actually have to pay the price of the extracts there, though; it's a bit of a hack to get to the same result. The other case I've seen is a function called hadamard that I mentioned in the other ticket, but as far as I remember that uses ld1 instructions, so the inserts would be expensive if D141602 <https://reviews.llvm.org/D141602> applies.
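
To make the knob concrete, here is a heavily simplified sketch of the AArch64 cost hook. This is not the exact in-tree code: the constant-index and lane-0 special cases are elided, and the signature has shifted between releases.

  // Simplified sketch of AArch64TTIImpl::getVectorInstrCost (not the
  // exact in-tree code; special cases elided).
  InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
                                                     TTI::TargetCostKind CostKind,
                                                     unsigned Index) {
    assert(Val->isVectorTy() && "expected a vector type");
    // Every insert/extract (and the scalarization overhead built on top
    // of them) funnels through this one per-subtarget value, so lowering
    // it changes the modelled cost of INS, FMOV and gpr<->vec transfers
    // all at once, for every cpu that doesn't override it.
    return ST->getVectorInsertExtractBaseCost();
  }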
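And a hypothetical caller-side query, showing the cost kind the vectorizers actually pass (the Ctx/TTI variables are assumed to be in scope, and the exact signature has changed over time):

  // The kind requested is reciprocal throughput, so the returned number
  // should reflect how many of these ops can retire per cycle, not how
  // long one of them takes end to end.
  auto *VecTy = FixedVectorType::get(Type::getInt32Ty(Ctx), 4);
  InstructionCost Cost = TTI.getVectorInstrCost(
      Instruction::ExtractElement, VecTy, TTI::TCK_RecipThroughput,
      /*Index=*/0);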
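For the integer clumping point, an illustrative NEON-intrinsics example; this is a hand-written stand-in for what scalarized codegen tends to produce, and the instruction comments are typical lowerings, not guarantees:

  #include <arm_neon.h>

  // Building a v4i32 from four values already in general registers: each
  // step needs a gen->vec transfer (dup/ins from a gpr), and those
  // typically sustain only one per cycle, so chains like this serialize.
  int32x4_t pack4(int32_t a, int32_t b, int32_t c, int32_t d) {
    int32x4_t v = vdupq_n_s32(a); // dup v0.4s, w0 (gen -> vec)
    v = vsetq_lane_s32(b, v, 1);  // ins v0.s[1], w1
    v = vsetq_lane_s32(c, v, 2);  // ins v0.s[2], w2
    v = vsetq_lane_s32(d, v, 3);  // ins v0.s[3], w3
    return v;
  }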

I've also seen regressions in performance when running SPEC and other benchmarks in the past, and they have often outweighed the benefits. Perhaps keeping the load-insert cost high will help there. I think working towards reducing this default from 3 to 2 is a good thing to do; I'm just not yet sure it is really the better value.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142359/new/

https://reviews.llvm.org/D142359


