[PATCH] D105432: [Analysis] Add simple cost model for strict (in-order) reductions

Sander de Smalen via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jul 9 00:50:21 PDT 2021


sdesmalen added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h:136-137
+  /// when scalarizing an operation for a vector with ElementCount \p VF.
+  /// For scalable vectors this currently takes the most pessimistic view based
+  /// upon the maximum possible value for vscale.
+  unsigned getScalarizationCostFactor(ElementCount VF) const {
----------------
dmgreen wrote:
> david-arm wrote:
> > dmgreen wrote:
> > > david-arm wrote:
> > > > dmgreen wrote:
> > > > > I had assumed (without thinking about it very much) that the costs for VF arguments would be based on the exact value of VF from the -mcpu argument if one is provided. If it is not, then we would guess at a value; VF=2 would probably be a sensible default. This is using the maximum possible VF, which sounds like a large over-estimate in most cases.
> > > > > 
> > > > > Can you speak to why the max VF is a better value to use?
> > > > > 
> > > > > I'm not sure I understand why this is scalarizing though.
> > > > Hi @dmgreen, yeah, we are aware of this problem. It's not ideal - at the moment we do this for gather-scatters too. We took the decision to be conservative and use the maximum value of vscale as the worst-case scenario. In practice, the runtime value could vary from machine to machine, and we thought it better to wait a while and revisit this again at some point. In fact, that's partly why I created this function, so that we only have to change one place in future. :)
> > > > 
> > > > I also think that always choosing the most optimistic case could lead to poor performance, so we have to be careful. One option we have is to use the new vscale range IR attributes to refine this, or to choose a value of vscale that represents some sort of average of real use cases?
> > > > 
> > > OK - more work is needed. Got it. I would have expected these cost factors to come from the subtarget, not an IR attribute.
> > > 
> > > What is being scalarized here though? From https://godbolt.org/z/fcz71dPeY for example? Some of the Illegal types were hitting errors.
> > Even though there is a single faddv instruction, I think for now it still makes sense to model this as being scalarised, because conceptually the lane-wise FP additions still have to be done in sequence rather than in a tree.
> Then what performs the scalarization?
> https://godbolt.org/z/hfeaYh8r8
> TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".
> 
> Or do you mean that the fadda will have a low throughput?
The idea is that an fadda will have low throughput because the operation is conceptually scalarized: the fadds can't be performed in parallel, i.e.

  double result = ((((init + v0) + v1) + v2) + ...) + vn; // where v0 .. vn are the lanes of the vector

Perhaps this is more of a latency issue than a 'throughput' issue, but if an operation has a very long latency and blocks one of the functional units, I guess that has an impact on throughput as well.
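
To make the dependence chain concrete, here is a small sketch (my own illustration, not code from the patch) of the scalar loop an in-order reduction is equivalent to; every iteration consumes the previous accumulator, so the additions form a serial chain:

  // Sketch: the serial data dependence behind an in-order FP reduction.
  // Each fadd needs the previous result, so the N additions cannot overlap.
  double strictReduce(double Init, const double *Lanes, unsigned N) {
    double Acc = Init;
    for (unsigned I = 0; I < N; ++I)
      Acc += Lanes[I]; // total latency is roughly N * (fadd latency)
    return Acc;
  }

A tree-wise (unordered) reduction, by contrast, could pair up lanes and finish in O(log N) dependent adds, which is why it deserves a lower cost.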

The more important thing for now is that we want a conservative cost value for these, so that we don't assume in-order/in-loop reductions are cheap, and so that we can tune it to return more sensible values later once we can experiment with this (after all, scalable auto-vec isn't fully functional yet).

The other thing we're planning to improve is to have `getMaxVScale` return the value from the max-vscale attribute in the IR when targeting a specific CPU, so that this cost function no longer assumes the worst-case cost but rather a more realistic cost based on the targeted vector length. The current implementation doesn't do this yet, but that's on our active to-do list.
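
As a rough sketch of the pessimistic scheme described above (the helper name and the assumed maximum vscale of 16 are my own illustration, not the code in this patch), the factor could be computed along these lines:

  #include "llvm/Support/TypeSize.h"

  // Sketch: a scalarization cost factor that assumes the largest possible
  // vscale for scalable vectors (worst case), and uses the exact element
  // count for fixed-width vectors.
  unsigned getScalarizationCostFactorSketch(llvm::ElementCount VF) {
    // Hypothetical worst case for SVE: 2048-bit registers / 128-bit granule.
    const unsigned AssumedMaxVScale = 16;
    if (!VF.isScalable())
      return VF.getFixedValue();
    return VF.getKnownMinValue() * AssumedMaxVScale;
  }

Replacing `AssumedMaxVScale` with a value derived from the max-vscale attribute would give the more realistic cost mentioned above.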

> https://godbolt.org/z/hfeaYh8r8
> TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".
`getArithmeticReductionCostSVE` already returns `Invalid` for the multiplication, but only for the unordered reductions; it doesn't yet do so for the ordered case. @david-arm, can you look into that?
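
For illustration only (a hedged sketch, not the actual AArch64 hook), signalling an unsupported ordered reduction would look something like this, so the vectorizer never picks it:

  #include "llvm/IR/Instruction.h"
  #include "llvm/Support/InstructionCost.h"

  // Sketch: return an Invalid cost for ordered reductions the target cannot
  // lower (e.g. an ordered FP multiply reduction); only FAdd gets a number.
  llvm::InstructionCost getOrderedReductionCostSketch(unsigned Opcode,
                                                      unsigned NumElts) {
    if (Opcode != llvm::Instruction::FAdd)
      return llvm::InstructionCost::getInvalid();
    return NumElts * 2; // placeholder per-lane cost for the serial fadd chain
  }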


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105432/new/

https://reviews.llvm.org/D105432


