[PATCH] D105432: [Analysis] Add simple cost model for strict (in-order) reductions

David Sherwood via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 6 03:48:43 PDT 2021


david-arm added inline comments.


================
Comment at: llvm/include/llvm/CodeGen/BasicTTIImpl.h:2087
+                                   TTI::TargetCostKind CostKind) {
+    auto *VTy = cast<FixedVectorType>(Ty);
+    InstructionCost Cost =
----------------
dmgreen wrote:
> Should this be checking for scalable vectors?
I don't think we can easily calculate a cost for scalable vectors here, since it is very target-specific. At the moment I have modelled this on the worst case, i.e. scalarising the operation, and for scalable vectors we don't know the number of elements. The approach taken here is similar to other functions in this file, e.g. getCommonMaskedMemoryOpCost. I think if we get here for scalable vectors it's actually a bug, since the target should really have dealt with this case separately. Any thoughts, @sdesmalen?
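For illustration, this is roughly the shape I have in mind (a sketch only; the function name and the assert guarding against scalable types are my own additions here, not necessarily what the final patch will look like):

  InstructionCost getOrderedReductionCost(unsigned Opcode, VectorType *Ty,
                                          TTI::TargetCostKind CostKind) {
    // Scalable vectors should have been costed by the target already.
    assert(!isa<ScalableVectorType>(Ty) &&
           "Expected target to cost scalable vectors separately");
    auto *VTy = cast<FixedVectorType>(Ty);
    // Worst case: one scalar op per element, executed strictly in order.
    InstructionCost Cost =
        getArithmeticInstrCost(Opcode, VTy->getElementType(), CostKind);
    return Cost * VTy->getNumElements();
  }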


================
Comment at: llvm/include/llvm/CodeGen/BasicTTIImpl.h:2089
+    InstructionCost Cost =
+        getArithmeticInstrCost(Opcode, VTy->getElementType(), CostKind);
+    return Cost * VTy->getNumElements();
----------------
dmgreen wrote:
> I think this should also include the cost of the extracts (which will sometimes be free, but that is target specific). Probably using the ScalarizationOverhead.
Fair point. @sdesmalen also suggested this privately.
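Something like the following is what I'd plan to do (a sketch, assuming the getScalarizationOverhead() helper in this file, with Insert=false/Extract=true since the scalarised form only reads lanes):

  auto *VTy = cast<FixedVectorType>(Ty);
  // Account for extracting each lane out of the vector...
  InstructionCost ExtractCost =
      getScalarizationOverhead(VTy, /*Insert=*/false, /*Extract=*/true);
  // ...plus one scalar arithmetic op per element.
  InstructionCost ArithCost =
      getArithmeticInstrCost(Opcode, VTy->getElementType(), CostKind);
  return ExtractCost + ArithCost * VTy->getNumElements();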


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h:136-137
+  /// when scalarizing an operation for a vector with ElementCount \p VF.
+  /// For scalable vectors this currently takes the most pessimistic view based
+  /// upon the maximum possible value for vscale.
+  unsigned getScalarizationCostFactor(ElementCount VF) const {
----------------
dmgreen wrote:
> I had assumed (without thinking about it very much) that the costs for VF arguments would be based on the exact value of VF from the -mcpu argument if one is provided. If it is not, then we would guess at a value; probably VF=2 would be a sensible default. This is using the maximum possible VF, which sounds like a large over-estimate in most cases.
> 
> Can you speak to why the max VF is a better value to use?
> 
> I'm not sure I understand why this is scalarizing though.
Hi @dmgreen, yeah, we are aware of this problem. It's not ideal - at the moment we do this for gather-scatters too. We took the decision to be conservative and use the maximum value of vscale as the worst-case scenario. In practice, the runtime value could vary from machine to machine, and we thought it better to wait a while and revisit this at some point. In fact, that's partly why I created this function: so that we only have to change one place in future. :)

I also think that always choosing the most optimistic case could lead to poor performance, so we have to be careful. One option we have is to use the new vscale_range IR attribute to refine this; another is to choose a value of vscale that represents some sort of average over real use cases.
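For reference, this is roughly what the pessimistic scheme amounts to (a sketch, assuming the AArch64::SVEBitsPerBlock and AArch64::SVEMaxBitsPerVector constants; the committed version may differ in details):

  unsigned getScalarizationCostFactor(ElementCount VF) const {
    if (!VF.isScalable())
      return VF.getFixedValue();
    // Pessimistic: assume the architectural maximum SVE vector length,
    // i.e. vscale = 2048 / 128 = 16.
    unsigned MaxVScale =
        AArch64::SVEMaxBitsPerVector / AArch64::SVEBitsPerBlock;
    return VF.getKnownMinValue() * MaxVScale;
  }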



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105432/new/

https://reviews.llvm.org/D105432
