[PATCH] D105432: [Analysis] Add simple cost model for strict (in-order) reductions

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jul 7 09:51:41 PDT 2021


dmgreen added inline comments.


================
Comment at: llvm/include/llvm/Analysis/TargetTransformInfo.h:1170
 
+  enum class ReductionType : uint8_t { Split, Ordered };
+
----------------
sdesmalen wrote:
> RKSimon wrote:
> > Please can you add doxygen description comments for these?
> nit: may I request this (and any other references to it) to be renamed to TreeWise? In my experience that's a more common term used with these kinds of reductions.
Could we represent this by passing FastMathFlags rather than a new enum? The fast math flags are what dictate in-order vs. relaxed evaluation, for example. And I don't believe an integer "Ordered" reduction is ever different from an unordered one.
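
(For illustration, a minimal sketch of the kind of FastMathFlags-based check being suggested here - the helper name is hypothetical and not part of the patch:)

  #include "llvm/IR/Operator.h" // FastMathFlags
  #include "llvm/IR/Type.h"

  using namespace llvm;

  // Hypothetical helper: an FP reduction must be evaluated strictly
  // in-order unless reassociation is allowed; integer reductions are
  // associative regardless, so they never need the ordered model.
  static bool requiresOrderedReduction(Type *EltTy, FastMathFlags FMF) {
    return EltTy->isFloatingPointTy() && !FMF.allowReassoc();
  }

A cost hook taking FastMathFlags could branch on a predicate like this instead of threading a separate ReductionType through the API.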


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h:136-137
+  /// when scalarizing an operation for a vector with ElementCount \p VF.
+  /// For scalable vectors this currently takes the most pessimistic view based
+  /// upon the maximum possible value for vscale.
+  unsigned getScalarizationCostFactor(ElementCount VF) const {
----------------
david-arm wrote:
> dmgreen wrote:
> > david-arm wrote:
> > > dmgreen wrote:
> > > > I had assumed (without thinking about it very much) that the costs for VF arguments would be based on the exact value of VF from the -mcpu argument, if one is provided. If it is not, then we would guess at a value; VF=2 would probably be a sensible default. This is using the maximum possible VF instead, which sounds like a large over-estimate in most cases.
> > > > 
> > > > Can you speak to why the max VF is a better value to use?
> > > > 
> > > > I'm not sure I understand why this is scalarizing though.
> > > Hi @dmgreen, yeah, we are aware of this problem. It's not ideal - at the moment we do the same for gather-scatters. We took the decision to be conservative and use the maximum value of vscale as the worst-case scenario. In practice the runtime value could vary from machine to machine, and we thought it better to wait a while and revisit this at some point. In fact, that's partly why I created this function - so that we only have to change one place in the future. :)
> > > 
> > > I also think that always choosing the most optimistic case could lead to poor performance, so we have to be careful. One option we have is to use the new vscale range IR attributes to refine this; another is to choose a value of vscale that represents some sort of average across real use cases.
> > > 
> > OK - more work is needed. Got it. I would have expected these cost factors to come from the subtarget, not an IR attribute.
> > 
> > What is being scalarized here, though? From https://godbolt.org/z/fcz71dPeY, for example? Some of the Illegal types were hitting errors.
> Even though there is a single faddv instruction, I think for now it still makes sense to model this as being scalarised, because conceptually the lanewise FP additions still have to be done in sequence rather than tree-wise.
Then what performs the scalarization?
https://godbolt.org/z/hfeaYh8r8
TargetLowering::expandVecReduce doesn't appear to handle it, which would imply to me that the cost should be "Invalid".

Or do you mean that the fadda will have a low throughput?
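
(For readers following along, a rough sketch of the pessimistic factor being discussed - the max-vscale constant is my assumption based on SVE's 2048-bit architectural limit, not necessarily what the patch uses:)

  #include "llvm/Support/TypeSize.h" // ElementCount

  using namespace llvm;

  // Hypothetical sketch: scale a scalable VF by the worst-case vscale.
  // SVE caps vectors at 2048 bits, i.e. vscale <= 16 with 128-bit granules.
  static unsigned getScalarizationCostFactor(ElementCount VF) {
    constexpr unsigned MaxVScale = 16;
    return VF.isScalable() ? VF.getKnownMinValue() * MaxVScale
                           : VF.getFixedValue();
  }

If a target genuinely cannot lower the operation, returning InstructionCost::getInvalid() would be the alternative dmgreen alludes to above.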


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105432/new/

https://reviews.llvm.org/D105432


