[PATCH] D82953: [ARM][MVE] Only tail-fold integer add reductions

Thu Jul 2 03:10:32 PDT 2020

SjoerdMeijer marked an inline comment as done.
SjoerdMeijer added inline comments.

================
Comment at: llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1337
+    }
+    if (I->getOpcode() != Instruction::Add) {
+      LLVM_DEBUG(dbgs() << "Only add reductions supported\n");
----------------
dmgreen wrote:
> samparker wrote:
> > SjoerdMeijer wrote:
> > > SjoerdMeijer wrote:
> > > > samparker wrote:
> > > > > dmgreen wrote:
> > > > > > I took a look at the reduction code in ARMLowOverheadLoops. The vectorizer can create add reductions in a number of places and in a number of ways (it can have multiple reductions, multiple reduction steps, look through phi's and selects and even change the type to a smaller bitwidth). I don't think the backend pass can handle all of them yet.
> > > > > > 
> > > > > > Considering how expensive reverting can be in comparison, what do you think of just disabling all reductions for the time being and re-enabling them once we know that things are working well enough? Maybe add a FIXME for it? I would expect D75069 to handle integer reduction in most cases eventually from the vectorizer directly and we can probably do something similar for other reduction types.
> > > > > In my testing, I couldn't get the vectorizer to produce a tail-folded loops with multiple reductions, so I don't think we have to worry about that at the moment. My basic example also didn't even vectorize a FP reduction loop. I think disabling all reductions would be too conservative now, we should have done it from the beginning but, that now we actually can handle something, I think we should try to have this cost function align with LowOverheadLoops implementation.
> > > > If we really don't want tail-folding, I think that should be controlled in other way than restricting this hook to do nothing. We have option `-prefer-predicate-over-epilog`, and that should do that for now. If source-code modifications and more fine grained control is possible, we even have a pragma.
> > > > 
> > > > Like Sam said, I would like to start with integer add reduction, so that we are also testing that part in the back-end.
> > > I accidentally hit Submit too soon, but wanted to add a bit more nuance to my previous comment: "this hook to do nothing" -> "this hook to do almost nothing".
> > > 
> > > I also wanted to add that in general matching up the behaviour here with the backend is probably best, and also that I could look into disallowing multiple reductions, but looks like we don't really have worry about that. 
> > > 
> > > And if there are some inefficiencies around integer add reductions, I guess that's fine at this stage (tail-folding can be disabled, see previous comment), and I think it's good to have these inefficiencies explicit and visible.
> > If we go for matching the limitations of LowOverheadLoops (of which there are many!) I think we want to check that:
> > - there's only one liveout.
> > - it's an add.
> > - and that add isn't using an add (because this breaks the pitifully weak pattern matching, see one_loop_add_add_v16i8 in the reductions.ll test.
> > 
> > Half related to reductions, I'd also prevent if there's a select in the loop - I'm pretty sure min/max kernels will fail to be tail predicated and we can't handle any VPSEL in the loop either.
> > 
> Hmm. OK. I sent you some examples of the tests I had that were going wrong. Hopefully they will help.
> 
> I would go with conservative for the time being to make turning on TailPred by default easier. We can always change it later, Plus D75069 will change the way integer reductions are generated anyway! I would think the backend pass is more useful for other reduction in the long run - but even then, like we talked about, it might be better to do things earlier.
> 
> But if you guys think that you can get it to work - then sounds good. I'll leave it to you. I personally wouldn't spend a lot of time on it though, if the goal is in the end to do it another way.
That would make most sense to me.

> - there's only one liveout.
> - it's an add.

Check. (well, almost, need to restrict it to 1 live-out).

> - and that add isn't using an add (because this breaks the pitifully weak pattern matching, see one_loop_add_add_v16i8 in the reductions.ll test.

Will add this.

> Half related to reductions, I'd also prevent if there's a select in the loop - I'm pretty sure min/max kernels will fail to be tail predicated and we can't handle any VPSEL in the loop either.

Yep, will add this. This should go in `canTailPredicateInstruction()`, perhaps I will do this in a little separate patch,

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82953/new/

https://reviews.llvm.org/D82953