[PATCH] D89693: [AArch64] Favor post-increments and implement TTI::getPreferredAddressingMode

Sjoerd Meijer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Feb 19 09:41:06 PST 2021


SjoerdMeijer added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1287
+  InductionDescriptor IndDesc;
+  if (!L->getInductionDescriptor(*SE, IndDesc))
+    return TTI::AMK_None;
----------------
fhahn wrote:
> SjoerdMeijer wrote:
> > fhahn wrote:
> > > What if the loop has multiple induction phis and one of those has a large constant step or an unknown step? Should we instead iterate over all phis and check all induction phis (using `InductionDescriptor::isInductionPHI` to identify them)?
> > Yeah, I thought about that a bit. I don't think we are interested in all induction phis here. I think we are interested in what is called the `PrimaryInduction` in the loop vectoriser, i.e. the one that actually controls the loop. And this seems to match exactly with what `getInductionVariable()` promises to return, which is used by `getInductionDescriptor`. That's why this looked okay to me...
> I am not sure I completely understand why the IV that controls the loop is special when it comes to picking the addressing mode for the loop? As D97050 indicates, `getInductionVariable` is quite brittle and probably misses additional cases, so if we can avoid using it, the code should be more robust.
> 
> If we have multiple IVs, would we not be interested in whether we can use post-increments for all of the ones that access memory? You could have a loop with 2 IVs, one to control the loop and one to access memory, like below. If we base the decision on the IV controlling the loop, post-indexing looks profitable, but there won't be any memory accesses using that variable. (In this example the inductions could probably be simplified to a single one, but it keeps things simple.)
> 
> ```
> int I=0, J=0;
> 
> while (I != N) {
>   Ptr[J] = 0;
>   I++;
>   J += 2000;
> }
> ```
Ah, okay, I misunderstood at first but understand your point now! So your suggestion is that we need a more fine-grained (more precise) heuristic.

I would need to give this some more thought, but the current heuristic is based on the distinction between unrolled and not-unrolled loops (which is why I use the primary IV), and that simple heuristic seems to work. In general, I think this is quite a difficult problem, given that there are several addressing modes and potentially quite a few inductions to analyse. That analysis will come at a compile-time cost, and it is unclear at this point whether it will improve results.
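For illustration, a minimal sketch of that primary-IV heuristic (assuming LLVM's `Loop::getInductionDescriptor` and the TTI addressing-mode hook; the exact code in the patch may differ):

```
#include "llvm/Analysis/IVDescriptors.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Analysis/TargetTransformInfo.h"

using namespace llvm;
using TTI = TargetTransformInfo;

// Sketch only: look at the primary induction, i.e. the IV that controls
// the loop, and prefer post-increments when it has a known constant step.
static TTI::AddressingModeKind
preferredAddressingMode(const Loop *L, ScalarEvolution *SE) {
  InductionDescriptor IndDesc;
  if (!L->getInductionDescriptor(*SE, IndDesc))
    return TTI::AMK_None;

  // An unknown (non-constant) step is unlikely to match a post-increment.
  if (!isa<SCEVConstant>(IndDesc.getStep()))
    return TTI::AMK_None;

  return TTI::AMK_PostIndexed;
}
```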

I have verified that the implementation in D89894 works (for runtime-unrolled loops), but it reveals an inefficiency (a missed opportunity) in the load/store optimiser, as noted there, which means we can't yet use it to enable pre-indexed accesses. But perhaps I can use that heuristic, which does a bit more analysis, to decide when *not* to generate pre-indexed accesses and to fall back to post-indexed ones instead.

But this slightly improved heuristic in D89894 may still not be precise enough for your liking... I will try to experiment a little bit with that, but at the moment I tend to think that this patch is a step forward, and that the heuristic can be improved when we find the need for it?
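For reference, the more fine-grained check suggested above, iterating over all induction phis with `InductionDescriptor::isInductionPHI`, could look roughly like the sketch below; the step threshold is purely illustrative and not something from the patch:

```
#include "llvm/Analysis/IVDescriptors.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Return true if every induction phi in the loop header has a known,
// small constant step, i.e. one a post-increment could plausibly cover.
static bool allInductionStepsAreSmall(const Loop *L, ScalarEvolution *SE) {
  for (PHINode &Phi : L->getHeader()->phis()) {
    InductionDescriptor IndDesc;
    if (!InductionDescriptor::isInductionPHI(&Phi, L, SE, IndDesc))
      continue; // Not an induction phi.
    const auto *StepC = dyn_cast<SCEVConstant>(IndDesc.getStep());
    // 256 is an illustrative threshold only.
    if (!StepC || StepC->getAPInt().abs().ugt(256))
      return false;
  }
  return true;
}
```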


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D89693/new/

https://reviews.llvm.org/D89693


