[PATCH] D39976: [AArch64] Query the target when folding loads and stores

Geoff Berry via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Mar 16 11:00:23 PDT 2018


gberry added a comment.

In https://reviews.llvm.org/D39976#1027478, @evandro wrote:

> In https://reviews.llvm.org/D39976#1027475, @gberry wrote:
>
> > In https://reviews.llvm.org/D39976#1017864, @evandro wrote:
> >
> > > Methinks that the gist is to move away from features and to rely more on the cost model.  In the case of this patch, it also removes the feature `FeatureSlowPaired128` in https://reviews.llvm.org/D40107.
> >
> >
> > That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that.  If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget.  Is there any net effect of this change combined with https://reviews.llvm.org/D40107?
>
>
> `FeatureSlowPaired128` was just too coarse.  The alternative would be to change it to something more specific, like `FeatureSlowSomePaired128Sometimes`, and then create yet another one when the next generation needs to specialize it further.  Instead, querying the scheduling model seems to be a much more reasonable approach.


I'm more confused now.  `FeatureSlowPaired128` controls whether certain load/store opcodes are combined to form paired loads/stores, but this change prevents some load/store opcodes from having their base register increment folded in.  The two seem unrelated.

I'm also concerned that this change is introducing a very specific target hook and is recomputing the same "slowness" of opcodes over and over, even though it doesn't depend on the context.  Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there are going to be differences in future CPUs.
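To make that concrete, here is a rough sketch of the shape I have in mind; the names (`SlowLdStOpcodes`, `isSlowLdStOpcode`) are made up for illustration and are not from this patch:

  #include "llvm/ADT/SmallSet.h"

  // Hypothetical sketch, not the interface in this patch: a subtarget-owned
  // set of "slow" load/store opcodes, filled in once (e.g. from the CPU's
  // scheduling model or a static table) when the subtarget is constructed.
  static llvm::SmallSet<unsigned, 16> SlowLdStOpcodes;

  // Hypothetical hook the load/store optimizer would call: a cheap lookup
  // per opcode instead of re-deriving "slowness" from the scheduling model
  // on every query.
  static bool isSlowLdStOpcode(unsigned Opc) {
    return SlowLdStOpcodes.count(Opc);
  }

That keeps the per-instruction cost in the optimizer trivial and leaves the policy decision entirely in the subtarget.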

In https://reviews.llvm.org/D39976#1031443, @evandro wrote:

> In https://reviews.llvm.org/D39976#998081, @gberry wrote:
>
> > I've thought about this some more and tested it out on Falkor.  As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.
>
>
> Were a test with `-mcpu=falkor` added to `llvm/test/CodeGen/AArch64/ldst-opt.ll`, what should the checks look like?


I would expect this change not to change the code generated for Falkor, so the checks should match whatever is generated currently.

In https://reviews.llvm.org/D39976#999520, @evandro wrote:

> In https://reviews.llvm.org/D39976#998081, @gberry wrote:
>
> > I've thought about this some more and tested it out on Falkor.  As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.
>
>
> I see that they're modeled with a latency of 0 cycles and 4 uops.  Are the units they need, ST and VSD, really used for 0 cycles?


That's not how I understand the scheduling model to work.  Resources are used for `ResourceCycles` cycles (which defaults to 1), so in this case ST and VSD are each used for 1 cycle.  The `Latency` of the store is set to 0 since it doesn't write a register, so the latency doesn't mean anything as far as I can tell.  The pre/post-increment version has a `Latency` of 3 on the first def index (i.e. the updated base register), since that is the latency of reading this new register value.
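For reference, this is roughly how the per-resource occupancy can be read back out of the model with the generic `TargetSchedModel` API (illustration only, not code from this patch; `SchedModel` and `MI` are assumed to be set up by the caller):

  #include "llvm/CodeGen/MachineInstr.h"
  #include "llvm/CodeGen/TargetSchedule.h"
  #include "llvm/Support/Debug.h"

  // Print how long each processor resource (e.g. a store or vector-store-data
  // unit) is occupied by 'MI'.  'SchedModel' is an already-initialized
  // TargetSchedModel for the subtarget in question.
  static void printProcResourceUsage(const llvm::TargetSchedModel &SchedModel,
                                     const llvm::MachineInstr &MI) {
    const llvm::MCSchedClassDesc *SC = SchedModel.resolveSchedClass(&MI);
    for (const llvm::MCWriteProcResEntry &PRE :
         llvm::make_range(SchedModel.getWriteProcResBegin(SC),
                          SchedModel.getWriteProcResEnd(SC)))
      // Cycles comes from ResourceCycles in the .td files and defaults to 1;
      // it is independent of the write's Latency.
      llvm::dbgs() << SchedModel.getProcResource(PRE.ProcResourceIdx)->Name
                   << " busy for " << PRE.Cycles << " cycle(s)\n";
  }

So a latency of 0 on the store write does not mean the units are free; it only means no register result is produced.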


https://reviews.llvm.org/D39976




