[PATCH] D39976: [AArch64] Query the target when folding loads and stores
Geoff Berry via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Mar 16 11:00:23 PDT 2018
gberry added a comment.
In https://reviews.llvm.org/D39976#1027478, @evandro wrote:
> In https://reviews.llvm.org/D39976#1027475, @gberry wrote:
>
> > In https://reviews.llvm.org/D39976#1017864, @evandro wrote:
> >
> > > Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature `FeatureSlowPaired128` in https://reviews.llvm.org/D40107.
> >
> >
> > That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with https://reviews.llvm.org/D40107?
>
>
> `FeatureSlowPaired128` was just too coarse. The alternative would be to change it to something more specific, like `FeatureSlowSomePaired128Sometimes`, and then create yet another one for the next generation to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.
I'm more confused now. `FeatureSlowPaired128` controls whether certain load/store opcodes are combined to form paired loads/stores. But this change prevents some load/store opcodes from having their base-register increment folded in. The two seem unrelated.
I'm also concerned that this change introduces a very specific target hook and recomputes the same "slowness" of opcodes over and over, even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize from its scheduling model for these opcodes if you think there are going to be differences in future CPUs.
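To make the suggestion concrete, here is a minimal sketch of what such a precomputed per-subtarget table might look like. All names here (`SlowOpcodeTable`, `initForExynos`, the opcode enum) are hypothetical illustrations, not actual LLVM AArch64 definitions; the point is only that the slow set is computed once at subtarget construction rather than re-queried from the scheduling model on every fold attempt.

```cpp
#include <bitset>

// Hypothetical opcode numbering; illustrative only, not LLVM's.
enum Opcode : unsigned { STPQi, STRQpre, STRQpost, LDPQi, NUM_OPCODES };

struct SlowOpcodeTable {
  std::bitset<NUM_OPCODES> Slow;

  // Populated once when the subtarget is constructed (e.g. by walking the
  // CPU's scheduling model), so later queries are a constant-time bit test.
  void initForExynos() {
    Slow.set(STPQi);
    Slow.set(STRQpre);
  }

  bool isSlow(unsigned Opc) const { return Slow.test(Opc); }
};
```

The load/store optimizer would then ask `isSlow(Opc)` instead of invoking a target hook that recomputes the answer each time.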
In https://reviews.llvm.org/D39976#1031443, @evandro wrote:
> In https://reviews.llvm.org/D39976#998081, @gberry wrote:
>
> > I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.
>
>
> Were a test with `-mcpu=falkor` added to `llvm/test/CodeGen/AArch64/ldst-opt.ll`, what should the checks look like?
I would expect this change to not change the code generated for Falkor, so the checks should match whatever is generated currently.
In https://reviews.llvm.org/D39976#999520, @evandro wrote:
> In https://reviews.llvm.org/D39976#998081, @gberry wrote:
>
> > I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.
>
>
> I see that they're modeled with a latency of 0 cycles and 4 uops. Are the units they need, ST and VSD, really used for 0 cycles?
That's not how I understand the scheduling model to work. Resources are used for `ResourceCycles` cycles (which defaults to 1), so in this case ST and VSD are used for 1 cycle. The `Latency` of the store is set to 0, since it doesn't write a register, so the latency doesn't mean anything as far as I can tell. The pre/post increment version has a `Latency` of 3 on the first def index (i.e. the updated base register), since that is the latency of reading this new register value.
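The distinction above can be sketched as a toy model: a store's `Latency` can be 0 (it writes no register) while its resources are still occupied for their `ResourceCycles` (defaulting to 1 per resource). The struct and field names below mirror the scheduling-model concepts loosely and are illustrative, not LLVM's actual classes.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a scheduling write resource entry.
struct SchedWriteRes {
  std::vector<std::string> Resources;  // units consumed, e.g. {"ST", "VSD"}
  std::vector<int> ResourceCycles;     // empty => default of 1 per resource
  int Latency;                         // cycles until the def is readable
};

// How many cycles a given unit is busy executing this write.
int cyclesBusy(const SchedWriteRes &W, const std::string &Unit) {
  for (size_t I = 0; I < W.Resources.size(); ++I)
    if (W.Resources[I] == Unit)
      return W.ResourceCycles.empty() ? 1 : W.ResourceCycles[I];
  return 0;  // unit not used by this write
}
```

With this model, a plain SIMD store occupies ST and VSD for 1 cycle each despite a latency of 0, while the pre/post increment variant additionally reports a latency of 3 for its updated base register.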
https://reviews.llvm.org/D39976