[PATCH] D117003: [SchedModels][CortexA55] Add ASIMD integer instructions

Thu Feb 10 02:15:14 PST 2022

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA55.td:494
+// COPY
+def : InstRW<[CortexA55WriteCOPY], (instrs COPY)>;
 }
----------------
kpdev42 wrote:
> dmgreen wrote:
> > Does this add a lot? It's not really how COPYs work.
> According to our experiments, an FPU copy (fmov) has a latency of 1 cycle and a throughput of 2, or 1 for the Q-form. According to the model, an integer ALU copy has a 3-cycle latency. What would be the correct model for COPY in your opinion?
Yep - the vector mov latency and throughput sound good to me.

The issue with COPYs is that by post-RA scheduling they won't exist: they will already have been turned into movs, or removed because they were not needed. And pre-RA it is difficult to know whether they will be deleted later, if they are just no-op copies. The assumption in a lot of places will be that they will be removed, so attaching any scheduling info about resources to them can be incorrect.

Cross-register-bank copies, the ones that transfer between GPR and FPR, can be more important and won't be removed as easily.
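
As a rough sketch of that idea, the model could leave the generic COPY alone and describe only the GPR<->FPR moves explicitly. This is not part of the patch: `CortexA55WriteCrossBankMov` is a hypothetical name, the latency is a placeholder rather than a measured value, `CortexA55UnitFPALU` is assumed to be the resource name already used in AArch64SchedA55.td, and the existing generic `WriteFCopy` mapping may already cover these instructions.

```
// Hypothetical sketch only: no InstRW for COPY, just the cross-bank fmovs,
// which are real instructions that survive register allocation.
def CortexA55WriteCrossBankMov : SchedWriteRes<[CortexA55UnitFPALU]> {
  let Latency = 3; // placeholder value, not taken from measurements
}

// FMOV between general-purpose and FP/SIMD registers (W<->S and X<->D).
def : InstRW<[CortexA55WriteCrossBankMov],
             (instrs FMOVWSr, FMOVSWr, FMOVXDr, FMOVDXr)>;
```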

================
Comment at: llvm/test/tools/llvm-mca/AArch64/Cortex/A55-neon-instructions.s:2506
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -     2.00    -      -     ld4r	{ v0.2s, v1.2s, v2.2s, v3.2s }, [sp], x30
-# CHECK-NEXT:  -      -      -      -     0.50   0.50    -      -      -      -      -      -     mla	v0.8b, v0.8b, v0.8b
-# CHECK-NEXT:  -      -      -      -     0.50   0.50    -      -      -      -      -      -     mls	v0.4h, v0.4h, v0.4h
+# CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -     mla	v0.8b, v0.8b, v0.8b
+# CHECK-NEXT:  -      -      -      -      -      -      -     0.50   0.50    -      -      -     mls	v0.4h, v0.4h, v0.4h
----------------
kpdev42 wrote:
> dmgreen wrote:
> > What is the reasoning for the integer multiplies going down the FPMAC pipeline?
> I guess mla/mls (ASIMD multiply/accumulate) utilize the NEON pipeline. For some reason the 2 NEON pipelines of Cortex-A55 are modelled with 5 pipelines (2 x FPALU, 2 x FPMAC, 1 x FPDIV). What do you think would be the correct resource assignment for mla/mls?
I'm not entirely sure either way, to be honest. A lot of this has been around from long ago.

From what I can tell, the FPMAC is for floating point operations that are expected to take a long time (the ones that finish out of order in the optimization guide). There are 2 because of the way it splits 128-bit operations into two 64-bit operations, so that models the dual-issue. I'm not sure what FPDIV is. It models the hazards in fsqrt/fdiv maybe?

So I don't think that the integer mla's need to go onto the same FPMAC pipeline. They can go onto FPALU, I think (or maybe it doesn't matter which they go down, but FPALU sounds more correct to me).
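
A minimal sketch of what routing the integer mla/mls through FPALU might look like, assuming the `CortexA55UnitFPALU` resource (2 units) from the existing model. The write names and latencies here are hypothetical placeholders, not values from the patch or the optimization guide, and the regexes only match the plain vector forms, not the `_indexed` variants.

```
// Hypothetical sketch only. 64-bit (D-form) mla/mls take one FPALU unit.
def CortexA55WriteMlaD : SchedWriteRes<[CortexA55UnitFPALU]> {
  let Latency = 4; // placeholder
}

// 128-bit (Q-form) operations are split into two 64-bit halves, so they
// are modelled as consuming both FPALU units.
def CortexA55WriteMlaQ
    : SchedWriteRes<[CortexA55UnitFPALU, CortexA55UnitFPALU]> {
  let Latency = 4; // placeholder
}

def : InstRW<[CortexA55WriteMlaD], (instregex "^ML[AS]v(8i8|4i16|2i32)$")>;
def : InstRW<[CortexA55WriteMlaQ], (instregex "^ML[AS]v(16i8|8i16|4i32)$")>;
```

With a mapping along these lines, the llvm-mca resource pressure for `mla v0.8b` would be expected to move from the two FPMAC columns to the two FPALU columns in the test above.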

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117003/new/

https://reviews.llvm.org/D117003