[PATCH] [AArch64] Refines the Cortex-A57 Machine Model

Wed Sep 17 09:07:53 PDT 2014

> Setting IssueWidth=3 is correct. That really means how many micro-ops can be "handled" per cycle. So it should be the minimum of decode/issue width. To be precise, we should have a decodeWidth that counts instructions, but I never bothered to add it since IssueWidth can serve the same purpose.

Thanks for the clarification.

> MicroOpBufferSize determines in-order modeling of latency. It's your machine, so if you want to model it as in-order and get better results, then I can't argue!
> 
> You could go even further and model the in-order stalls on functional units that are not fully pipelined by setting BufferSize=0.
> Note that you can have a mix of in-order/out-of-order resources if you choose.

I figured there were some tradeoffs with modeling purely in-order, but the gains were so broadly beneficial that it was a no-brainer. I really want to do just this and model both the in-order and out-of-order portions of the pipeline for each instruction. It wasn't immediately obvious how to do it, so I temporarily shelved the idea. Might be a nice experiment for a proposed SchedMachineModel tutorial. :)
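
A minimal sketch of the mixed in-order/out-of-order setup described in the quoted text, written against the TableGen classes in TargetSchedule.td as they stood at the time of this thread. Every def name (ExampleModel, ExUnitALU, ExUnitDiv, ExUnitFP, ExWrite*) is hypothetical, and every number other than IssueWidth = 3 is a placeholder rather than a Cortex-A57 figure; none of this is taken from the patch under review.

```tablegen
// Hypothetical model for illustration only; not the Cortex-A57 model from D5372.
def ExampleModel : SchedMachineModel {
  let IssueWidth = 3;          // micro-ops handled per cycle (min of decode/issue width)
  let MicroOpBufferSize = 128; // placeholder; any value > 1 selects out-of-order modeling
  let LoadLatency = 4;         // placeholder
  let MispredictPenalty = 14;  // placeholder
}

let SchedModel = ExampleModel in {
  // Default BufferSize (-1): the unit shares the unified micro-op buffer,
  // so instructions using it are modeled out-of-order.
  def ExUnitALU : ProcResource<2>;

  // BufferSize = 0: the unit must be free at issue, so a busy unit is
  // modeled as an in-order stall (useful for units that are not fully pipelined).
  def ExUnitDiv : ProcResource<1> { let BufferSize = 0; }

  // BufferSize = 1: instructions using this unit get in-order latency
  // even though MicroOpBufferSize above is large.
  def ExUnitFP : ProcResource<1> { let BufferSize = 1; }

  def ExWriteALU : SchedWriteRes<[ExUnitALU]> { let Latency = 1; }

  // Placeholder numbers: the divider is held for the full latency,
  // which is what makes it "not fully pipelined" in the model.
  def ExWriteDiv : SchedWriteRes<[ExUnitDiv]> {
    let Latency = 10;
    let ResourceCycles = [10];
  }

  def ExWriteFMul : SchedWriteRes<[ExUnitFP]> { let Latency = 5; }
}
```

A write that lists more than one resource, e.g. SchedWriteRes<[ExUnitFP, ExUnitDiv]>, would be one way to have a class of instructions consume multiple resources and so model both in-order latency and in-order resource contention. In a real target these writes would still need to be tied to instructions through WriteRes or InstRW mappings, which the sketch leaves out.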

> You can also model just a certain class of instructions as having in-order latency by boosting MicroOpBufferSize and setting BufferSize=1. You can have a class of instructions consume multiple resources so you could model both in-order resource contention and latency.
> 
> Note that the idea behind modeling out-of-order is that we don't want an instruction issue limitation to be modeled as a hard stall that preempts all other heuristics. There are thresholds and heuristics that then come into play to try to balance resources. However, the default heuristics are very conservative, in the sense that the schedule is preserved unless we suspect a real stall (first do no harm). Given the scheduler only sees a single block, it often doesn't do anything to improve issue bandwidth on an aggressive OOO model. The scheduler could be improved by recognizing loops, inferring a steady cpu state and adjusting heuristics. I've added some loop awareness to the heuristics but it could be much better.

I really like this idea of adjusting heuristics. Do you think this is something that PGO could also help with?

> Since you have plenty of registers, scheduling in-order probably doesn't often hurt and is occasionally useful depending on how effective the hardware is at balancing instruction dispatch. You'll probably see a lot of unnecessary shuffling with in-order scheduling, but if you get better performance, then it's worth it.
> 
> One thing you will notice is that interdependent instructions will no longer be scheduled in the same 3-wide decoding group. Since we're not inserting nops, it's probably not a big deal though.

Thanks again for all of the clarification, Andy.

http://reviews.llvm.org/D5372