[llvm-dev] [llvm-mca] Resource consumption of ProcResGroups

Andrew Trick via llvm-dev llvm-dev at lists.llvm.org
Sun May 10 06:32:47 PDT 2020



> On May 9, 2020, at 5:12 PM, Andrea Di Biagio via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> The llvm scheduling model is quite simple and doesn't allow mca to accurately simulate the execution of individual uOPs. That limitation is sort-of acceptable if you consider that the scheduling model framework was originally designed with a different goal in mind (i.e. machine scheduling). The lack of expressiveness of the llvm scheduling model unfortunately limits the accuracy of llvm-mca: we know the number of uOPs of an instruction, but we don't know which resources are consumed by which uOPs. So we cannot accurately simulate the independent execution of the individual uOPs of an instruction.
> 
> Another "problem" is that it is not possible to describe when uOPs effectively start consuming resources. At the moment, the expectation is that resource consumption always starts at relative cycle #0 (relative to the instruction issue cycle).
> Example: a horizontal add on x86 is usually decoded into a pair of shuffle uOPs and a single (data-dependent) vector ADD uOP.
> The ADD uOP doesn't execute immediately because it needs to wait for the other two shuffle uOPs. It means that the ALU pipe is still available at relative cycle #0 and it is only consumed starting from relative cycle #1 (assuming that both shuffles can start execution at relative cycle #0). In practice, the llvm scheduling model only allows us to declare which pipeline resources are consumed, and for how long (in number of cycles). So we cannot accurately describe to mca the delayed consumption of the ALU pipe.
> Now think about what happens if the first shuffle uOP consumes 1cy of HWPort0, the second shuffle uOP consumes 1cy of HWPort1, and the ADD consumes 1cy of HWPort01. We end up in that "odd" situation you described where HWPort01 is "reserved" for 1cy.
> In reality, that 1cy of HWPort01 should have started 1cy after the other two uOPs. At that point, both pipelines would have been seen as available.
> 
> In conclusion, the presence of a "reserved" flag is not ideal, but it is sort-of a consequence of the two limitations mentioned above (plus the way the Haswell and Broadwell models were originally designed).
> 
> I hope it helps,
> -Andrea

Food for thought...

It would be easy to add a DelayCycles vector to SchedWriteRes to indicate the relative start cycle for each reserved resource. That would effectively model dependent uOps.

NumMicroOps is only meant to model any general limitation of the CPU frontend to issue/rename/retire micro-ops. So, yes, there's no way to associate resources with specific uOps. You can mark any kind of resource as "dynamically scheduled" (BufferSize = -1). If an instruction uses different kinds of dynamic resources, then those need not be reserved at the same time. If we had the DelayCycles vector, it could be interpreted as "this resource must be reserved N cycles after prior reservations of other resources".
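
To make that interpretation concrete, here is a rough sketch in Python of the horizontal add example from the quoted message. It is a toy model, not LLVM or llvm-mca code: the `Use` record, its `delay` field (standing in for one entry of the proposed DelayCycles vector), the `place` helper and the greedy unit selection are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Resource group from the example in the thread: HWPort01 can be
# served by either HWPort0 or HWPort1.
GROUPS = {"HWPort01": ("HWPort0", "HWPort1")}


@dataclass(frozen=True)
class Use:
    resource: str   # unit or group name
    cycles: int     # how long the resource is held
    delay: int = 0  # hypothetical DelayCycles entry, relative to issue


def place(uses, issue_cycle=0):
    """Greedily assign each use to a unit.

    Returns ({unit: set of busy cycles}, [uses that found no free unit]).
    """
    busy = {}
    unplaced = []
    # Place plain units first so that group uses see them as occupied.
    for use in sorted(uses, key=lambda u: u.resource in GROUPS):
        start = issue_cycle + use.delay
        span = set(range(start, start + use.cycles))
        candidates = GROUPS.get(use.resource, (use.resource,))
        for unit in candidates:
            if not (busy.get(unit, set()) & span):
                busy.setdefault(unit, set()).update(span)
                break
        else:
            # No unit of the group is free over the whole span.
            unplaced.append(use)
    return busy, unplaced


# Today's model: every consumption starts at relative cycle 0.
HADD_TODAY = [Use("HWPort0", 1), Use("HWPort1", 1), Use("HWPort01", 1)]

# With a hypothetical delay of 1 cycle on the ADD's group resource.
HADD_DELAYED = [Use("HWPort0", 1), Use("HWPort1", 1), Use("HWPort01", 1, delay=1)]

if __name__ == "__main__":
    print(place(HADD_TODAY))
    print(place(HADD_DELAYED))
```

Without the delay, both ports are already taken at cycle 0, so the HWPort01 use has no free unit to land on, which corresponds to the "reserved" situation described above. With the delay, the same use starts at cycle 1 and fits on HWPort0.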

-Andy