[PATCH] D75214: [MCA][WIP] Decoder stage (PR42202)

Andrea Di Biagio via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Thu Feb 27 04:30:35 PST 2020


andreadb added a comment.

Hi Roman,

I think that we should further discuss this design in an RFC or on Bugzilla.

For now, I consider this patch an interesting prototype (which presumably works for bdver2). However, a proper design will have to be more generic, and it will require more detail. How much more detail is needed really depends on how accurate the simulation should be.

In my opinion, processor models should be able to describe how decoders work via tablegen.
For example, a target should be able to declare (see the sketch after this list):

- the number of available decoders
- the features of each decoder
  - the maximum number of bytes that a decoder can peek from a byte window during a cycle
  - how many uOps can be generated in a cycle; etc.
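
Just to illustrate, here is a rough C++ sketch of the kind of per-target information that those tablegen definitions could lower to. All the names here (DecoderDescriptor, DecodersInfo, and their fields) are invented for this example; nothing like this exists in the scheduling model today:

  // Hypothetical per-target decoder description; names are invented for
  // this example. It only sketches the information that a processor
  // model could declare via tablegen.
  struct DecoderDescriptor {
    unsigned MaxBytesPerCycle; // Max bytes peeked from the active byte
                               // window during a cycle.
    unsigned MaxUOpsPerCycle;  // Max uOps generated per cycle.
  };

  struct DecodersInfo {
    unsigned NumDecoders;              // Number of available decoders.
    const DecoderDescriptor *Decoders; // One descriptor per decoder.
  };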

Depending on how accurate we want to be, we may also need to model some properties of (what AMD calls) the "Instruction Byte Buffer" (IBB).
An accurate simulation requires that the decoder stage keeps track of which instruction byte window is active during a cycle, and which byte offset should be used by the decoders (that is, the offset past the last successfully decoded instruction). Without that knowledge we lose some accuracy (i.e. we don't accurately model the throughput from the decoders).
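
For reference, a minimal sketch (again with invented names) of the per-cycle bookkeeping that this implies:

  #include <algorithm>
  #include <cstdint>

  // State of the active instruction byte window (invented names, for
  // illustration only).
  struct ByteWindowState {
    uint64_t WindowStart; // Address of the active byte window.
    unsigned Offset;      // Offset past the last successfully decoded
                          // instruction, i.e. where decoding resumes.
  };

  // A decoder can only consume an instruction during this cycle if the
  // instruction fully fits in the bytes that the decoder is allowed to
  // peek at the current offset.
  static bool canDecodeThisCycle(const ByteWindowState &BW,
                                 unsigned InstrSize, unsigned WindowSize,
                                 unsigned MaxBytesPerCycle) {
    unsigned BytesLeft = WindowSize - BW.Offset;
    return InstrSize <= std::min(BytesLeft, MaxBytesPerCycle);
  }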

If we decide that we don't want to go to that level of detail, we still need to take into account that processors may implement loop caches.
MCA should allow users to specify whether they want to simulate fetches from the instruction cache or from a hardware loop buffer (if available at the decode stage). The latter would provide a different throughput, and it would also be subject to different limitations than the decoders. I understand that this may not be useful for bdver2 (or btver2 FWIW). However, it would be useful for pretty much all modern Intel processors, and for Zen.
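
Strictly to illustrate the kind of knob I have in mind (this is not an existing llvm-mca option), the fetch source could be a policy selected when the pipeline is built:

  // Hypothetical knob (not an existing llvm-mca option): select whether
  // the simulated frontend fetches from the instruction cache or from a
  // hardware loop buffer.
  enum class FetchSource { InstructionCache, LoopBuffer };

  struct FetchPolicy {
    FetchSource Source;
    unsigned MaxUOpsPerCycle; // A loop buffer is typically subject to
                              // different throughput/limitations than
                              // the decoders.
  };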

Assuming that microcoded instructions always decode to more than 2 uOps is a reasonable default. However, it would be nicer if processor models were able to override that threshold.
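
For example, the hypothetical descriptor sketched earlier could carry a field along these lines:

  // Hypothetical override: instructions that decode to more than this
  // many uOps are treated as microcoded. Defaults to 2; a processor
  // model could raise or lower it.
  unsigned MicrocodedUOpThreshold = 2;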

P.S.: if you want to accurately model frontend stalls caused by backpressure, then you need to use your stage in conjunction with the MicroOpQueueStage.
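
In pipeline-construction terms, that means something like the sketch below. DecoderStage stands in for your new stage, its constructor arguments are elided, and the MicroOpQueueStage queue size is just a made-up value:

  #include "llvm/MCA/Pipeline.h"
  #include "llvm/MCA/Stages/MicroOpQueueStage.h"
  #include <memory>

  using namespace llvm;

  // Backpressure from dispatch is only modeled correctly when a
  // MicroOpQueueStage sits between the decoders and the dispatch stage.
  auto P = std::make_unique<mca::Pipeline>();
  P->appendStage(std::make_unique<DecoderStage>(/*...*/));
  P->appendStage(std::make_unique<mca::MicroOpQueueStage>(/*Size=*/28));
  // ... followed by the usual dispatch/execute/retire stages.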

As a side note (not related to this patch), in terms of the overall simulation: if we start adding more stages, then at some point we should consider whether to increase the default number of iterations.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75214/new/

https://reviews.llvm.org/D75214