[PATCH] D53055: [MCA] Limit the number of bytes fetched per cycle.

Andrea Di Biagio via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 11 09:45:00 PDT 2018


andreadb added a comment.

In https://reviews.llvm.org/D53055#1261982, @courbet wrote:

> Hi Andrea,
>
> > There is already bug https://bugs.llvm.org/show_bug.cgi?id=36665, which is about adding support for simulating the hardware frontend logic.
> >  I know that @courbet and his team would like to work on it. So, you can probably try to work with them on this.
> >  Unfortunately, that bugzilla must be updated; there is not enough information there (I suggested sending a detailed RFC upstream, if necessary).
> > 
> > I strongly suggest that you/your team and Clement's team work together on that task. I am afraid that people may be working on the same tasks in parallel. That has to be avoided.
> >  You can use that bugzilla to coordinate your work upstream on this.
>
> Let me clarify this: Owen is working with us :) He has taken over the genetic scheduler work I presented at EuroLLVM. One of the bottlenecks we had was the frontend, hence this change. I agree that this should have been made clearer (@owenrodley, can you create a bugzilla account and assign the bug to yourself?)


Okay. Good to know that there is no overlap :-).

> In https://reviews.llvm.org/D53055#1260195, @andreadb wrote:
> 
>> The default pipeline in llvm-mca doesn't simulate any hardware frontend logic.
>> 
>> The `FetchStage` in llvm-mca is only responsible for creating instructions and moving them to the next pipeline stage.
>>  It should not be confused with the fetch logic in the hardware frontend, which - as you wrote - is responsible for fetching portions of a cache line every cycle and feeding them to the decoders via an instruction byte queue.
>> 
>> The llvm-mca Fetch stage is equivalent to an unbounded queue of already decoded instructions. Instructions from every iteration are immediately available at cycle 0.
> 
> All of this sounds more like a naming issue than an issue about what Owen is trying to implement.
>  Maybe we could rename `FetchStage` to `DecodedStage` or something like this?

The "FetchStage" is literally just there to create instructions and move them to the next stage.

It is just an artificial entrypoint stage that acts as a "data source" (where the data is instructions). It doesn't try to model any hardware frontend concepts.

Its original name was "InstructionFormationStage". We ended up calling it "FetchStage"; in retrospect, it was not a good name because it causes ambiguity.
We may revert to that name if you like.
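
To make the current behavior concrete, here is a deliberately simplified sketch of what the stage does. This is not the actual llvm-mca Stage interface; `Stage`, `SourceStage` and the method names below are stand-ins for illustration only:

  #include <memory>
  #include <queue>

  struct Instruction { /* an already decoded instruction */ };

  struct Stage {
    virtual ~Stage() = default;
    // Returns true if the next stage accepted the instruction.
    virtual bool execute(Instruction &I) = 0;
  };

  // The "FetchStage" equivalent: a pure data source. No bytes, no cache
  // lines, no decoders are modeled here.
  class SourceStage : public Stage {
    std::queue<std::unique_ptr<Instruction>> Source;
    Stage *Next = nullptr;

  public:
    void setNextInSequence(Stage *S) { Next = S; }
    void addInstruction(std::unique_ptr<Instruction> I) {
      Source.push(std::move(I));
    }
    bool execute(Instruction &I) override { return Next->execute(I); }

    // Every cycle: push instructions downstream until the next stage
    // refuses to accept more. All instructions exist from cycle 0.
    void cycle() {
      while (!Source.empty() && execute(*Source.front()))
        Source.pop();
    }
  };

As you can see, nothing in it knows about cache lines or byte windows; it is just an unbounded queue feeding the rest of the pipeline.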

> 
>> Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
>>  In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
>>  For now, any frontend simulation should be implemented by stages that are not part of the default pipeline. The default pipeline should only stay focused on simulating the hardware backend logic.
> 
> I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel"? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW, the resource Roman pointed to shows that the approach here might not be generic enough.)
>  Until then, it feels like the flag is a low-cost approach to implementing this.

We shouldn't make any changes to the current FetchStage.
Changes to other stages of the default pipeline are okay, as long as we can demonstrate that they are beneficial for all the simulated processors.

Any other big change should require an RFC. The introduction of a new stage in the default pipeline should also be justified, as it affects both simulation time and, potentially, the quality of the analysis.

Essentially, I want to see an RFC on how your team wants to model the frontend simulation.
There are too many aspects that cannot be accurately modeled by a static analysis tool: branch prediction, loop buffers, different decoding paths, decoders with different capabilities, instruction byte windows, the instruction decoders' queue, etc.
If we want to do that as part of the default pipeline, then we have to be extremely careful and do it right.

If we don't describe the hardware frontend correctly, we risk pessimizing the analysis rather than improving it. If we decide to add it to the default pipeline, then - at least to start - it should be opt-in for the targets (stages are not added unless the scheduling model for that subtarget explicitly asks for them).
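
To illustrate the opt-in idea, pipeline construction would look roughly like the sketch below. Everything here is hypothetical: there is no `SimulateFrontend` bit in MCSchedModel today, and `FrontendStage` does not exist; the point is only that targets which don't ask for the stage keep the current default pipeline:

  #include <memory>
  #include <vector>

  struct Stage { virtual ~Stage() = default; };
  struct SourceStage : Stage {};
  struct FrontendStage : Stage {}; // the hypothetical new stage
  struct DispatchStage : Stage {};

  struct Pipeline {
    std::vector<std::unique_ptr<Stage>> Stages;
    void appendStage(std::unique_ptr<Stage> S) {
      Stages.push_back(std::move(S));
    }
  };

  // Hypothetical: a scheduling model that carries an explicit opt-in bit.
  struct SchedModel { bool SimulateFrontend = false; };

  void buildDefaultPipeline(const SchedModel &SM, Pipeline &P) {
    P.appendStage(std::make_unique<SourceStage>());
    if (SM.SimulateFrontend) // opt-in, per subtarget
      P.appendStage(std::make_unique<FrontendStage>());
    P.appendStage(std::make_unique<DispatchStage>());
    // ... execute/retire stages as usual ...
  }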

-

About this patch:
the number of bytes fetched is not meaningful for the current "FetchStage".
The "FetchStage" doesn't/shouldn't care about how many bytes an instruction has. More importantly, our "intermediate form" is an already decoded instruction; the whole idea of checking how many bytes an instruction is at that stage is odd. We don't simulate the hardware frontend in the default pipeline (at least, not for now).
You need a separate stage for that. So, for now, sorry but I don't think it is a good compromise.
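
For the record, a separate stage is also easy to prototype outside the default pipeline. The sketch below is purely illustrative (the class name, the fixed per-cycle byte budget, and the retry policy are all my assumptions, not a proposal):

  #include <cassert>

  struct Instruction { unsigned NumBytes; };

  // Hypothetical stage that limits how many instruction bytes can move
  // downstream per simulated cycle.
  class InstructionByteFetchStage {
    const unsigned BytesPerCycle; // e.g. 16, as on several x86 frontends
    unsigned BudgetLeft;

  public:
    explicit InstructionByteFetchStage(unsigned BPC)
        : BytesPerCycle(BPC), BudgetLeft(BPC) {}

    // Reset the byte budget at the start of every simulated cycle.
    void cycleStart() { BudgetLeft = BytesPerCycle; }

    // An instruction only moves on if this cycle's byte budget covers it;
    // otherwise it stalls and retries on the next cycle. (Instructions
    // larger than the whole budget would need special casing.)
    bool canMoveToNextStage(const Instruction &I) const {
      return I.NumBytes <= BudgetLeft;
    }
    void moveToNextStage(const Instruction &I) {
      assert(canMoveToNextStage(I) && "byte budget exceeded");
      BudgetLeft -= I.NumBytes;
    }
  };

A stage like this can live next to the default pipeline and be composed only by the users/targets that want it, which is exactly why I don't want the byte accounting inside the "FetchStage".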


Repository:
  rL LLVM

https://reviews.llvm.org/D53055




