[PATCH] D53055: [MCA] Limit the number of bytes fetched per cycle.

Clement Courbet via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 12 00:50:49 PDT 2018


courbet added a comment.

In https://reviews.llvm.org/D53055#1262077, @andreadb wrote:

> The "FetchStage" is literally just there to create instructions and move them to the next stage.
>
> It is just an artificial entrypoint stage that acts as a "data source" (where data is instructions). It doesn't try to model any hardware frontend concepts.
>
> Its original name was "InstructionFormationStage". We ended up calling it "FetchStage"; in retrospect, that was not a good name because it is ambiguous.
>  We may revert to that name if you like.


Yes, I think that would make sense.

>>> Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
>>>  In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
>>>  For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.

I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
This article <https://dl.acm.org/citation.cfm?id=2750392> from a few years ago analyzes typical workloads in a Google datacenter. While most of the stalls come from the backend, the frontend contributes significantly:

"Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".

The authors found this trend to be increasing over time. The article shows that i-cache misses account for a large share of these stalls; those will be hard to model statically.
However, we also found that large computation kernels typically had frontend stalls in fetch and decode due to the large encoded size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
 (https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91).
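To make the fetch-window effect concrete, here is a tiny standalone simulation (a sketch only, not llvm-mca code: the 16-byte window and the instruction lengths are illustrative assumptions):

  // Standalone illustration, not llvm-mca code: count the cycles a simple
  // fetcher needs when it can deliver at most WindowBytes per cycle.
  #include <cstdio>
  #include <vector>

  unsigned cyclesToFetch(const std::vector<unsigned> &Lengths,
                         unsigned WindowBytes = 16) {
    unsigned Cycles = 0, BytesLeft = 0;
    for (unsigned Len : Lengths) {
      if (Len > BytesLeft) { // Doesn't fit: start a new window next cycle.
        ++Cycles;
        BytesLeft = WindowBytes;
      }
      BytesLeft -= Len;
    }
    return Cycles;
  }

  int main() {
    // Four hypothetical 6-byte AVX encodings: only two fit in one 16-byte
    // window, so fetching them takes 2 cycles instead of 1.
    std::printf("%u\n", cyclesToFetch({6, 6, 6, 6})); // prints 2
    std::printf("%u\n", cyclesToFetch({4, 4, 4, 4})); // prints 1
    return 0;
  }

This is obviously simplified (a real fetcher can carry a partial instruction across window boundaries), but it captures why encoding size matters even before decode.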

I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.

> The default pipeline should only stay focused on simulating the hardware backend logic.
> 
>> I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel"? If we can carve out something that is common to all frontends, it could end up in MCSchedModel and, from there, in the default llvm-mca pipeline. (BTW, the resource Roman pointed to shows that the approach here might not be generic enough.)
>>  Until then, the flag feels like a low-cost way to implement this.
> 
> We shouldn't make any changes to the current FetchStage.
>  Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.

I think that's fair. How would you feel about moving the change in this patch into a separate stage? The flag would then control whether that stage is added, so that we can experiment with it on various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which would then allow us to add this to the default pipeline in a principled way.
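For concreteness, I imagine the stage would look roughly like this (a sketch only: the class below is a simplified stand-in, not the real mca stage interface, and all names are invented):

  // Hypothetical sketch of an opt-in fetch-limiting stage; the interface is
  // a simplified stand-in for the real stage class, not the actual mca API.
  class FetchWindowStage {
    unsigned WindowBytes;            // e.g. 16 bytes on recent Intel cores.
    unsigned BytesLeftThisCycle = 0;

  public:
    explicit FetchWindowStage(unsigned Bytes) : WindowBytes(Bytes) {}

    // Refill the window at the start of every simulated cycle.
    void cycleStart() { BytesLeftThisCycle = WindowBytes; }

    // An instruction only moves on if its encoding still fits in what is
    // left of this cycle's window; otherwise it stalls until the next one.
    bool canConsume(unsigned EncodedSize) const {
      return EncodedSize <= BytesLeftThisCycle;
    }
    void consume(unsigned EncodedSize) { BytesLeftThisCycle -= EncodedSize; }
  };

The flag would then only decide whether such a stage gets appended when building the pipeline; with the flag off, the default pipeline stays exactly as it is today.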

> Any other big change should require an RFC. The introduction of a new stage in the default pipeline should also be justified, as it affects both simulation time and, potentially, the quality of the analysis.
> 
> Essentially, I want to see an RFC on how your team wants to model the frontend simulation.
>  There are too many aspects that cannot be accurately modelled by a static analysis tool: branch prediction, loop buffers, different decoding paths, decoders with different capabilities, instruction byte windows, the instruction decoder's queue, etc.
>  If we want to do that as part of the default pipeline, then we have to be extremely careful and do it right.
> 
> If we don't describe the hardware frontend correctly, we risk pessimizing the analysis rather than improving it. If we decide to add it to the default pipeline, then - at least to start - it should be opt-in for the targets (stages are not added unless the scheduling model for that subtarget explicitly asks for them).
> 
> - About this patch: the number of bytes fetched is not meaningful for the current "FetchStage". The "FetchStage" doesn't (and shouldn't) care about how many bytes an instruction has. More importantly, our "intermediate form" is an already-decoded instruction; checking how many bytes an instruction occupies at that stage is odd. We don't simulate the hardware frontend in the default pipeline (at least, not for now); you need a separate stage for that. So, for now, sorry, but I don't think it is a good compromise.

Repository:
  rL LLVM

https://reviews.llvm.org/D53055
