[PATCH] D53055: [MCA] Limit the number of bytes fetched per cycle.
    Clement Courbet via Phabricator via llvm-commits 
    llvm-commits at lists.llvm.org
       
    Fri Oct 12 05:09:44 PDT 2018
    
    
  
courbet added a comment.
In https://reviews.llvm.org/D53055#1263141, @andreadb wrote:
> In https://reviews.llvm.org/D53055#1263046, @courbet wrote:
>
> > In https://reviews.llvm.org/D53055#1262077, @andreadb wrote:
> >
> > > >> Now that llvm-mca is a library, people can define their own custom pipeline without having to modify the "default pipeline stages".
> > > >>  In particular, I don't want to introduce any frontend concepts in the default pipeline of llvm-mca.
> > > >>  For now, any frontend simulation should be implemented by stages that are not part of the default pipeline.
> >
> >
> > I don't have a particular opinion about whether this should be part of the default pipeline or not, but I think modeling the frontend is very important.
> >  This article <https://dl.acm.org/citation.cfm?id=2750392> from a couple of years ago analyzes the typical workloads in a Google datacenter. While most of the stalls are from the backend, the frontend has a significant contribution:
>
>
> I think that we are on the same page.
>  I have nothing against having frontend analysis: we want to be able to identify frontend bottlenecks.
>  My point was more about the "development process". I think we need to agree on a plan, and have a good roadmap. It is difficult to evaluate small incremental patches like this if we don't have a "vision". We should have at least an idea on what will be the next steps.
>
> > "Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions".
> > 
> > The authors found this trend to be increasing over time. The article shows that i-cache misses are a large part of these stalls, which is going to be hard to model statically.
> >  However, we also found out that large computation kernels typically had frontend stalls in fetch&decode due to the large size of vector instructions (the Intel fetch window is 16 bytes). We had some nice wins based on llvm_sim, which does simulate the frontend. We've made two of these wins public
> > 
> >   (https://github.com/webmproject/libwebp/commit/67748b41dbb21a43e88f2b6ddf6117f4338873a3, https://github.com/google/gemmlowp/pull/91). 
> >    
> > 
> > I think we agreed during EuroLLVM last year that we should standardize on a single tool to avoid duplicating effort, and that this tool should be llvm-mca. That means that over time llvm-mca needs to grow to support more use cases.
> > 
> >> The default pipeline should only stay focused on simulating the hardware backend logic.
> >> 
> >>> I guess you meant to say: "the default pipeline should only stay focused on simulating what is modeled in the MCSchedModel" ? If we can carve out something that is common to all frontends, then it could end up in MCSchedModel, and then be in the default llvm-mca pipeline. (BTW the resource pointed to by Roman shows that the approach here might not be generic enough).
> >>>  Until then it feels like the flag is a low-cost approach to implementing this.
> >> 
> >> We shouldn't make any changes to the current FetchStage.
> >>  Changes to other stages of the default pipeline are okay as long as we can demonstrate that those are beneficial for all the simulated processors.
> > 
> > I think that's fair. How would you feel about moving the change in this patch into a separate stage ? The flag would then turn on adding the stage so that we can experiment with various CPUs. If that turns out to be useful, we can discuss adding fetch modeling to the MCSchedModel, which will then allow us to add this to the default pipeline in a principled way.
>
> I think that is the right way to go. It would unblock your work in the short term, and give us time to evaluate the quality of the new logic without affecting the default analysis pipeline.
>
> This is pretty much what I was suggesting in my previous comment (i.e. have frontend logic/process defined by separate stages that run before "DispatchStage"). The current FetchStage will be renamed (I would do that after the conference if it is not a problem...), and it would still be the first stage to run. New stages would be marked as "experimental" to start, so that they are opt-in for subtargets.
>
> P.s.: the new stage should have the concept of cache line and alignment, so that we can experiment with different alignment constraints for the input code block. My understanding is that this new Fetch stage models the interaction with an IC (instruction cache); processor models should be able to customize what portion of a cache line can be picked every cycle.
>
> Note however that this new stage may not be always enabled if the processor implements a loop buffer.
>  For example, instructions may be picked from a loop buffer, and not use the legacy decoders path (where instructions are fetched from the IC first). The throughput from the loop buffer normally differs from the throughput from the decoders, and it may be subject to different constraints (i.e. not the size in bytes of an instruction).
> So, I am curious to see how you plan to model those frontend aspects. We may want to have two separate simulations: one where we always assume the IC path; another where we assume instructions are always picked from a loop buffer.
>  In practice, the choice of whether opcodes are contributed by the legacy decoder's path or not depends on the feedback from the branch predictor, and the size of a code snippet. So, the question (not for this patch) is: how much do we want to complicate the model? (That is why I was originally pushing for an RFC; I didn't mean to be annoying...) Should we care about modelling (at least a few) aspects of the branch predictor? We don't have to answer these questions now; I just wanted to further clarify why I feel cautious when it comes to modelling the frontend logic.
I fully agree. The approach we took in llvm_sim is to tell the simulator whether we're in a loop or not with a flag <https://github.com/google/EXEgesis/blob/master/llvm_sim/x86/faucon.cc#L57>. In our implementation we always went through the legacy decoder because our goal was to improve scheduling of large blocks (because that's where rescheduling really makes a difference). That's something we could generalize: we can build a pipeline depending on the structure of the input (an interesting read BTW: https://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of).
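To make the idea concrete, here is a rough sketch of what "build a pipeline depending on the structure of the input" could look like, together with a toy byte-limited fetch model. This is only an illustration under stated assumptions: `FrontendPath`, `buildStageNames`, `countFetchCycles` and the stage name "ByteLimitedFetch" are hypothetical and do not exist in llvm-mca or llvm_sim. The model ignores cache lines, alignment and the branch predictor, and assumes an instruction is only fetched in a cycle if it fits entirely in the bytes left in that cycle's window.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: which frontend path we assume feeds the backend.
enum class FrontendPath { LegacyDecoders, LoopBuffer };

// Hypothetical: choose the stage list from the assumed frontend path.
// The stage names are plain strings for illustration, not llvm-mca classes.
inline std::vector<std::string> buildStageNames(FrontendPath Path) {
  std::vector<std::string> Stages = {"Fetch"}; // existing entry stage
  // Only the legacy decoders path is subject to the per-cycle byte limit.
  if (Path == FrontendPath::LegacyDecoders)
    Stages.push_back("ByteLimitedFetch"); // experimental, opt-in stage
  Stages.push_back("Dispatch");
  Stages.push_back("Execute");
  Stages.push_back("Retire");
  return Stages;
}

// Hypothetical: number of cycles needed to fetch the instructions once, in
// program order, with at most BytesPerCycle bytes fetched per cycle.
// Simplification: every instruction must fit in the fetch window.
inline unsigned countFetchCycles(const std::vector<unsigned> &InstrSizes,
                                 unsigned BytesPerCycle = 16) {
  assert(BytesPerCycle > 0 && "fetch window must be non-empty");
  unsigned Cycles = 0;
  unsigned Remaining = 0; // bytes still available in the current cycle
  for (unsigned Size : InstrSizes) {
    assert(Size > 0 && Size <= BytesPerCycle && "instruction must fit");
    if (Size > Remaining) {
      // Not enough room left: the instruction waits for the next cycle.
      ++Cycles;
      Remaining = BytesPerCycle;
    }
    Remaining -= Size;
  }
  return Cycles;
}
```

With a 16-byte window, four 4-byte instructions are fetched in a single cycle, whereas four 9-byte vector instructions need four cycles, which is the kind of fetch-bound behaviour described above for large vector kernels. When the loop buffer path is assumed, the byte-limited stage is simply not added.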
Repository:
  rL LLVM
https://reviews.llvm.org/D53055
    
    