[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Fri Mar 2 06:42:29 PST 2018

Hi Andrew,

Thanks for the feedback!

On Fri, Mar 2, 2018 at 1:16 AM, Andrew Trick <atrick at apple.com> wrote:

>
> On Mar 1, 2018, at 9:22 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> Hi all,
>
> At Sony we developed an LLVM based performance analysis tool named
> llvm-mca. We
> currently use it internally to statically measure the performance of code,
> and
> to help triage potential problems with target scheduling models.  We
> decided to
> post this RFC because we are interested in the feedback from the
> community, and
> we also believe that other people might be interested in a tool like this.
>
>
> This is a dream come true!
>

> llvm-mca uses information which is already available in LLVM (e.g.
> scheduling
> models) to statically measure the performance of machine code in a
> specific cpu.
> Performance is measured in terms of throughput as well as processor
> resource
> consumption.  The tool currently works for processors with an out-of-order
> backend, for which there is a scheduling model available in LLVM.
>
> The main goal of this tool is not just to predict the performance of the
> code
> when run on the target, but also help with diagnosing potential performance
> issues.
>
>
> You can’t really have one without the other, which is why this is a dream
> come true.
>
> Given an assembly code sequence, llvm-mca estimates the IPC (instructions
> per
> cycle), as well as hardware resources pressure. The analysis and reporting
> style
> were inspired by the IACA tool from Intel.
>
> The presence of long data dependency chains, as well as poor usage of
> hardware
> resources may lead to bottlenecks in the back-end.  The tool is able to
> generate
> a detailed report which should help with identifying and analyzing sources
> of
> bottlenecks.
>
> Scheduling models are mostly used to compute instruction latencies, to
> obtain
> read-advance information, and understand how processor resources are used
> by
> instructions.  By design, the quality of the performance analysis
> conducted by
> the tool is inevitably affected by the quality of the target scheduling
> models.
>
> However, scheduling models intentionally do not describe all processors
> details,
> since the goal is just to enable the scheduling of machine instructions
> during
> compilation. That means, there are processor details which are not
> important for
> the purpose of scheduling instructions (and therefore not described by the
> scheduling model), but are very important for this tool.
>
>
> The LLVM machine model can have as much detail as we want, as long as it’s
> opt-in for targets.
> We’ve always had more detail than the generic scheduler actually needed.
> e.g. the scheduler doesn’t currently care how big the per-resource buffer
> sizes are.
> Targets have their own scheduling strategies that can pick and choose.
>
> At one point I wrote code that could be called by the scheduler to
> simulate OOO execution at much greater detail, but the benefit didn’t
> justify the complexity and compile time. I always felt that a static
> analysis tool was the right place for this kind of simulation.
>

I agree. I am pretty confident that all the extra details can become opt-in
for targets.

> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
>
>
> MicroOpBufferSize is presumed to cover register renaming and retirement,
> assuming they are well-balanced. For your tool, you certainly want to be
> more precise.
>

Yes. The long term goal is to have specific (optional) fields for targets
that want to specify a different value. MicroOpBufferSize could still be
used as the default value in the absence of extra information.

>  - Actual dispatch width (it often differs from the issue width).
>
>
> This was always a hard one to generalize in a machine independent way, and
> half the world seems to swap the meaning of those terms. If you’re going to
> make this distinction please define the terms clearly in the machine model
> and explain how tools are expected to use the information.
>
> Currently IssueWidth is used to tell the scheduler that ’N' microops will
> definitely take ’N’ / ‘IssueWidth’ cycles regardless of which functional
> units or dispatch pipeline is involved.
>
> [Reading below, I saw that you define instruction “dispatch" as what the
> LLVM machine model calls “issue”. LLVM doesn’t model dispatch directly
> because, by its definition, it is redundant with the number of function
> units, modulo some dynamic behavior that can’t be statically predicted
> anyway]
>
> We also don’t model decoding, because, presumably you hit the micro-op
> limit first.
>
>  - Number of temporary registers available for renaming.
>
>
> See MicroOpBufferSize above.
>

Right. This can be another detail that targets could expose.

>  - Number of read/write ports in the register file(s).
>
>
> The assumption is that we hit micro-op issue width first. I suppose it’s
> good to have though.
>

>  - Length of the load/store queue in the LSUnit.
>
>
> That was supposed to be covered by per-processor-resource buffer size.
> Maybe you want to simulate the load/store queue differently from other
> functional units? e.g. one shared queue across multiple load store units?
>

Yes. To start, I'd like to be able to have a unified LSUnit. In future,
this design could be improved.

> It is also very difficult to find a "good" abstract model to describe the
> behavior of out-of-order processors. So, we have to keep in mind that all
> of
> these aspects are going to affect the quality of the static analysis
> performed
> by the tool.
>
>
> Like the scheduler, the tool should be extensible so that different
> targets can simulate behavior that doesn’t fit some abstract model.
>

This has been mentioned by Clement too. The long term goal is to make it
possible for targets to specify what model they want. As Clement pointed
out in the code review, this will require a few inital changes to make the
tool more modular in preparation for it. I suggested in the review if it is
possible to use the current design as a baseline, and improve it
incrementally with later patches. Essentially, commit the current baseline
design and then start changing/improving it with other patches would make
it easier for others to contribute their own patches.

>
> An extensive list of known limitations is reported in one of the last
> sections
> of this document. There is also a section related to design problems which
> must
> be addressed (hopefully with the help of the community).  At the moment,
> the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra
> information
> into the scheduling models.
>
> As was mentioned before, this tool has been (and is still being) used
> internally
> in Sony to debug/triage issues in the btver2 scheduling model. We have also
> tested it on other targets to check how generic the tool is. In our
> experience,
> the tool makes it easy to identify simple mistakes like "wrong number of
> micro
> opcodes specified for an instruction", or "wrong set of hardware
> resources".
> Some of these mistakes are quite common (sometimes on mature models too),
> and
> often difficult to catch.  Reports generated by this tool are simple to
> analyze,
> and contain enough details to help triage most performance problems.
>
> 1. How the tool works
> ——————————
>
>
> Nice documentation. Please find a good place for it to live and include it
> in your patch.
>

I can copy/paste this RFC into a README.txt (removing the initial paragraph
and probably the conclusions). :-)

> <snip>
>
>
> Timeline View
> -------------
>
> A detailed report of each instruction's state transitions over time can be
> enabled using the command line flag '-timeline'.  This prints an extra
> section
> in the report which contains the so-called "timeline view".  Below is the
> timeline view for the dot-product example from the previous section.
>
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789
>
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Truly awesome.
>

Thanks. :-)

>
>
>
> Extra statistics to further diagnose performance issues.
> --------------------------------------------------------
>
> Flag '-verbose' enables extra statistics and performance counters for the
> dispatch logic, the reorder buffer, the retire control unit and the
> register
> file.
>
> Below is an example of verbose output generated by the tool for the
> dot-product
> example discussed in the previous sections.
>
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
>
>
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
>
>
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
>
>
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
>
>
> Retire Control Unit - number of cycles where we saw N instructions retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
>
>
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
>
>
> So great!
>
> LLVM-MCA instruction flow
> -------------------------
>
> This section describes the instruction flow through the out-of-order
> backend, as
> well as the functional units involved in the process.
>
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally
> committed).
>
> The tool only models the out-of-order portion of a processor. Therefore,
> the
> instruction fetch and decode stages are not modeled. Performance
> bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that
> instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't know
> anything about branch prediction.
>
>
> The following sections are fantastic documentation for your tool and the
> machine model in general. I hope you check all this in somewhere.
>
> Instruction Dispatch
> --------------------
>
> During the Dispatch stage, instructions are picked in program order from a
> queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit in
> file
> Dispatch.h.
>
> The size of a dispatch group depends on the availability of hardware
> resources,
> and it cannot exceed the value of field 'DispatchWidth' in class
> DispatchUnit.
> Note that field DispatchWidth defaults to the value of field 'IssueWidth'
> from
> the scheduling model.
>
> Users can override the DispatchWidth value with flag "-dispatch=<N>"
> (where 'N'
> is an unsigned quantity).
>
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
>
>
> This is really what’s meant by LLVM’s IssueWidth. I think Intel always
> preferred to call it “dispatch”. Presumably, instructions are buffered
> before being dispatched/issued to functional units, so LLVM’s “IssueWidth”
> is meant to model a hardware restriction that is independent from the
> number of functional units (processor resources).
>
> It’s especially confusing because some µArchs have dispatch pipelines
> decoupled from the functional units.
>

I know... I guess, as long as we standardize the terminology, things should
be fine.

> Scheduling models don't describe register files, and therefore the tool
> doesn't
> know if there is more than one register file, and how many temporaries are
> available for register renaming.
>
>
> The LLVM machine model can model the size of the register renaming pool if
> you like.
>
> By default, the tool (optimistically) assumes a single register file with
> an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A value
> of
> zero for N means 'unbounded'.  Knowing how many temporaries are available
> for
> register renaming, the tool can predict dispatch stalls caused by the lack
> of
> temporaries.
>
> The number of reorder buffer entries consumed by an instruction depends on
> the
> number of micro-opcodes it specifies in the target scheduling model (see
> field
> 'NumMicroOpcodes' of tablegen class ProcWriteResources and its derived
> classes;
> TargetSchedule.td).
>
> The reorder buffer is implemented by class RetireControlUnit (see
> Dispatch.h).
> Its goal is to track the progress of instructions that are "in-flight", and
> retire instructions in program order.  The number of entries in the reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the target
> scheduling model.
>
> Instructions that are dispatched to the schedulers consume scheduler buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are
> treated
> like "scheduler" resources, and the field 'BufferSize' (from the processor
> resource tablegen definition) defines the size of the scheduler's queue.
>
> Zero latency instructions (for example NOP instructions) don't consume
> scheduler
> resources.  However, those instructions still reserve a number of slots in
> the
> reorder buffer.
>
>
> As currently modeled by dummy µOps:
>
> // A single nop micro-op (uX).
> def WriteX : SchedWriteRes<[]> { let Latency = 0; }
>
> You’ll need MachineInstr’s though.
>
> Instruction Issue
> -----------------
>
> As mentioned in the previous section, each scheduler resource implements a
> queue
> of instructions.  An instruction has to wait in the scheduler's queue until
> input register operands become available.  Only at that point, does the
> instruction becomes eligible for execution and may be issued (potentially
> out-of-order) to a pipeline for execution.
>
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model
> through
> ProcWriteResources objects.
>
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple
> processor
> schedulers.  A Scheduler is responsible for tracking data dependencies, and
> dynamically select which processor resources are consumed/used by
> instructions.
>
> Internally, the Scheduler class delegates the management of processor
> resource
> units and resource groups to the ResourceManager class.  ResourceManager
> is also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a resource
> group,
> the ResourceManager object selects one of the available units from the
> group; by
> default, it uses a round-robin selector to guarantee that resource usage is
> uniformly distributed between all units of a group.
>
>
> To be a cross-subtarget tool, this needs to be a customization point. It’s
> not always so simple.
>

I agree. When I designed this logic, this point in particular was a bit
problematic. I have been thinking the same; it would be nice if scheduler
resource were able to specify how units are dynamically selected. We could
still have a default round_robin strategy in the absence of extra
information. This is definitely something that we should be looking into as
a future development. If you agree, at least to start, we could use the
roundrobin strategy.

> <snip>
>
>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
>
> The tool attempts to emulate out-of-order execution of memory operations.
> Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues
> for
> speculative execution of loads and stores.
>
> Each load (or store) consumes an entry in the load (or store) queue.  The
> number
> of slots in the load/store queues is unknown by the tool, since there is no
> mention of it in the scheduling model.  In practice, users can specify flag
> `-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the queue
> to be
> equal to exactly N (an unsigned value). If N is zero, then the tool
> assumes an
> unbounded queue (this is the default).
>
> LSUnit implements a relaxed consistency model for memory loads and stores.
> The
> rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) An younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the load
> does
>    not alias with the store.
>
> By default, this class conservatively (i.e. pessimistically) assumes that
> loads
> always may-alias store operations.  Essentially, this LSUnit doesn't
> perform any
> sort of alias analysis to rule out cases where loads and stores don't
> overlap
> with each other.  The downside of this approach however is that younger
> loads are
> never allowed to pass older stores.  To make it possible for a younger
> load to
> pass an older store, users can use the command line flag -noalias.  Under
> 'noalias', a younger load is always allowed to pass an older store.
>
>
> I’m surprised it isn’t optimistic by default. Being pessimistic hides
> other bottlenecks and seems less accurate. I guess consistency with other
> similar tools (IACA) should be the aim.
>

I am not sure about what IACA does in this case.
In case, it is easy to change the behavior optimistic.

>
> Note that, in the case of write-combining memory, rule 2. could be relaxed
> a bit
> to allow reordering of non-aliasing store operations.  That being said, at
> the
> moment, there is no way to further relax the memory model (flag -noalias
> is the
> only option).  Essentially, there is no option to specify a different
> memory
> type (for example: write-back, write-combining, write-through; etc.) and
> consequently to weaken or strengthen the memory model.
>
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory types.
>  * LSUnit doesn't know how to identify serializing operations and memory
> fences.
>
>
> Combining the static model with a sampled dynamic hardware event profile
> would be amazing.
>

> No assumption is made on the store buffer size.  As mentioned before,
> LSUnit
> conservatively assumes a may-alias relation between loads and stores, and
> it
> doesn't attempt to identify cases where store-to-load forwarding would
> occur in
> practice.
>
>
> The LSUnits have a buffer size of course. The question is whether you
> really need to separately model issuing the instructions to a separate
> memory pipeline via a shared queue.
>
> LSUnit doesn't attempt to predict whether a load or store hits or misses
> the L1
> cache.  It only knows if an instruction "MayLoad" and/or "MayStore".  For
> loads,
> the scheduling model provides an "optimistic" load-to-use latency (which
> usually
> matches the load-to-use latency for when there is a hit in the L1D).
>
>
> You’re optimistic here, which is good, but pessimistic with aliasing.
>
> Class MCInstrDesc in LLVM doesn't know about serializing operations, nor
> memory-barrier like instructions.  LSUnit conservatively assumes that an
> instruction which has both 'MayLoad' and 'UnmodeledSideEffects' behaves
> like a
> "soft" load-barrier.  That means, it serializes loads without forcing a
> flush of
> the load queue.  Similarly, instructions flagged with both 'MayStore' and
> 'UnmodeledSideEffects' are treated like store barriers.  A full memory
> barrier
> is a 'MayLoad' and 'MayStore' instruction with 'UnmodeledSideEffects'.
> This is
> inaccurate, but it is the best that we can do at the moment with the
> current
> information available in LLVM.
>
>
> LLVM *should* have this information. It needs some design work though.
>

I agree.

>
> A load/store barrier consumes one entry of the load/store queue.  A
> load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass a
> load
> barrier.  Also, a younger store cannot pass a store barrier.  A younger
> load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store queue(s).
> That
> also means, by construction, all the older loads/stores have been executed.
>
> In conclusion the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag 'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
>
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing information to
> give
> an accurate report.  For example, the first sections of this document
> explained
> how the lack of knowledge about the processor negatively affects the
> performance
> analysis.  The lack of knowledge is often a consequence of how scheduling
> models
> are defined; as mentioned before, scheduling models intentionally don't
> describe
> processors in fine details.
>
>
> LLVM’s machine model should be optionally extended to model whatever a
> static analysis tool needs. There’s some dynamic behavior that can’t be
> modeled statically—predictive structures, the state of the pipeline
> entering a  loop—but machine model precision shouldn’t be the limiting
> factor for your tool.
>
> The accuracy of the performance analysis is also affected by assumptions
> made by
> the processor model used by the tool.
>
> Most recent Intel and AMD processors implement dedicated
> LoopBuffer/OpCache in
> the hardware frontend to speedup the throughput in the presence of tight
> loops.
> The presence of these buffers complicates the decoding logic, and requires
> knowledge on the branch predictor too.  Class 'SchedMachineModel' in
> tablegen
> provides a field named 'LoopMicroOpBufferSize' which is used to describe
> loop
> buffers.  However, the purpose of that field is to enable loop unrolling of
> tight loops; essentially, it affects the cost model used by pass
> loop-unroll.
>
> By design, the tool only cares about the out-of-order portion of a
> processor,
> and consequently doesn't try to predict the frontend throughput.
> Processors may
> implement complex decoding schemes; statically predicting the frontend
> throughput is in general beyond the scope of this tool.  For the same
> reasons,
> this tool intentionally doesn't model branch prediction.  That being said,
> this
> tool could be definitely extended in future to also account for the
> hardware
> frontend when doing performance analysis.  This would inevitably require
> extra
> (extensive) processor knowledge related to all the available decoding
> paths in
> the hardware frontend.
>
>
> If loops are ever definitely limited by decoder or fetch throughput, it
> would be good to know that. You don’t need to model the predictive aspect
> of it, or “all the paths”.
>
> You might as well say you don’t need to model issue/dispatch width, or
> retirement logic, because it’s decoupled from OOO execution via instruction
> buffers. The point of this tool is to tell you that there is definitely a
> bottleneck in a particular area of the pipeline.
>
> Would it be more appropriate to say that a simple abstract model of
> decoding is too inaccurate for your particular subtarget so there was not
> enough benefit to implementing it?
>

Ideally, we don't want to exclude the possibility to analyze the frontend
performance. In future, the tool should be extended to account for the
frontend too. Even Simon Pilgrim suggested (in a private conversation) how
it would be very useful to have at least a few information about the
frontend too.
Clement mentioned in his recent post on this thread that his team at Google
implemented a similar tool, and their tool analyzes the frontend too. I am
going to reply to his post after this. But the bottom line is that it would
be great if people contribute the frontend analysis to this tool.

> When computing the IPC, the tool assumes a zero-latency "perfect"
> fetch&decode
> stage; the full sequence of decoded instructions is immediately visible to
> the
> dispatch logic from the start.
>
> The tool doesn't know about simultaneous mutithreading.  According to the
> tool,
> processor resources are not statically/dynamically partitioned.  Processor
> resources are fully available to the hardware thread executing the
> microbenchmark.
>
> The execution model implemented by this tool assumes that instructions are
> firstly dispatched in groups to hardware schedulers, and then issued to
> pipelines for execution.  The model assumes dynamic scheduling of
> instructions.
> Instructions are placed in a queue and potentially executed out-of-order
> (based
> on the operand availability). The dispatch stage is definitely distinct
> from the
> issue stage.
>
>
> I wonder why in-order analysis doesn’t mostly fall out as a degenerate
> case in your tool. There’s some grey area where in-order processors
> hardware interlocks that cause stalls that aren’t explicit in the
> scheduling groups. Those processors would still benefit from your tool.
> Even if that target doesn’t benefit from the OOO simulation, it would be
> nice to compare the output of your tool with the observed performance to
> find bugs in the model.
>
> This model doesn't correctly describe processors where the dispatch/issue
> is a
> single stage. This is what happens for example in VLIW processors, where
> instructions are packaged and statically scheduled at compile time; it is
> up to
> the compiler to predict the latency of instructions and package issue
> groups
> accordingly. For such targets, there is no dynamic scheduling done by the
> hardware.
>
> Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted
> to
> support processors with a single dispatch/issue stage. The execution flow
> would
> require some changes in the way how existing components (i.e.
> DispatchUnit,
> Scheduler, etc.) interact. This can be a future development.
>
>
> Ah…. Extended with future development sounds better.
>

Yeah. Sorry if there are so many TODOs.
This is also the reason why I think it makes sense to ask help to the
community because there is still so much work to do..

> The following sections describes other known limitations.  The goal is not
> to
> provide an extensive list of limitations; we want to report what we
> believe are
> the most important limitations, and suggest possible methods to overcome
> them.
>
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already mentioned
> how
> LLVM doesn't know about serializing operations and memory barriers.  Most
> of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't
> expose
> those properties.  Instead, both serializing operations and memory barriers
> "have side-effects" according to MCInstrDesc.  That is because, at least
> for
> scheduling purposes, knowing that an instruction has unmodeled side
> effects is
> often enough to treat the instruction like a compiler scheduling barrier.
>
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance report.
> One way
> to improve this is by reserving a couple of bits in field 'Flags' from
> class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
>
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations,
> since it
> affects a few out-of-order processors in LLVM.
>
>
> I don’t think OOO LLVM targets should be using itineraries. If those
> targets are still actively maintained, they could migrate.
>
> As mentioned in section 'Instruction Issue', class Scheduler delegates to
> an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
>
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction stages.
> This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
>
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect the control flow are: return, indirect
> branches, calls, etc.  The tool doesn't try to predict/evaluate branch
> targets.
> In particular, the tool doesn't model any sort of branch prediction, nor
> does it
> attempt to track changes to the program counter.  The tool always assumes
> that
> the input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in sequence is
> always the next instruction to dispatch.
>
> Call instructions default to an arbitrary high latency of 100cy. A warning
> is
> generated if the tool encounters a call instruction in the sequence.
> Return
> instructions are not evaluated, and therefore control flow is not affected.
> However, the tool still queries the processor scheduling model to obtain
> latency
> information for instructions that affect the control flow.
>
>
> By decompiling to MachineInst’s it would be easy to build a mostly
> complete CFG and call graph.
>
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't know about
> the
> register files, and temporaries available in each register file for
> register
> renaming purposes.
>
> The LLVM scheduling model could be extended to better describe register
> files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
>
> The scheduling model doesn't specify the retire throughput (i.e. how many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the retire
> control
> unit can retire every cycle.  Ideally, every processor should be able to
> specify
> the retire throughput (for example, by adding an extra field to the
> scheduling
> model tablegen class).
>
> Known limitations on X86 processors
> -----------------------------------
>
> 1) Partial register updates versus full register updates.
> <snip>
>
>
> MachineOperand handles this. You just need to create the machine instrs.
>

Interesting. I couldn't find how to do it. It would be great if somebody
helps me on this.

> 2) Macro Op fusion.
>
> The tool doesn't know about macro-op fusion. On modern x86 processors, a
> 'cmp/test' followed by a 'jmp' is fused into a single macro operation.  The
> advantage is that the fused pair only consumes a single slot in the
> dispatch
> group.
>
> As a future development, the tool should be extended to address
> macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the opcode
> pairs
> that can be fused together. That table could be exposed to the tool via the
> MCSubtargetInfo interface.  This is just an idea; there may be better ways
> to
> implement this.
>
>
> This is already expressed by subtarget code. That code just needs to be
> exposed as a common subtarget interface. Typically the interfaces take
> MachineInst, but in this case the opcode is probably sufficient.
>
> 4) Zero-latency register moves and Zero-idioms.
>
> Most modern AMD/Intel processors know how to optimize out register-register
> moves and zero idioms at register renaming stage. The tool doesn't know
> about these patterns, and this may negatively impact the performance
> analysis.
>
>
> The machine model has this, but requires proper MachineInstrs.
>
> Known design problems
> ---------------------
> This section describes two design issues that are currently affecting the
> tool.
> The long term plan is to "fix" these issues.
>
> 1) Variant instructions not correctly modeled.
>
> The tool doesn't know how to analyze instructions with a "variant"
> scheduling
> class descriptor. A variant scheduling class needs to be resolved
> dynamically.
> The "actual" scheduling class often depends on the subtarget, as well as
> properties of the specific MachineInstr object.
>
> Unfortunately, the tool manipulates MCInst, and it doesn't know anything
> about
> MachineInstr. As a consequence, the tool cannot use the existing machine
> subtarget hooks that are normally used to resolve the variant scheduling
> class.
> This is a major design issue which mostly affects ARM/AArch64 targets.  It
> mostly boils down to the fact that the existing scheduling framework was
> meant
> to work for MachineInstr.
>
>
> There are good reasons for the scheduler to work with MachineInstrs, and
> any static analysis tool should work with MachineInstrs for the same
> reasons...
>

I agree. If we fix this part, then both the issues described by this
sections would disappear.

> 2) MCInst and MCInstrDesc.
>
> Performance analysis tools require data dependency information to correctly
> predict the runtime performance of the code. This tool must always be able
> to
> obtain the set of implicit/explicit register defs/uses for every
> instruction of
> the input assembly sequence.
>
> In the first section of this document, it was mentioned how the tool takes
> as
> input an assembly sequence. That sequence is parsed into a MCInst sequence
> with
> the help of assembly parsers available from the targets.
>
> A MCInst is a very low-level instruction representation. The tool can
> inspect
> the MCOperand sequence of an MCInst to identify register operands. However,
> there is no way to tell register operands that are definitions from
> register
> operands that are uses.
>
> In LLVM, class MCInstrDesc is used to fully describe target instructions
> and
> their operands. The opcode of a machine instruction (a MachineInstr
> object) can
> be used to query the instruction set through method `MCInstrInfo::get' to
> obtain
> the associated MCInstrDesc object.
>
> However class MCInstrDesc describes properties and operands of MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe
> MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically
> generated
> via tablegen from the instruction set description in the target .td
> files.  For
> example, field `MCInstrDesc::NumDefs' is always equal to the cardinality
> of the
> `(outs)` set from the tablegen instruction definition.
>
> By construction, register definitions always appear at the beginning of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the machine
> operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
>
> In a normal compilation process, MCInst objects are generated from
> MachineInstr
> objects through a lowering step. By default the lowering logic simply
> iterates
> over the machine operands of a MachineInstr, and converts/expands them into
> equivalent MCOperand objects.
>
> The default lowering strategy has the advantage of preserving all of the
> above mentioned assumptions on the machine operand sequence. That means,
> register
> definitions would still be at the beginning of the MCOperand sequence, and
> register uses would come after.
>
> Targets may still define custom lowering routines for specific opcodes.
> Some of
> these routines may lower operands in a way that potentially breaks (some
> of) the
> assumptions on the machine operand sequence which were valid for
> MachineInstr.
> Luckily, this is not the most common form of lowering done by the targets,
> and
> the vast majority of the MachineInstr are lowered based on the default
> strategy
> which preserves the original machine operand sequence.  This is especially
> true
> for x86, where the custom lowering logic always preserves the original
> (i.e.
> from the MachineInstr) operand sequence.
>
> This tool currently works under the strong (and potentially incorrect)
> assumption that register def/uses in a MCInst can always be identified by
> querying the machine instruction descriptor for the opcode. This
> assumption made
> it possible to develop this tool and get good numbers at least for the
> processors available in the x86 backend.
>
> That being said, the analysis is still potentially incorrect for other
> targets.
> So we plan (with the help of the community) to find a proper mechanism to
> map
> when possible MCOperand indices back to MachineOperand indices of the
> equivalent
> MachineInstr.  This would be equivalent to describing changes made by the
> lowering step which affected the operand sequence. For example, we could
> have an
> index for every register MCOperand (or -1, if the operand didn't exist in
> the
> original MachineInstr). The mapping could look like this <0,1,3,2>.  Here,
> MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.
>
> This information could be automatically generated via tablegen for all the
> instructions whose custom lowering step breaks assumptions made by the
> tool on
> the register operand sequence (In general, these instructions should be the
> minority of a target's instruction set). Unfortunately, we don't have that
> information now.  As a consequence, we assume that the number of explicit
> register definitions is the same number specified in MCInstrDesc.  We also
> assume that register definitions always come first in the operand sequence.
>
> In conclusion: these are for now the strong assumptions made by the tool:
>   * The number of explicit and implicit register definitions in a MCInst
>     matches the number of explicit and implicit definitions specified by
> the
>     MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
>
> Note that some of the information accessible from the MCInstrDesc is always
> valid for MCInst. For example: implicit register defs, implicit register
> uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties still
> apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
>
>
> You just made a very strong argument for building the MachineInstrs before
> running mca. So I wonder why you didn’t do that.
>
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this link:
> https://reviews.llvm.org/D43951.
>
> The review covers two patches:
> A first (very small) patch that always enables the generation of processor
> resource names in the SubtargetEmitter. Currently, the processor resource
> names
> are only generated for debugging purposes, but are needed by the tool to
> generate user friendly reports, so we would like to always generate them.
> A second patch with the actual static analysis tool (in llvm/tools).
>
> Once these first two patches are committed, the plan is to keep working on
> the
> tool with the help of the community to address all of the limitations
> described
> by the previous sections, and find good solutions/fixes for the design
> issues
> described by section "Known design problems".
>
> We hope the community will find this tool useful like we have.
>
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who
> really
> helped me a lot by suggesting improvements and testing the tool.
>
> Thanks for your time.
> -Andrea
>
>
> There are a number of people on llvm-dev who can explain better than I how
> to decompile into MachineInstrs. I’m not totally opposed to checking in
> something that works with MCInstr, but this does run strongly contrary to
> the design of LLVM’s subtarget support.
>

That would be great! I would be very happy if somebody suggests how to do
it (or does it for me :-)).
Do you think the current design (modulo the changes suggested in the
review) would be acceptable to start?

Thanks again for your great feedback Andy!

-Andrea

>
> -Andy
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180302/d5b4497a/attachment-0001.html>