[llvm-dev] [RFC] llvm-mca: a static performance analysis tool

Philip Reames via llvm-dev llvm-dev at lists.llvm.org
Fri Mar 2 09:33:37 PST 2018


This looks absolutely awesome.  Thank you for sharing your work; I can 
see many ways this will be useful.

Philip


On 03/01/2018 09:22 AM, Andrea Di Biagio via llvm-dev wrote:
> Hi all,
>
> At Sony we developed an LLVM-based performance analysis tool named
> llvm-mca. We currently use it internally to statically measure the
> performance of code, and to help triage potential problems with target
> scheduling models.  We decided to post this RFC because we are
> interested in feedback from the community, and we also believe that
> other people might be interested in a tool like this.
>
> llvm-mca uses information that is already available in LLVM (e.g.
> scheduling models) to statically measure the performance of machine
> code on a specific CPU.  Performance is measured in terms of
> throughput as well as processor resource consumption.  The tool
> currently works for processors with an out-of-order backend for which
> there is a scheduling model available in LLVM.
>
> The main goal of this tool is not just to predict the performance of
> the code when run on the target, but also to help diagnose potential
> performance issues.
>
> Given an assembly code sequence, llvm-mca estimates the IPC
> (instructions per cycle), as well as hardware resource pressure. The
> analysis and reporting style were inspired by the IACA tool from
> Intel.
>
> Long data dependency chains, as well as poor usage of hardware
> resources, may lead to bottlenecks in the backend.  The tool can
> generate a detailed report which should help with identifying and
> analyzing sources of bottlenecks.
>
> Scheduling models are mostly used to compute instruction latencies, to
> obtain read-advance information, and to understand how processor
> resources are used by instructions.  By design, the quality of the
> performance analysis conducted by the tool is inevitably affected by
> the quality of the target scheduling models.
>
> However, scheduling models intentionally do not describe all processor
> details, since the goal is just to enable the scheduling of machine
> instructions during compilation. This means there are processor
> details which are not important for the purpose of scheduling
> instructions (and are therefore not described by the scheduling
> model), but are very important for this tool.
>
> A few examples of details that are missing in scheduling models are:
>  - Maximum number of instructions retired per cycle.
>  - Actual dispatch width (it often differs from the issue width).
>  - Number of temporary registers available for renaming.
>  - Number of read/write ports in the register file(s).
>  - Length of the load/store queue in the LSUnit.
>
> It is also very difficult to find a "good" abstract model to describe the
> behavior of out-of-order processors. So, we have to keep in mind that 
> all of
> these aspects are going to affect the quality of the static analysis 
> performed
> by the tool.
>
> An extensive list of known limitations is reported in one of the last 
> sections
> of this document. There is also a section related to design problems 
> which must
> be addressed (hopefully with the help of the community).  At the 
> moment, the
> tool has been mostly tested for x86 targets, but there are still several
> limitations, some of which could be overcome by integrating extra 
> information
> into the scheduling models.
>
> As mentioned before, this tool has been (and is still being) used
> internally at Sony to debug/triage issues in the btver2 scheduling
> model. We have also tested it on other targets to check how generic
> the tool is. In our experience, the tool makes it easy to identify
> simple mistakes like "wrong number of micro opcodes specified for an
> instruction", or "wrong set of hardware resources".  Some of these
> mistakes are quite common (sometimes even in mature models), and often
> difficult to catch.  Reports generated by this tool are simple to
> analyze, and contain enough details to help triage most performance
> problems.
>
> 1. How the tool works
> ---------------------
>
> The tool takes assembly code as input. Assembly code is parsed into a 
> sequence
> of MCInst with the help of the existing LLVM target assembly parsers. 
> The parsed
> sequence of MCInst is then analyzed by a 'Backend' module to generate a
> performance report.
>
> The Backend module internally emulates the execution of the machine code
> sequence in a loop of iterations (which by default is 70). At the end 
> of this
> process, the backend collects a number of statistics which are then 
> printed out
> in the form of a report.
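>
> To make the emulation concrete, below is a toy, self-contained sketch
> of this kind of cycle estimation.  It is NOT the llvm-mca code: it
> only models the dispatch width, a fixed 1cy dispatch-to-issue delay,
> and intra-iteration data dependencies, and it ignores resource
> conflicts and scheduler queue sizes (which, as the verbose statistics
> later show, actually dominate in the example below), so it
> underestimates the cycle count:
>
>     // Toy cycle estimator (illustrative only; hypothetical model).
>     #include <algorithm>
>     #include <cstdio>
>     #include <vector>
>
>     struct Op {
>       int Latency;
>       std::vector<int> Deps; // intra-iteration producers
>     };
>
>     int main() {
>       // The dot-product kernel used below: vmulps (latency 2),
>       // followed by two dependent vhaddps (latency 3 each).
>       const std::vector<Op> Kernel = {{2, {}}, {3, {0}}, {3, {1}}};
>       const unsigned Width = 2, Iterations = 300;
>
>       std::vector<int> ReadyAt; // result-ready cycle per dynamic op
>       int Cycle = 0;
>       unsigned Dispatched = 0;
>       for (unsigned I = 0; I < Iterations; ++I) {
>         for (unsigned J = 0; J < Kernel.size(); ++J) {
>           if (Dispatched == Width) { ++Cycle; Dispatched = 0; }
>           int Issue = Cycle + 1; // at least 1cy from dispatch to issue
>           for (int Dep : Kernel[J].Deps)
>             Issue = std::max(Issue, ReadyAt[I * Kernel.size() + Dep]);
>           ReadyAt.push_back(Issue + Kernel[J].Latency);
>           ++Dispatched;
>         }
>       }
>       int Total = *std::max_element(ReadyAt.begin(), ReadyAt.end());
>       std::printf("cycles: %d  IPC: %.2f\n", Total,
>                   double(Iterations * Kernel.size()) / Total);
>     }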
>
> Here is an example of a performance report generated by the tool for a
> dot-product of two packed float vectors of four elements. The
> analysis is conducted for target x86, CPU btver2:
>
> ///////////////////
>
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Resources:
> [0] - JALU0
> [1] - JALU1
> [2] - JDiv
> [3] - JFPM
> [4] - JFPU0
> [5] - JFPU1
> [6] - JLAGU
> [7] - JSAGU
> [8] - JSTC
> [9] - JVIMUL
>
>
> Resource pressure per iteration:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
>  -      -      -      -     2.00   1.00    -      -      -      -
>
> Resource pressure by instruction:
> [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
>  -      -      -      -      -     1.00    -      -      -      -     vmulps    %xmm0, %xmm1, %xmm2
>  -      -      -      -     1.00    -      -      -      -      -     vhaddps   %xmm2, %xmm2, %xmm3
>  -      -      -      -     1.00    -      -      -      -      -     vhaddps   %xmm3, %xmm3, %xmm4
>
>
> Instruction Info:
> [1]: #uOps
> [2]: Latency
> [3]: RThroughput
> [4]: MayLoad
> [5]: MayStore
> [6]: HasSideEffects
>
> [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
>  1      2     1.00                        vmulps    %xmm0, %xmm1, %xmm2
>  1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3
>  1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4
>
> ///////////////////
>
> According to this report, the dot-product kernel has been executed 300 
> times,
> for a total of 900 instructions dynamically executed.
>
> The report is structured in three main sections.  The first section
> collects a few performance numbers; the goal of this section is to
> give a very quick overview of the performance throughput. In this
> example, the two important performance indicators are a) the predicted
> total number of cycles, and b) the IPC.  IPC is probably the most
> important throughput indicator. A big delta between the Dispatch Width
> and the computed IPC is an indicator of potential performance issues.
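>
> As a quick sanity check on the numbers above: IPC = 900 instructions /
> 610 cycles ~= 1.48, i.e. roughly 610 / 300 ~= 2.03 cycles per
> iteration of the 3-instruction kernel.  With a dispatch width of 2,
> the backend could sustain at most an IPC of 2.0, so this kernel runs
> at about 74% of the dispatch limit.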
>
> The second section is the so-called "resource pressure view". This
> view reports the average number of resource cycles consumed in every
> iteration by instructions, for every processor resource unit available
> on the target. Information is structured in two tables. The first
> table reports the number of resource cycles spent on average in every
> iteration. The second table correlates the resource cycles to the
> machine instructions in the sequence. For example, in every iteration
> of the dot-product, instruction 'vmulps' always executes on resource
> unit [5] (JFPU1 - floating point pipeline #1), consuming an average of
> 1 resource cycle per iteration.  Note that on Jaguar, vector FP
> multiply can only be issued to pipeline JFPU1, while horizontal FP
> adds can only be issued to pipeline JFPU0.
>
> The third (and last) section of the report shows the latency and
> reciprocal throughput of every instruction in the sequence. That
> section also reports extra information related to the number of micro
> opcodes, and opcode properties (i.e. 'MayLoad', 'MayStore', and
> 'HasSideEffects').
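>
> As a reading aid: an RThroughput (reciprocal throughput) of 1.00 means
> that at most one instance of the instruction can start every cycle; a
> value of 0.50 would mean that two instances can start per cycle (for
> example, when two pipelines can execute that instruction).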
>
> The resource pressure view helps with identifying bottlenecks caused 
> by high
> usage of specific hardware resources.  Situations with resource 
> pressure mainly
> concentrated on a few resources should, in general, be avoided. Ideally,
> pressure should be uniformly distributed between multiple resources.
>
> Timeline View
> -------------
>
> A detailed report of each instruction's state transitions over time can be
> enabled using the command line flag '-timeline'.  This prints an extra 
> section
> in the report which contains the so-called "timeline view". Below is the
> timeline view for the dot-product example from the previous section.
>
> ///////////////
> Timeline view:
>                    012345
> Index    0123456789
>
> [0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2
> [0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3
> [0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2
> [1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3
> [1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4
>
> [2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2
> [2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3
> [2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4
>
>
> Average Wait times (based on the timeline view):
> [0]: Executions
> [1]: Average time spent waiting in a scheduler's queue
> [2]: Average time spent waiting in a scheduler's queue while ready
> [3]: Average time elapsed from WB until retire stage
>
>       [0]    [1]    [2]    [3]
> 0.     3     1.0    1.0    3.3      vmulps    %xmm0, %xmm1, %xmm2
> 1.     3     3.3    0.7    1.0      vhaddps    %xmm2, %xmm2, %xmm3
> 2.     3     5.7    0.0    0.0      vhaddps    %xmm3, %xmm3, %xmm4
> ///////////////
>
> The timeline view is very interesting because it shows how
> instructions change state during execution.  It also gives an idea of
> how the tool "sees" instructions executed on the target.
>
> The timeline view is structured in two tables.  The first table shows
> how instructions change state over time (measured in cycles); the
> second table (named "Average Wait times") reports useful timing
> statistics which should help diagnose performance bottlenecks caused
> by long data dependencies and sub-optimal usage of hardware resources.
>
> An instruction in the timeline view is identified by a pair of 
> indices, where
> the 'first' index identifies an iteration, and the 'second' index is 
> the actual
> instruction index (i.e. where it appears in the code sequence).
>
> Excluding the first and last column, the remaining columns are in 
> cycles.  Cycles
> are numbered sequentially starting from 0.  The following characters 
> are used to
> describe the state of an instruction:
>
>  D : Instruction dispatched.
>  e : Instruction executing.
>  E : Instruction executed.
>  R : Instruction retired.
>  = : Instruction already dispatched, waiting to be executed.
>  - : Instruction executed, waiting to be retired.
>
> Based on the timeline view from the example, we know that:
>   - Instruction [1, 0] was dispatched at cycle 1.
>   - Instruction [1, 0] started executing at cycle 2.
>   - Instruction [1, 0] reached the write back stage at cycle 4.
>   - Instruction [1, 0] was retired at cycle 10.
>
> Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to 
> wait in
> the scheduler's queue for the operands to become available. By the 
> time the
> vmulps is dispatched, operands are already available, and pipeline 
> JFPU1 is
> ready to serve another instruction.  So the instruction can be immediately
> issued on the JFPU1 pipeline. That is demonstrated by the fact that the
> instruction only spent 1cy in the scheduler's queue.
>
> There is a gap of 5 cycles between the write-back stage and the retire
> event. That is because instructions must retire in program order, so
> [1,0] has to wait for [0,2] to be retired first (i.e. it has to wait
> until cycle 10).
>
> In the dot-product example, all instructions are in a RAW (Read After 
> Write)
> dependency chain.  Register %xmm2 written by the vmulps is immediately 
> used by
> the first vhaddps, and register %xmm3 written by the first vhaddps is 
> used by
> the second vhaddps.  Long data dependencies negatively affect the ILP
> (Instruction Level Parallelism).
>
> In the dot-product example, there are also anti-dependencies
> introduced by instructions from different iterations.  However, those
> dependencies can be removed at the register renaming stage (at the
> cost of allocating register aliases, and therefore consuming temporary
> registers).
>
> Table "Average Wait times" helps diagnose performance issues that are 
> caused by
> the presence of long latency instructions and potentially long data 
> dependencies
> which may limit the ILP.  Note that the tool by default assumes at 
> least 1cy
> between the dispatch event and the issue event.
>
> When the performance is limited by data dependencies and/or long latency
> instructions, the number of cycles spent while in the "ready" state is 
> expected
> to be very small when compared with the total number of cycles spent 
> in the
> scheduler's queue.  So the difference between the two counters is a good
> indicator of how big of an impact data dependencies had on the 
> execution of
> instructions.  When performance is mostly limited by the lack of hardware
> resources, the delta between the two counters is small. However, the 
> number of
> cycles spent in the queue tends to be bigger (i.e. more than 1-3cy) 
> especially
> when compared with other low latency instructions.
>
> Extra statistics to further diagnose performance issues
> --------------------------------------------------------
>
> Flag '-verbose' enables extra statistics and performance counters for the
> dispatch logic, the reorder buffer, the retire control unit and the 
> register
> file.
>
> Below is an example of verbose output generated by the tool for the 
> dot-product
> example discussed in the previous sections.
>
> ///////////////////
> Iterations:     300
> Instructions:   900
> Total Cycles:   610
> Dispatch Width: 2
> IPC:            1.48
>
>
> Dynamic Dispatch Stall Cycles:
> RAT     - Register unavailable:                      0
> RCU     - Retire tokens unavailable:                 0
> SCHEDQ  - Scheduler full:                            272
> LQ      - Load queue full:                           0
> SQ      - Store queue full:                          0
> GROUP   - Static restrictions on the dispatch group: 0
>
>
> Register Alias Table:
> Total number of mappings created: 900
> Max number of mappings used:      35
>
>
> Dispatch Logic - number of cycles where we saw N instructions dispatched:
> [# dispatched], [# cycles]
>  0,              24  (3.9%)
>  1,              272  (44.6%)
>  2,              314  (51.5%)
>
>
> Schedulers - number of cycles where we saw N instructions issued:
> [# issued], [# cycles]
>  0,          7  (1.1%)
>  1,          306  (50.2%)
>  2,          297  (48.7%)
>
>
> Retire Control Unit - number of cycles where we saw N instructions 
> retired:
> [# retired], [# cycles]
>  0,           109  (17.9%)
>  1,           102  (16.7%)
>  2,           399  (65.4%)
>
>
> Scheduler's queue usage:
> JALU01,  0/20
> JFPU01,  18/18
> JLSAGU,  0/12
> ///////////////////
>
> Based on the verbose report, the backend was only able to dispatch two
> instructions 51.5% of the time.  The dispatch group was limited to one
> instruction 44.6% of the cycles, which corresponds to 272 cycles.
>
> If we look at section "Dynamic Dispatch Stall Cycles", we can see how 
> counter
> SCHEDQ reports 272 cycles.  Counter SCHEDQ is incremented every time 
> the dispatch
> logic is unable to dispatch a full group of two instructions because the
> scheduler's queue is full.
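>
> As an illustration of the bookkeeping involved (hypothetical names,
> not the actual llvm-mca sources), the dispatch logic conceptually does
> something like this every cycle:
>
>     // One SCHEDQ stall is recorded for a cycle in which the dispatch
>     // group cannot be completed because a scheduler queue is full.
>     struct DispatchStats { unsigned SchedQStallCycles = 0; };
>
>     template <typename InstQueue, typename Sched>
>     void dispatchCycle(InstQueue &Pending, Sched &S,
>                        unsigned DispatchWidth, DispatchStats &Stats) {
>       for (unsigned Slot = 0;
>            Slot < DispatchWidth && !Pending.empty(); ++Slot) {
>         if (S.isQueueFull(Pending.front())) {
>           ++Stats.SchedQStallCycles; // reported as SCHEDQ
>           return;                    // stop dispatching this cycle
>         }
>         S.dispatch(Pending.pop());
>       }
>     }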
>
> Section "Scheduler's queue usage" shows how the maximum number of 
> buffer entries
> (i.e. scheduler's queue entries) used at runtime for resource JFPU01 
> reached its
> maximum. Note that AMD Jaguar implements three schedulers:
>   * JALU01 - A scheduler for ALU instructions
>   * JLSAGU - A scheduler for address generation
>   * JFPU01 - A scheduler floating point operations.
>
> The dot-product is a kernel of three floating point instructions (a vector
> multiply followed by two horizontal adds).  That explains why only the 
> floating
> point scheduler appears to be used according to section "Scheduler's queue
> usage".
>
> A full scheduler's queue is either caused by data dependency chains, 
> or by a
> sub-optimal usage of hardware resources.  Sometimes, resource pressure 
> can be
> mitigated by rewriting the kernel using different instructions that 
> consume
> different scheduler resources.  Schedulers with a small queue are less 
> resilient
> to bottlenecks caused by the presence of long data dependencies.
>
> In this example, we can conclude that the IPC is mostly limited by data
> dependencies, and not by resource pressure.
>
> LLVM-MCA instruction flow
> -------------------------
>
> This section describes the instruction flow through the out-of-order 
> backend, as
> well as the functional units involved in the process.
>
> An instruction goes through a default sequence of stages:
>     - Dispatch (Instruction is dispatched to the schedulers).
>     - Issue (Instruction is issued to the processor pipelines).
>     - Write Back (Instruction is executed, and results are written back).
>     - Retire (Instruction is retired; writes are architecturally 
> committed).
>
> The tool only models the out-of-order portion of a processor. 
> Therefore, the
> instruction fetch and decode stages are not modeled. Performance 
> bottlenecks in
> the frontend are not diagnosed by this tool.  The tool assumes that 
> instructions
> have all been decoded and placed in a queue.  Also, the tool doesn't know
> anything about branch prediction.
>
> Instruction Dispatch
> --------------------
>
> During the Dispatch stage, instructions are picked in program order 
> from a queue
> of already decoded instructions, and dispatched in groups to the hardware
> schedulers.  The dispatch logic is implemented by class DispatchUnit 
> in file
> Dispatch.h.
>
> The size of a dispatch group depends on the availability of hardware 
> resources,
> and it cannot exceed the value of field 'DispatchWidth' in class 
> DispatchUnit.
> Note that field DispatchWidth defaults to the value of field 
> 'IssueWidth' from
> the scheduling model.
>
> Users can override the DispatchWidth value with flag "-dispatch=<N>" 
> (where 'N'
> is an unsigned quantity).
>
> An instruction can be dispatched if:
>  - The size of the dispatch group is smaller than DispatchWidth
>  - There are enough entries in the reorder buffer
>  - There are enough temporary registers to do register renaming
>  - Schedulers are not full.
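>
> In code form, the four conditions above could be condensed into a
> check like the following (hypothetical names and interfaces, for
> illustration only):
>
>     struct InstrDesc { unsigned NumMicroOps; unsigned NumWrites; };
>
>     bool canDispatch(const InstrDesc &D, unsigned GroupSize,
>                      unsigned DispatchWidth, unsigned FreeROBEntries,
>                      unsigned FreeRenameRegs, bool SchedulerQueueFull) {
>       return GroupSize < DispatchWidth &&         // room in the group
>              FreeROBEntries >= D.NumMicroOps &&   // reorder buffer
>              FreeRenameRegs >= D.NumWrites &&     // renaming registers
>              !SchedulerQueueFull;                 // scheduler has room
>     }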
>
> Scheduling models don't describe register files, so the tool doesn't
> know if there is more than one register file, or how many temporaries
> are available for register renaming.
>
> By default, the tool (optimistically) assumes a single register file 
> with an
> unbounded number of temporary registers.  Users can limit the number of
> temporary registers available for register renaming using flag
> `-register-file-size=<N>`, where N is the number of temporaries.  A 
> value of
> zero for N means 'unbounded'.  Knowing how many temporaries are 
> available for
> register renaming, the tool can predict dispatch stalls caused by the 
> lack of
> temporaries.
>
> The number of reorder buffer entries consumed by an instruction
> depends on the number of micro-opcodes specified for it in the target
> scheduling model (see field 'NumMicroOpcodes' of tablegen class
> ProcWriteResources and its derived classes; TargetSchedule.td).
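>
> For reference, this is roughly how the micro-opcode count can be
> queried from the MC layer (the MC-level field is named NumMicroOps;
> exact signatures may differ across LLVM versions, and this assumes the
> scheduling class is not a "variant", see "Known design problems"):
>
>     #include "llvm/MC/MCInstrInfo.h"
>     #include "llvm/MC/MCSchedule.h"
>     #include "llvm/MC/MCSubtargetInfo.h"
>     using namespace llvm;
>
>     unsigned getNumMicroOps(const MCSubtargetInfo &STI,
>                             const MCInstrInfo &MCII, unsigned Opcode) {
>       const MCSchedModel &SM = STI.getSchedModel();
>       unsigned SchedClassID = MCII.get(Opcode).getSchedClass();
>       const MCSchedClassDesc *SCDesc = SM.getSchedClassDesc(SchedClassID);
>       return SCDesc->NumMicroOps;
>     }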
>
> The reorder buffer is implemented by class RetireControlUnit (see 
> Dispatch.h).
> Its goal is to track the progress of instructions that are 
> "in-flight", and
> retire instructions in program order.  The number of entries in the 
> reorder
> buffer defaults to the value of field 'MicroOpBufferSize' from the target
> scheduling model.
>
> Instructions that are dispatched to the schedulers consume scheduler 
> buffer
> entries.  The tool queries the scheduling model to figure out the set of
> buffered resources consumed by an instruction.  Buffered resources are 
> treated
> like "scheduler" resources, and the field 'BufferSize' (from the processor
> resource tablegen definition) defines the size of the scheduler's queue.
>
> Zero latency instructions (for example NOP instructions) don't consume 
> scheduler
> resources.  However, those instructions still reserve a number of 
> slots in the
> reorder buffer.
>
> Instruction Issue
> -----------------
>
> As mentioned in the previous section, each scheduler resource
> implements a queue of instructions.  An instruction has to wait in the
> scheduler's queue until its input register operands become available.
> Only at that point does the instruction become eligible for execution
> and may be issued (potentially out-of-order) to a pipeline for
> execution.
>
> Instruction latencies can be computed by the tool with the help of the
> scheduling model; latency values are defined by the scheduling model 
> through
> ProcWriteResources objects.
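>
> Concretely, the write latencies of a scheduling class can be read from
> the model along these lines (simplified: variant classes and
> ReadAdvance adjustments are ignored, and negative 'Cycles' values,
> which LLVM uses for "unknown", would need special casing):
>
>     #include "llvm/MC/MCSchedule.h"
>     #include "llvm/MC/MCSubtargetInfo.h"
>     #include <algorithm>
>     using namespace llvm;
>
>     // Maximum latency among all register writes of a sched class.
>     int getMaxWriteLatency(const MCSubtargetInfo &STI,
>                            const MCSchedClassDesc &SCDesc) {
>       int Latency = 0;
>       for (unsigned I = 0, E = SCDesc.NumWriteLatencyEntries; I < E; ++I) {
>         const MCWriteLatencyEntry *WLE =
>             STI.getWriteLatencyEntry(&SCDesc, I);
>         Latency = std::max(Latency, int(WLE->Cycles));
>       }
>       return Latency;
>     }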
>
> Class Scheduler (see file Scheduler.h) knows how to emulate multiple
> processor schedulers.  A Scheduler is responsible for tracking data
> dependencies, and dynamically selecting which processor resources are
> consumed/used by instructions.
>
> Internally, the Scheduler class delegates the management of processor 
> resource
> units and resource groups to the ResourceManager class. 
> ResourceManager is also
> responsible for selecting resource units that are effectively consumed by
> instructions.  For example, if an instruction consumes 1cy of a 
> resource group,
> the ResourceManager object selects one of the available units from the 
> group; by
> default, it uses a round-robin selector to guarantee that resource 
> usage is
> uniformly distributed between all units of a group.
>
> Internally, class Scheduler implements three instruction queues:
>   - WaitQueue: a queue of instructions whose operands are not ready yet.
>   - ReadyQueue: a queue of instructions ready to execute.
>   - IssuedQueue: a queue of instructions executing.
>
> Depending on operand availability, instructions that are dispatched to
> the Scheduler are either placed into the WaitQueue or into the
> ReadyQueue.
>
> Every cycle, class Scheduler checks if instructions can be moved from the
> WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue 
> can be
> issued to the underlying pipelines.  The algorithm prioritizes older
> instructions over younger instructions.
>
> Objects of class ResourceState (see Scheduler.h) describe processor 
> resources.
> There is an instance of class ResourceState for each single processor 
> resource
> specified by the scheduling model.  A ResourceState object for a processor
> resource with multiple units dynamically tracks the availability of 
> every single
> unit.  For example, the ResourceState of a resource group tracks the
> availability of every resource in that group.  Internally, ResourceState
> implements a round-robin selector to dynamically pick the next unit to 
> use from
> the group.
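>
> A minimal sketch of such a round-robin pick over a bitmask of units
> (illustrative only; the real ResourceState tracks more state):
>
>     #include <cstdint>
>
>     // 'Available' has one bit set per free unit of the group;
>     // 'NextMask' remembers units not yet picked in the current round.
>     uint64_t selectUnit(uint64_t Available, uint64_t &NextMask) {
>       uint64_t Candidates = Available & NextMask;
>       if (!Candidates) {       // round exhausted: start a new one
>         NextMask = Available;
>         Candidates = Available;
>       }
>       uint64_t Picked = Candidates & (~Candidates + 1); // lowest set bit
>       NextMask &= ~Picked;     // don't pick it again this round
>       return Picked;           // 0 only if no unit is available
>     }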
>
> Write-Back and Retire Stage
> ---------------------------
>
> Issued instructions are moved from the ReadyQueue to the IssuedQueue.  
> There,
> instructions wait until they reach the write-back stage.  At that 
> point, they
> get removed from the queue and the retire control unit is notified.
>
> On the event of "instruction executed", the retire control unit flags the
> instruction as "ready to retire".
>
> Instructions are retired in program order; an "instruction retired"
> event is sent to the register file, which frees the temporary
> registers allocated for the instruction at the register renaming
> stage.
>
> Load/Store Unit and Memory Consistency Model
> --------------------------------------------
>
> The tool attempts to emulate out-of-order execution of memory 
> operations.  Class
> LSUnit (see file LSUnit.h) emulates a load/store unit implementing 
> queues for
> speculative execution of loads and stores.
>
> Each load (or store) consumes an entry in the load (or store) queue.
> The number of slots in the load/store queues is unknown to the tool,
> since there is no mention of it in the scheduling model.  In practice,
> users can specify flag `-lqueue=N` (similarly, `-squeue=N`) to limit
> the number of entries in the queue to exactly N (an unsigned value).
> If N is zero, then the tool assumes an unbounded queue (this is the
> default).
>
> LSUnit implements a relaxed consistency model for memory loads and
> stores. The rules are:
> 1) A younger load is allowed to pass an older load only if there is no
>    intervening store in between the two loads.
> 2) A younger store is not allowed to pass an older store.
> 3) A younger store is not allowed to pass an older load.
> 4) A younger load is allowed to pass an older store provided that the
>    load does not alias with the store.
>
> By default, this class conservatively (i.e. pessimistically) assumes
> that loads always may-alias store operations.  Essentially, the LSUnit
> doesn't perform any sort of alias analysis to rule out cases where
> loads and stores don't overlap with each other.  The downside of this
> approach, however, is that younger loads are never allowed to pass
> older stores.  To make it possible for a younger load to pass an older
> store, users can use the command line flag -noalias.  Under 'noalias',
> a younger load is always allowed to pass an older store.
>
> Note that, in the case of write-combining memory, rule 2 could be
> relaxed a bit to allow reordering of non-aliasing store operations.
> That being said, at the moment, there is no way to further relax the
> memory model (flag -noalias is the only option).  Essentially, there
> is no option to specify a different memory type (for example:
> write-back, write-combining, write-through, etc.) and consequently to
> weaken or strengthen the memory model.
>
> Other limitations are:
>  * LSUnit doesn't know when store-to-load forwarding may occur.
>  * LSUnit doesn't know anything about the cache hierarchy and memory 
> types.
>  * LSUnit doesn't know how to identify serializing operations and 
> memory fences.
>
> No assumption is made on the store buffer size.  As mentioned before, 
> LSUnit
> conservatively assumes a may-alias relation between loads and stores, 
> and it
> doesn't attempt to identify cases where store-to-load forwarding would 
> occur in
> practice.
>
> LSUnit doesn't attempt to predict whether a load or store hits or
> misses the L1 cache.  It only knows if an instruction "MayLoad" and/or
> "MayStore".  For loads, the scheduling model provides an "optimistic"
> load-to-use latency (which usually matches the load-to-use latency
> when there is a hit in the L1D).
>
> Class MCInstrDesc in LLVM doesn't know about serializing operations or
> memory-barrier-like instructions.  LSUnit conservatively assumes that
> an instruction which has both 'MayLoad' and 'UnmodeledSideEffects'
> behaves like a "soft" load-barrier.  That means it serializes loads
> without forcing a flush of the load queue.  Similarly, instructions
> flagged with both 'MayStore' and 'UnmodeledSideEffects' are treated
> like store barriers.  A full memory barrier is a 'MayLoad' and
> 'MayStore' instruction with 'UnmodeledSideEffects'.  This is
> inaccurate, but it is the best that we can do at the moment with the
> current information available in LLVM.
>
> A load/store barrier consumes one entry of the load/store queue.  A 
> load/store
> barrier enforces ordering of loads/stores.  A younger load cannot pass 
> a load
> barrier.  Also, a younger store cannot pass a store barrier.  A 
> younger load has
> to wait for the memory/load barrier to execute.  A load/store barrier is
> "executed" when it becomes the oldest entry in the load/store 
> queue(s). That
> also means, by construction, all the older loads/stores have been 
> executed.
>
> In conclusion, the full set of rules is:
>   1. A store may not pass a previous store.
>   2. A load may not pass a previous store unless flag 'NoAlias' is set.
>   3. A load may pass a previous load.
>   4. A store may not pass a previous load (regardless of flag 'NoAlias').
>   5. A load has to wait until an older load barrier is fully executed.
>   6. A store has to wait until an older store barrier is fully executed.
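>
> Condensed into code, the ordering query for rules 1-4 might look like
> this (hypothetical sketch; barriers, i.e. rules 5 and 6, are omitted
> for brevity):
>
>     // 'HasOlderLoads'/'HasOlderStores' refer to older memory
>     // operations that have not executed yet.
>     bool canExecute(bool IsStore, bool HasOlderLoads,
>                     bool HasOlderStores, bool AssumeNoAlias) {
>       if (IsStore)   // rules 1 and 4: stores pass neither kind
>         return !HasOlderStores && !HasOlderLoads;
>       // rule 3: loads may pass older loads; rule 2: loads may pass
>       // older stores only under -noalias.
>       return !HasOlderStores || AssumeNoAlias;
>     }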
>
> Known limitations
> -----------------
> Previous sections described cases where the tool is missing
> information needed to give an accurate report.  For example, the first
> sections of this document explained how the lack of knowledge about
> the processor negatively affects the performance analysis.  This lack
> of knowledge is often a consequence of how scheduling models are
> defined; as mentioned before, scheduling models intentionally don't
> describe processors in fine detail.
>
> The accuracy of the performance analysis is also affected by 
> assumptions made by
> the processor model used by the tool.
>
> Most recent Intel and AMD processors implement a dedicated
> LoopBuffer/OpCache in the hardware frontend to speed up the throughput
> in the presence of tight loops.  Modeling these buffers complicates
> the decoding logic, and requires knowledge of the branch predictor
> too.  Class 'SchedMachineModel' in tablegen provides a field named
> 'LoopMicroOpBufferSize' which is used to describe loop buffers.
> However, the purpose of that field is to enable loop unrolling of
> tight loops; essentially, it affects the cost model used by pass
> loop-unroll.
>
> By design, the tool only cares about the out-of-order portion of a
> processor, and consequently doesn't try to predict the frontend
> throughput.  Processors may implement complex decoding schemes;
> statically predicting the frontend throughput is in general beyond the
> scope of this tool.  For the same reasons, this tool intentionally
> doesn't model branch prediction.  That being said, this tool could
> certainly be extended in the future to also account for the hardware
> frontend when doing performance analysis.  This would inevitably
> require extra (extensive) processor knowledge related to all the
> available decoding paths in the hardware frontend.
>
> When computing the IPC, the tool assumes a zero-latency "perfect" 
> fetch&decode
> stage; the full sequence of decoded instructions is immediately 
> visible to the
> dispatch logic from the start.
>
> The tool doesn't know about simultaneous multithreading.  The tool
> assumes that processor resources are not statically/dynamically
> partitioned; they are fully available to the hardware thread executing
> the microbenchmark.
>
> The execution model implemented by this tool assumes that instructions
> are first dispatched in groups to hardware schedulers, and then issued
> to pipelines for execution.  The model assumes dynamic scheduling of
> instructions: instructions are placed in a queue and potentially
> executed out-of-order (based on operand availability).  The dispatch
> stage is distinct from the issue stage.
>
> This model doesn't correctly describe processors where dispatch/issue
> is a single stage.  This is the case, for example, in VLIW processors,
> where instructions are packaged and statically scheduled at compile
> time; it is up to the compiler to predict the latency of instructions
> and package issue groups accordingly.  For such targets, there is no
> dynamic scheduling done by the hardware.
>
> Existing classes (DispatchUnit, Scheduler, etc.) could be
> extended/adapted to support processors with a single dispatch/issue
> stage.  The execution flow would require some changes in the way
> existing components (i.e. DispatchUnit, Scheduler, etc.) interact.
> This could be a future development.
>
> The following sections describe other known limitations.  The goal is
> not to provide an exhaustive list; we want to report what we believe
> are the most important limitations, and suggest possible methods to
> overcome them.
>
> Load/Store barrier instructions and serializing operations
> ----------------------------------------------------------
> Section "Load/Store Unit and Memory Consistency Model" already 
> mentioned how
> LLVM doesn't know about serializing operations and memory barriers.  
> Most of it
> boils down to the fact that class MCInstrDesc (intentionally) doesn't 
> expose
> those properties.  Instead, both serializing operations and memory 
> barriers
> "have side-effects" according to MCInstrDesc.  That is because, at 
> least for
> scheduling purposes, knowing that an instruction has unmodeled side 
> effects is
> often enough to treat the instruction like a compiler scheduling barrier.
>
> A performance analysis tool could use the extra knowledge on barriers and
> serializing operations to generate a more accurate performance 
> report.  One way
> to improve this is by reserving a couple of bits in field 'Flags' from 
> class
> MCInstrDesc: one bit for barrier operations, and another bit to mark
> instructions as serializing operations.
>
> Lack of support for instruction itineraries
> -------------------------------------------
> The current version of the tool doesn't know how to process instruction
> itineraries.  This is probably one of the most important limitations, 
> since it
> affects a few out-of-order processors in LLVM.
>
> As mentioned in section 'Instruction Issue', class Scheduler delegates 
> to an
> instance of class ResourceManager the handling of processor resources.
> ResourceManager is where most of the scheduling logic is implemented.
>
> Adding support for instruction itineraries requires that we teach
> ResourceManager how to handle functional units and instruction 
> stages.  This
> development can be a future extension, and it would probably require a few
> changes to the ResourceManager interface.
>
> Instructions that affect control flow are not correctly modeled
> ---------------------------------------------------------------
> Examples of instructions that affect control flow are: returns,
> indirect branches, calls, etc.  The tool doesn't try to
> predict/evaluate branch targets.  In particular, the tool doesn't
> model any sort of branch prediction, nor does it attempt to track
> changes to the program counter.  The tool always assumes that the
> input assembly sequence is the body of a microbenchmark (a simple loop
> executed for a number of iterations). The "next" instruction in the
> sequence is always the next instruction to dispatch.
>
> Call instructions default to an arbitrarily high latency of 100cy. A
> warning is generated if the tool encounters a call instruction in the
> sequence.  Return instructions are not evaluated, and therefore
> control flow is not affected.  However, the tool still queries the
> processor scheduling model to obtain latency information for
> instructions that affect control flow.
>
> Possible extensions to the scheduling model
> -------------------------------------------
> Section "Instruction Dispatch" explained how the tool doesn't know 
> about the
> register files, and temporaries available in each register file for 
> register
> renaming purposes.
>
> The LLVM scheduling model could be extended to better describe 
> register files.
> Ideally, scheduling model should be able to define:
>  - The size of each register file
>  - How many temporary registers are available for register renaming
>  - How register classes map to register files
>
> The scheduling model doesn't specify the retire throughput (i.e. how many
> instructions can be retired every cycle).  Users can specify flag
> `-max-retire-per-cycle=<uint>` to limit how many instructions the 
> retire control
> unit can retire every cycle.  Ideally, every processor should be able 
> to specify
> the retire throughput (for example, by adding an extra field to the 
> scheduling
> model tablegen class).
>
> Known limitations on X86 processors
> -----------------------------------
>
> 1) Partial register updates versus full register updates.
>
> On x86-64, a 32-bit GPR write fully updates the super-register. Example:
>       add %edi, %eax    ## eax += edi
>
> Here, register %eax aliases the lower half of 64-bit register %rax. On 
> x86-64,
> register %rax is fully updated by the 'add' (the upper half of %rax is 
> zeroed).
> Essentially, it "kills" any previous definition of (the upper half of) 
> register
> %rax.
>
> On the other hand, 8/16 bit register writes only perform a so-called 
> "partial
> register update". Example:
>       add %di, %ax     ## ax += di
>
> Here, register %eax is only partially updated. To be more specific, 
> the lower
> half of %eax is set, and the upper half is left unchanged. There is 
> also no
> change in the upper 48 bits of register %rax.
>
> To get accurate performance analysis, the tool has to know which 
> instructions
> perform a partial register update, and which instructions fully update the
> destination's super-register.
>
> One way to expose this information is (again) via tablegen.  For 
> example, we
> could add a flag in the tablegen instruction class to tag instructions 
> that
> perform partial register updates. Something like this: 'bit
> hasPartialRegisterUpdate = 1'. However, this would force a `let
> hasPartialRegisterUpdate = 0` on several instruction definitions.
>
> Another approach is to have an MCSubtargetInfo hook similar to this:
>     virtual bool updatesSuperRegisters(unsigned short opcode) { return false; }
>
> Targets would be able to override this method if needed.  Again, this
> is just an idea; the plan is to have this fixed as a future
> development.
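>
> For example (purely hypothetical, building on the hook above), a
> target could then override it like this:
>
>     // Not existing code: a sketch of a possible X86 override.
>     bool X86MCSubtargetInfo::updatesSuperRegisters(unsigned short Opcode) {
>       switch (Opcode) {
>       default:
>         return false;
>       // 32-bit GPR writes zero the upper half of the super-register.
>       case X86::ADD32rr:
>       case X86::MOV32rr:
>         return true;
>       }
>     }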
>
> 2) Macro Op fusion.
>
> The tool doesn't know about macro-op fusion. On modern x86 processors,
> a 'cmp/test' followed by a conditional branch is fused into a single
> macro operation.  The advantage is that the fused pair only consumes a
> single slot in the dispatch group.
>
> As a future development, the tool should be extended to address 
> macro-fusion.
> Ideally, we could have LLVM generate a table enumerating all the 
> opcode pairs
> that can be fused together. That table could be exposed to the tool 
> via the
> MCSubtargetInfo interface.  This is just an idea; there may be better 
> ways to
> implement this.
>
> 3) Intel processors: mixing legacy SSE with AVX instructions.
>
> On modern Intel processors with AVX, mixing legacy SSE code with AVX
> code negatively impacts performance.  The tool is not aware of this
> issue, and the performance penalty is not accounted for in the
> analysis. This is something that we would like to improve in the
> future.
>
> 4) Zero-latency register moves and Zero-idioms.
>
> Most modern AMD/Intel processors know how to optimize out
> register-register moves and zero idioms at the register renaming
> stage. The tool doesn't know about these patterns, and this may
> negatively impact the performance analysis.
>
> Known design problems
> ---------------------
> This section describes two design issues that currently affect the
> tool.  The long-term plan is to "fix" these issues.
>
> 1) Variant instructions not correctly modeled.
>
> The tool doesn't know how to analyze instructions with a "variant" 
> scheduling
> class descriptor. A variant scheduling class needs to be resolved 
> dynamically.
> The "actual" scheduling class often depends on the subtarget, as well as
> properties of the specific MachineInstr object.
>
> Unfortunately, the tool manipulates MCInst, and it doesn't know
> anything about MachineInstr. As a consequence, the tool cannot use the
> existing machine subtarget hooks that are normally used to resolve the
> variant scheduling class.  This is a major design issue which mostly
> affects ARM/AArch64 targets.  It mostly boils down to the fact that
> the existing scheduling framework was meant to work with MachineInstr.
>
> When the tool encounters a "variant" instruction, it assumes a generic
> 1cy latency. However, the tool cannot tell which processor resources
> are effectively consumed by the variant instruction.
>
> 2) MCInst and MCInstrDesc.
>
> Performance analysis tools require data dependency information to 
> correctly
> predict the runtime performance of the code. This tool must always be 
> able to
> obtain the set of implicit/explicit register defs/uses for every 
> instruction of
> the input assembly sequence.
>
> The first section of this document mentioned how the tool takes an
> assembly sequence as input. That sequence is parsed into an MCInst
> sequence with the help of assembly parsers available from the targets.
>
> An MCInst is a very low-level instruction representation. The tool can
> inspect the MCOperand sequence of an MCInst to identify register
> operands.  However, there is no way to distinguish register operands
> that are definitions from register operands that are uses.
>
> In LLVM, class MCInstrDesc is used to fully describe target 
> instructions and
> their operands. The opcode of a machine instruction (a MachineInstr 
> object) can
> be used to query the instruction set through method `MCInstrInfo::get' 
> to obtain
> the associated MCInstrDesc object.
>
> However class MCInstrDesc describes properties and operands of 
> MachineInstr
> objects. Essentially, MCInstrDesc is not meant to be used to describe 
> MCInst
> objects.  To be more specific, MCInstrDesc objects are automatically 
> generated
> via tablegen from the instruction set description in the target .td 
> files.  For
> example, field `MCInstrDesc::NumDefs' is always equal to the 
> cardinality of the
> `(outs)` set from the tablegen instruction definition.
>
> By construction, register definitions always appear at the beginning 
> of the
> MachineOperands list in MachineInstr. Basically, the (outs) are the first
> operands of a MachineInstr, and the (ins) will come after in the 
> machine operand
> list. Knowing the number of register definitions is enough to identify
> all the register operands that are definitions.
>
> In a normal compilation process, MCInst objects are generated from 
> MachineInstr
> objects through a lowering step. By default the lowering logic simply 
> iterates
> over the machine operands of a MachineInstr, and converts/expands them 
> into
> equivalent MCOperand objects.
>
> The default lowering strategy has the advantage of preserving all of
> the above-mentioned assumptions on the machine operand sequence. That
> means register definitions would still be at the beginning of the
> MCOperand sequence, and register uses would come after.
>
> Targets may still define custom lowering routines for specific
> opcodes. Some of these routines may lower operands in a way that
> potentially breaks (some of) the assumptions on the machine operand
> sequence which were valid for the MachineInstr.  Luckily, this is not
> the most common form of lowering done by the targets, and the vast
> majority of MachineInstrs are lowered using the default strategy which
> preserves the original machine operand sequence.  This is especially
> true for x86, where the custom lowering logic always preserves the
> original (i.e. from the MachineInstr) operand sequence.
>
> This tool currently works under the strong (and potentially incorrect)
> assumption that register defs/uses in an MCInst can always be
> identified by querying the machine instruction descriptor for the
> opcode. This assumption made it possible to develop this tool and get
> good numbers at least for the processors available in the x86 backend.
>
> That being said, the analysis is still potentially incorrect for other
> targets.  So we plan (with the help of the community) to find a proper
> mechanism to map, when possible, MCOperand indices back to
> MachineOperand indices of the equivalent MachineInstr.  This would be
> equivalent to describing the changes made by the lowering step which
> affected the operand sequence. For example, we could have an index for
> every register MCOperand (or -1, if the operand didn't exist in the
> original MachineInstr). The mapping could look like this: <0,1,3,2>.
> Here, MCOperand #2 was obtained from the lowering of MachineOperand
> #3, etc.
>
> This information could be automatically generated via tablegen for all
> the instructions whose custom lowering step breaks the assumptions
> made by the tool on the register operand sequence (in general, these
> instructions should be a minority of a target's instruction set).
> Unfortunately, we don't have that information now.  As a consequence,
> we assume that the number of explicit register definitions matches the
> number specified in MCInstrDesc.  We also assume that register
> definitions always come first in the operand sequence.
>
> In conclusion, these are the strong assumptions currently made by the
> tool:
>   * The number of explicit and implicit register definitions in an
>     MCInst matches the number of explicit and implicit definitions
>     specified by the MCInstrDesc object.
>   * Register uses always come after register definitions.
>   * If an opcode specifies an optional definition, then the optional
>     definition is always the last register operand in the sequence.
>
> Note that some of the information accessible from the MCInstrDesc is 
> always
> valid for MCInst. For example: implicit register defs, implicit 
> register uses
> and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties still 
> apply to
> MCInst. The tool knows about this, and uses that information during its
> analysis.
>
> What to do next
> ---------------
> The source code has been uploaded for review on phabricator at this 
> link: https://reviews.llvm.org/D43951.
>
> The review covers two patches:
>  - A first (very small) patch that always enables the generation of
>    processor resource names in the SubtargetEmitter. Currently, the
>    processor resource names are only generated for debugging purposes,
>    but they are needed by the tool to generate user-friendly reports,
>    so we would like to always generate them.
>  - A second patch with the actual static analysis tool (in
>    llvm/tools).
>
> Once these first two patches are committed, the plan is to keep
> working on the tool with the help of the community to address all of
> the limitations described in the previous sections, and to find good
> solutions/fixes for the design issues described in section "Known
> design problems".
>
> We hope the community will find this tool as useful as we have.
>
> Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell,
> who really helped me a lot by suggesting improvements and testing the
> tool.
>
> Thanks for your time.
> -Andrea
