<div dir="ltr">Hi all,<br><br>At Sony we developed an LLVM based performance analysis tool named llvm-mca. We<br>currently use it internally to statically measure the performance of code, and<br>to help triage potential problems with target scheduling models.  We decided to<br>post this RFC because we are interested in the feedback from the community, and<br>we also believe that other people might be interested in a tool like this.<br><br>llvm-mca uses information which is already available in LLVM (e.g. scheduling<br>models) to statically measure the performance of machine code in a specific cpu.<br>Performance is measured in terms of throughput as well as processor resource<br>consumption.  The tool currently works for processors with an out-of-order<br>backend, for which there is a scheduling model available in LLVM.<br><br>The main goal of this tool is not just to predict the performance of the code<br>when run on the target, but also help with diagnosing potential performance<br>issues.<br><br>Given an assembly code sequence, llvm-mca estimates the IPC (instructions per<br>cycle), as well as hardware resources pressure. The analysis and reporting style<br>were inspired by the IACA tool from Intel.<br><br>The presence of long data dependency chains, as well as poor usage of hardware<br>resources may lead to bottlenecks in the back-end.  The tool is able to generate<br>a detailed report which should help with identifying and analyzing sources of<br>bottlenecks.<br><br>Scheduling models are mostly used to compute instruction latencies, to obtain<br>read-advance information, and understand how processor resources are used by<br>instructions.  By design, the quality of the performance analysis conducted by<br>the tool is inevitably affected by the quality of the target scheduling models. <br><br>However, scheduling models intentionally do not describe all processors details,<br>since the goal is just to enable the scheduling of machine instructions during<br>compilation. That means, there are processor details which are not important for<br>the purpose of scheduling instructions (and therefore not described by the<br>scheduling model), but are very important for this tool.<br><br>A few examples of details that are missing in scheduling models are:<br> - Maximum number of instructions retired per cycle.<br> - Actual dispatch width (it often differs from the issue width).<br> - Number of temporary registers available for renaming.<br> - Number of read/write ports in the register file(s).<br> - Length of the load/store queue in the LSUnit.<br><br>It is also very difficult to find a "good" abstract model to describe the<br>behavior of out-of-order processors. So, we have to keep in mind that all of<br>these aspects are going to affect the quality of the static analysis performed<br>by the tool.<br><br>An extensive list of known limitations is reported in one of the last sections<br>of this document. There is also a section related to design problems which must<br>be addressed (hopefully with the help of the community).  At the moment, the<br>tool has been mostly tested for x86 targets, but there are still several<br>limitations, some of which could be overcome by integrating extra information<br>into the scheduling models.<br><br>As was mentioned before, this tool has been (and is still being) used internally<br>in Sony to debug/triage issues in the btver2 scheduling model. We have also<br>tested it on other targets to check how generic the tool is. In our experience,<br>the tool makes it easy to identify simple mistakes like "wrong number of micro<br>opcodes specified for an instruction", or "wrong set of hardware resources".<br>Some of these mistakes are quite common (sometimes on mature models too), and<br>often difficult to catch.  Reports generated by this tool are simple to analyze,<br>and contain enough details to help triage most performance problems.<br><br>1. How the tool works<br>---------------------<br><br>The tool takes assembly code as input. Assembly code is parsed into a sequence<br>of MCInst with the help of the existing LLVM target assembly parsers. The parsed<br>sequence of MCInst is then analyzed by a 'Backend' module to generate a<br>performance report.<br><br>The Backend module internally emulates the execution of the machine code<br>sequence in a loop of iterations (which by default is 70). At the end of this<br>process, the backend collects a number of statistics which are then printed out<br>in the form of a report.<br><br>Here is an example of performance report generated by the tool for a dot-product<br>of two packed float vectors of four elements. The analysis is conducted for<br>target x86, cpu btver2:<br><br>///////////////////<br><br>Iterations:     300<br>Instructions:   900<br>Total Cycles:   610<br>Dispatch Width: 2<br>IPC:            1.48<br><br><br>Resources:<br>[0] - JALU0<br>[1] - JALU1<br>[2] - JDiv<br>[3] - JFPM<br>[4] - JFPU0<br>[5] - JFPU1<br>[6] - JLAGU<br>[7] - JSAGU<br>[8] - JSTC<br>[9] - JVIMUL<br><br><br>Resource pressure per iteration:<br>[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    <br> -      -      -      -     2.00   1.00    -      -      -      -     <br><br>Resource pressure by instruction:<br>[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]        Instructions:<br> -      -      -      -      -     1.00    -      -      -      -         vmulps    %xmm0, %xmm1, %xmm2<br> -      -      -      -     1.00    -      -      -      -      -         vhaddps    %xmm2, %xmm2, %xmm3<br> -      -      -      -     1.00    -      -      -      -      -         vhaddps    %xmm3, %xmm3, %xmm4<br><br><br>Instruction Info:<br>[1]: #uOps<br>[2]: Latency<br>[3]: RThroughput<br>[4]: MayLoad<br>[5]: MayStore<br>[6]: HasSideEffects<br><br>[1]    [2]    [3]    [4]    [5]    [6]    Instructions:<br> 1      2     1.00                        vmulps    %xmm0, %xmm1, %xmm2<br> 1      3     1.00                        vhaddps    %xmm2, %xmm2, %xmm3<br> 1      3     1.00                        vhaddps    %xmm3, %xmm3, %xmm4<br><br>///////////////////<br><br>According to this report, the dot-product kernel has been executed 300 times,<br>for a total of 900 instructions dynamically executed.<br><br>The report is structured in three main sections.  A first section collects a few<br>performance numbers; the goal of this section is to give a very quick overview<br>of the performance throughput. In this example, the two important perforamce<br>indicators are a) the predicted total number of cycles, and b) the IPC. <br>IPC is probably the most important throughput indicator. A big delta between the<br>Dispatch Width and the computed IPC is an indicator of potential performance<br>issues.<br><br>The second section is the so-called "resource pressure view".  This view reports<br>the average number of resource cycles consumed every iteration by instructions<br>for every processor resource unit available on the target.  Information is<br>structured in two tables. The first table reports the number of resource cycles<br>spent on average every iteration. The second table correlates the resource<br>cycles to the machine instruction in the sequence. For example, every iteration<br>of the dot-product, instruction 'vmulps' always executes on resource unit [5]<br>(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle<br>per iteration.  Note that on Jaguar, vector FP multiply can only be issued to<br>pipeline JFPU1, while horizontal FP adds can only be issued to pipeline JFPU0.<br><br>The third (and last) section of the report shows the latency and reciprocal<br>throughput of every instruction in the sequence. That section also reports extra<br>information related to the number of micro opcodes, and opcode properties (i.e.<br>'MayLoad', 'MayStore' and 'UnmodeledSideEffects').<br><br>The resource pressure view helps with identifying bottlenecks caused by high<br>usage of specific hardware resources.  Situations with resource pressure mainly<br>concentrated on a few resources should, in general, be avoided.  Ideally,<br>pressure should be uniformly distributed between multiple resources.<br><br>Timeline View<br>-------------<br><br>A detailed report of each instruction's state transitions over time can be<br>enabled using the command line flag '-timeline'.  This prints an extra section<br>in the report which contains the so-called "timeline view".  Below is the<br>timeline view for the dot-product example from the previous section.<br><br>///////////////<br>Timeline view:<br>                   012345<br>Index    0123456789      <br><br>[0,0]    DeeER.    .    .    vmulps    %xmm0, %xmm1, %xmm2<br>[0,1]    D==eeeER  .    .    vhaddps    %xmm2, %xmm2, %xmm3<br>[0,2]    .D====eeeER    .    vhaddps    %xmm3, %xmm3, %xmm4<br><br>[1,0]    .DeeE-----R    .    vmulps    %xmm0, %xmm1, %xmm2<br>[1,1]    . D=eeeE---R   .    vhaddps    %xmm2, %xmm2, %xmm3<br>[1,2]    . D====eeeER   .    vhaddps    %xmm3, %xmm3, %xmm4<br><br>[2,0]    .  DeeE-----R  .    vmulps    %xmm0, %xmm1, %xmm2<br>[2,1]    .  D====eeeER  .    vhaddps    %xmm2, %xmm2, %xmm3<br>[2,2]    .   D======eeeER    vhaddps    %xmm3, %xmm3, %xmm4<br><br><br>Average Wait times (based on the timeline view):<br>[0]: Executions<br>[1]: Average time spent waiting in a scheduler's queue<br>[2]: Average time spent waiting in a scheduler's queue while ready<br>[3]: Average time elapsed from WB until retire stage<br><br>      [0]    [1]    [2]    [3]<br>0.     3     1.0    1.0    3.3      vmulps    %xmm0, %xmm1, %xmm2<br>1.     3     3.3    0.7    1.0      vhaddps    %xmm2, %xmm2, %xmm3<br>2.     3     5.7    0.0    0.0      vhaddps    %xmm3, %xmm3, %xmm4<br>///////////////<br><br>The timeline view is very interesting because it shows how instructions changed<br>in state during execution.  It also gives an idea of how the tool "sees"<br>instructions executed on the target.<br><br>The timeline view is structured in two tables.  The first table shows how<br>instructions change in state over time (measured in cycles); the second table<br>(named "Average Wait times") reports useful timing statistics which should help<br>diagnose performance bottlenecks caused by long data dependencies and sub-optimal<br>usage of hardware resources.<br><br>An instruction in the timeline view is identified by a pair of indices, where<br>the 'first' index identifies an iteration, and the 'second' index is the actual<br>instruction index (i.e. where it appears in the code sequence).<br><br>Excluding the first and last column, the remaining columns are in cycles.  Cycles<br>are numbered sequentially starting from 0.  The following characters are used to<br>describe the state of an instruction:<br><br> D : Instruction dispatched.<br> e : Instruction executing.<br> E : Instruction executed.<br> R : Instruction retired.<br> = : Instruction already dispatched, waiting to be executed.<br> - : Instruction executed, waiting to be retired.<br><br>Based on the timeline view from the example, we know that:<br>  - Instruction [1, 0] was dispatched at cycle 1.<br>  - Instruction [1, 0] started executing at cycle 2.<br>  - Instruction [1, 0] reached the write back stage at cycle 4.<br>  - Instruction [1, 0] was retired at cycle 10.<br><br>Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to wait in<br>the scheduler's queue for the operands to become available. By the time the<br>vmulps is dispatched, operands are already available, and pipeline JFPU1 is<br>ready to serve another instruction.  So the instruction can be immediately<br>issued on the JFPU1 pipeline. That is demonstrated by the fact that the<br>instruction only spent 1cy in the scheduler's queue.<br><br>There is a gap of 5 cycles between the write-back stage and the retire event.<br>That is because instructions must retire in program order, so [1,0] has to wait<br>for [0, 2] to be retired first (i.e it has to wait unti cycle 10).<br><br>In the dot-product example, all instructions are in a RAW (Read After Write)<br>dependency chain.  Register %xmm2 written by the vmulps is immediately used by<br>the first vhaddps, and register %xmm3 written by the first vhaddps is used by<br>the second vhaddps.  Long data dependencies negatively affect the ILP<br>(Instruction Level Parallelism).<br><br>In the dot-product example, there are anti-dependencies introduced by<br>instructions from different iterations.  However, those dependencies can be<br>removed at register renaming stage (at the cost of allocating register aliases,<br>and therefore consuming temporary registers).<br><br>Table "Average Wait times" helps diagnose performance issues that are caused by<br>the presence of long latency instructions and potentially long data dependencies<br>which may limit the ILP.  Note that the tool by default assumes at least 1cy<br>between the dispatch event and the issue event.<br><br>When the performance is limited by data dependencies and/or long latency<br>instructions, the number of cycles spent while in the "ready" state is expected<br>to be very small when compared with the total number of cycles spent in the<br>scheduler's queue.  So the difference between the two counters is a good<br>indicator of how big of an impact data dependencies had on the execution of<br>instructions.  When performance is mostly limited by the lack of hardware<br>resources, the delta between the two counters is small.  However, the number of<br>cycles spent in the queue tends to be bigger (i.e. more than 1-3cy) especially<br>when compared with other low latency instructions.<br><br>Extra statistics to further diagnose performance issues.<br>--------------------------------------------------------<br><br>Flag '-verbose' enables extra statistics and performance counters for the<br>dispatch logic, the reorder buffer, the retire control unit and the register<br>file.<br><br>Below is an example of verbose output generated by the tool for the dot-product<br>example discussed in the previous sections.<br><br>///////////////////<br>Iterations:     300<br>Instructions:   900<br>Total Cycles:   610<br>Dispatch Width: 2<br>IPC:            1.48<br><br><br>Dynamic Dispatch Stall Cycles:<br>RAT     - Register unavailable:                      0<br>RCU     - Retire tokens unavailable:                 0<br>SCHEDQ  - Scheduler full:                            272<br>LQ      - Load queue full:                           0<br>SQ      - Store queue full:                          0<br>GROUP   - Static restrictions on the dispatch group: 0<br><br><br>Register Alias Table:<br>Total number of mappings created: 900<br>Max number of mappings used:      35<br><br><br>Dispatch Logic - number of cycles where we saw N instructions dispatched:<br>[# dispatched], [# cycles]<br> 0,              24  (3.9%)<br> 1,              272  (44.6%)<br> 2,              314  (51.5%)<br><br><br>Schedulers - number of cycles where we saw N instructions issued:<br>[# issued], [# cycles]<br> 0,          7  (1.1%)<br> 1,          306  (50.2%)<br> 2,          297  (48.7%)<br><br><br>Retire Control Unit - number of cycles where we saw N instructions retired:<br>[# retired], [# cycles]<br> 0,           109  (17.9%)<br> 1,           102  (16.7%)<br> 2,           399  (65.4%)<br><br><br>Scheduler's queue usage:<br>JALU01,  0/20<br>JFPU01,  18/18<br>JLSAGU,  0/12<br>///////////////////<br><br>Based on the verbose report, the backend was only able to dispatch two<br>instructions 51.5% of the time.  The dispatch group was limited to one<br>instruction 44.6% of the cycles, which corresponds to 272 cycles.<br><br>If we look at section "Dynamic Dispatch Stall Cycles", we can see how counter<br>SCHEDQ reports 272 cycles.  Counter SCHEDQ is incremented every time the dispatch<br>logic is unable to dispatch a full group of two instructions because the<br>scheduler's queue is full.<br><br>Section "Scheduler's queue usage" shows how the maximum number of buffer entries<br>(i.e. scheduler's queue entries) used at runtime for resource JFPU01 reached its<br>maximum. Note that AMD Jaguar implements three schedulers:<br>  * JALU01 - A scheduler for ALU instructions<br>  * JLSAGU - A scheduler for address generation<br>  * JFPU01 - A scheduler floating point operations.<br><br>The dot-product is a kernel of three floating point instructions (a vector<br>multiply followed by two horizontal adds).  That explains why only the floating<br>point scheduler appears to be used according to section "Scheduler's queue<br>usage".<br><br>A full scheduler's queue is either caused by data dependency chains, or by a<br>sub-optimal usage of hardware resources.  Sometimes, resource pressure can be<br>mitigated by rewriting the kernel using different instructions that consume<br>different scheduler resources.  Schedulers with a small queue are less resilient<br>to bottlenecks caused by the presence of long data dependencies.<br><br>In this example, we can conclude that the IPC is mostly limited by data<br>dependencies, and not by resource pressure.<br><br>LLVM-MCA instruction flow<br>-------------------------<br><br>This section describes the instruction flow through the out-of-order backend, as<br>well as the functional units involved in the process. <br><br>An instruction goes through a default sequence of stages:<br>    - Dispatch (Instruction is dispatched to the schedulers).<br>    - Issue (Instruction is issued to the processor pipelines).<br>    - Write Back (Instruction is executed, and results are written back).<br>    - Retire (Instruction is retired; writes are architecturally committed).<br><br>The tool only models the out-of-order portion of a processor. Therefore, the<br>instruction fetch and decode stages are not modeled. Performance bottlenecks in<br>the frontend are not diagnosed by this tool.  The tool assumes that instructions<br>have all been decoded and placed in a queue.  Also, the tool doesn't know<br>anything about branch prediction.<br><br>Instruction Dispatch<br>--------------------<br><br>During the Dispatch stage, instructions are picked in program order from a queue<br>of already decoded instructions, and dispatched in groups to the hardware<br>schedulers.  The dispatch logic is implemented by class DispatchUnit in file<br>Dispatch.h.<br><br>The size of a dispatch group depends on the availability of hardware resources,<br>and it cannot exceed the value of field 'DispatchWidth' in class DispatchUnit.<br>Note that field DispatchWidth defaults to the value of field 'IssueWidth' from<br>the scheduling model.<br><br>Users can override the DispatchWidth value with flag "-dispatch=<N>" (where 'N'<br>is an unsigned quantity).<br><br>An instruction can be dispatched if:<br> - The size of the dispatch group is smaller than DispatchWidth<br> - There are enough entries in the reorder buffer<br> - There are enough temporary registers to do register renaming<br> - Schedulers are not full.<br><br>Scheduling models don't describe register files, and therefore the tool doesn't<br>know if there is more than one register file, and how many temporaries are<br>available for register renaming.<br><br>By default, the tool (optimistically) assumes a single register file with an<br>unbounded number of temporary registers.  Users can limit the number of<br>temporary registers available for register renaming using flag<br>`-register-file-size=<N>`, where N is the number of temporaries.  A value of<br>zero for N means 'unbounded'.  Knowing how many temporaries are available for<br>register renaming, the tool can predict dispatch stalls caused by the lack of<br>temporaries.<br><br>The number of reorder buffer entries consumed by an instruction depends on the<br>number of micro-opcodes it specifies in the target scheduling model (see field<br>'NumMicroOpcodes' of tablegen class ProcWriteResources and its derived classes;<br>TargetSchedule.td).<br><br>The reorder buffer is implemented by class RetireControlUnit (see Dispatch.h).<br>Its goal is to track the progress of instructions that are "in-flight", and<br>retire instructions in program order.  The number of entries in the reorder<br>buffer defaults to the value of field 'MicroOpBufferSize' from the target<br>scheduling model.<br><br>Instructions that are dispatched to the schedulers consume scheduler buffer<br>entries.  The tool queries the scheduling model to figure out the set of<br>buffered resources consumed by an instruction.  Buffered resources are treated<br>like "scheduler" resources, and the field 'BufferSize' (from the processor<br>resource tablegen definition) defines the size of the scheduler's queue.<br><br>Zero latency instructions (for example NOP instructions) don't consume scheduler<br>resources.  However, those instructions still reserve a number of slots in the<br>reorder buffer.<br><br>Instruction Issue<br>-----------------<br><br>As mentioned in the previous section, each scheduler resource implements a queue<br>of instructions.  An instruction has to wait in the scheduler's queue until<br>input register operands become available.  Only at that point, does the<br>instruction becomes eligible for execution and may be issued (potentially<br>out-of-order) to a pipeline for execution.<br><br>Instruction latencies can be computed by the tool with the help of the<br>scheduling model; latency values are defined by the scheduling model through<br>ProcWriteResources objects.<br><br>Class Scheduler (see file Scheduler.h) knows how to emulate multiple processor<br>schedulers.  A Scheduler is responsible for tracking data dependencies, and<br>dynamically select which processor resources are consumed/used by instructions.<br><br>Internally, the Scheduler class delegates the management of processor resource<br>units and resource groups to the ResourceManager class.  ResourceManager is also<br>responsible for selecting resource units that are effectively consumed by<br>instructions.  For example, if an instruction consumes 1cy of a resource group,<br>the ResourceManager object selects one of the available units from the group; by<br>default, it uses a round-robin selector to guarantee that resource usage is<br>uniformly distributed between all units of a group.<br><br>Internally, class Scheduler implements three instruction queues:<br>  - WaitQueue: a queue of instructions whose operands are not ready yet.<br>  - ReadyQueue: a queue of instructions ready to execute.<br>  - IssuedQueue: a queue of instructions executing.<br><br>Depending on the operands availability, instructions that are dispatched to the<br>Scheduler are either placed into the WaitQueue or into the ReadyQueue.<br><br>Every cycle, class Scheduler checks if instructions can be moved from the<br>WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be<br>issued to the underlying pipelines.  The algorithm prioritizes older<br>instructions over younger instructions.<br><br>Objects of class ResourceState (see Scheduler.h) describe processor resources.<br>There is an instance of class ResourceState for each single processor resource<br>specified by the scheduling model.  A ResourceState object for a processor<br>resource with multiple units dynamically tracks the availability of every single<br>unit.  For example, the ResourceState of a resource group tracks the<br>availability of every resource in that group.  Internally, ResourceState<br>implements a round-robin selector to dynamically pick the next unit to use from<br>the group.<br><br>Write-Back and Retire Stage<br>---------------------------<br><br>Issued instructions are moved from the ReadyQueue to the IssuedQueue.  There,<br>instructions wait until they reach the write-back stage.  At that point, they<br>get removed from the queue and the retire control unit is notified.<br><br>On the event of "instruction executed", the retire control unit flags the<br>instruction as "ready to retire".<br><br>Instruction are retired in program order; an "instruction retired" event is sent<br>to the register file which frees the temporary registers allocated for the<br>instruction at register renaming stage.<br><br>Load/Store Unit and Memory Consistency Model<br>--------------------------------------------<br><br>The tool attempts to emulate out-of-order execution of memory operations.  Class<br>LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues for<br>speculative execution of loads and stores.<br> <br>Each load (or store) consumes an entry in the load (or store) queue.  The number<br>of slots in the load/store queues is unknown by the tool, since there is no<br>mention of it in the scheduling model.  In practice, users can specify flag<br>`-lqueue=N` (vic. `-squeue=N`) to limit the number of entries in the queue to be<br>equal to exactly N (an unsigned value). If N is zero, then the tool assumes an<br>unbounded queue (this is the default).<br><br>LSUnit implements a relaxed consistency model for memory loads and stores. The<br>rules are:<br>1) A younger load is allowed to pass an older load only if there is no<br>   intervening store in between the two loads.<br>2) An younger store is not allowed to pass an older store.<br>3) A younger store is not allowed to pass an older load.<br>4) A younger load is allowed to pass an older store provided that the load does<br>   not alias with the store.<br><br>By default, this class conservatively (i.e. pessimistically) assumes that loads<br>always may-alias store operations.  Essentially, this LSUnit doesn't perform any<br>sort of alias analysis to rule out cases where loads and stores don't overlap<br>with each other.  The downside of this approach however is that younger loads are<br>never allowed to pass older stores.  To make it possible for a younger load to<br>pass an older store, users can use the command line flag -noalias.  Under<br>'noalias', a younger load is always allowed to pass an older store.<br><br>Note that, in the case of write-combining memory, rule 2. could be relaxed a bit<br>to allow reordering of non-aliasing store operations.  That being said, at the<br>moment, there is no way to further relax the memory model (flag -noalias is the<br>only option).  Essentially, there is no option to specify a different memory<br>type (for example: write-back, write-combining, write-through; etc.) and<br>consequently to weaken or strengthen the memory model.<br><br>Other limitations are:<br> * LSUnit doesn't know when store-to-load forwarding may occur.<br> * LSUnit doesn't know anything about the cache hierarchy and memory types.<br> * LSUnit doesn't know how to identify serializing operations and memory fences.<br><br>No assumption is made on the store buffer size.  As mentioned before, LSUnit<br>conservatively assumes a may-alias relation between loads and stores, and it<br>doesn't attempt to identify cases where store-to-load forwarding would occur in<br>practice.<br><br>LSUnit doesn't attempt to predict whether a load or store hits or misses the L1<br>cache.  It only knows if an instruction "MayLoad" and/or "MayStore".  For loads,<br>the scheduling model provides an "optimistic" load-to-use latency (which usually<br>matches the load-to-use latency for when there is a hit in the L1D).<br><br>Class MCInstrDesc in LLVM doesn't know about serializing operations, nor<br>memory-barrier like instructions.  LSUnit conservatively assumes that an<br>instruction which has both 'MayLoad' and 'UnmodeledSideEffects' behaves like a<br>"soft" load-barrier.  That means, it serializes loads without forcing a flush of<br>the load queue.  Similarly, instructions flagged with both 'MayStore' and<br>'UnmodeledSideEffects' are treated like store barriers.  A full memory barrier<br>is a 'MayLoad' and 'MayStore' instruction with 'UnmodeledSideEffects'.  This is<br>inaccurate, but it is the best that we can do at the moment with the current<br>information available in LLVM.<br><br>A load/store barrier consumes one entry of the load/store queue.  A load/store<br>barrier enforces ordering of loads/stores.  A younger load cannot pass a load<br>barrier.  Also, a younger store cannot pass a store barrier.  A younger load has<br>to wait for the memory/load barrier to execute.  A load/store barrier is<br>"executed" when it becomes the oldest entry in the load/store queue(s). That<br>also means, by construction, all the older loads/stores have been executed.<br><br>In conclusion the full set of rules is:<br>  1. A store may not pass a previous store.<br>  2. A load may not pass a previous store unless flag 'NoAlias' is set.<br>  3. A load may pass a previous load.<br>  4. A store may not pass a previous load (regardless of flag 'NoAlias').<br>  5. A load has to wait until an older load barrier is fully executed.<br>  6. A store has to wait until an older store barrier is fully executed.<br><br>Known limitations<br>-----------------<br>Previous sections described cases where the tool is missing information to give<br>an accurate report.  For example, the first sections of this document explained<br>how the lack of knowledge about the processor negatively affects the performance<br>analysis.  The lack of knowledge is often a consequence of how scheduling models<br>are defined; as mentioned before, scheduling models intentionally don't describe<br>processors in fine details.<br><br>The accuracy of the performance analysis is also affected by assumptions made by<br>the processor model used by the tool.<br><br>Most recent Intel and AMD processors implement dedicated LoopBuffer/OpCache in<br>the hardware frontend to speedup the throughput in the presence of tight loops.<br>The presence of these buffers complicates the decoding logic, and requires<br>knowledge on the branch predictor too.  Class 'SchedMachineModel' in tablegen<br>provides a field named 'LoopMicroOpBufferSize' which is used to describe loop<br>buffers.  However, the purpose of that field is to enable loop unrolling of<br>tight loops; essentially, it affects the cost model used by pass loop-unroll.<br><br>By design, the tool only cares about the out-of-order portion of a processor,<br>and consequently doesn't try to predict the frontend throughput.  Processors may<br>implement complex decoding schemes; statically predicting the frontend<br>throughput is in general beyond the scope of this tool.  For the same reasons,<br>this tool intentionally doesn't model branch prediction.  That being said, this<br>tool could be definitely extended in future to also account for the hardware<br>frontend when doing performance analysis.  This would inevitably require extra<br>(extensive) processor knowledge related to all the available decoding paths in<br>the hardware frontend.<br><br>When computing the IPC, the tool assumes a zero-latency "perfect" fetch&decode<br>stage; the full sequence of decoded instructions is immediately visible to the<br>dispatch logic from the start.<br><br>The tool doesn't know about simultaneous mutithreading.  According to the tool,<br>processor resources are not statically/dynamically partitioned.  Processor<br>resources are fully available to the hardware thread executing the<br>microbenchmark.<br><br>The execution model implemented by this tool assumes that instructions are<br>firstly dispatched in groups to hardware schedulers, and then issued to<br>pipelines for execution.  The model assumes dynamic scheduling of instructions.<br>Instructions are placed in a queue and potentially executed out-of-order (based<br>on the operand availability). The dispatch stage is definitely distinct from the<br>issue stage.<br><br>This model doesn't correctly describe processors where the dispatch/issue is a<br>single stage. This is what happens for example in VLIW processors, where<br>instructions are packaged and statically scheduled at compile time; it is up to<br>the compiler to predict the latency of instructions and package issue groups<br>accordingly. For such targets, there is no dynamic scheduling done by the<br>hardware.<br><br>Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted to<br>support processors with a single dispatch/issue stage. The execution flow would<br>require some changes in the way how existing components (i.e.  DispatchUnit,<br>Scheduler, etc.) interact. This can be a future development.<br><br>The following sections describes other known limitations.  The goal is not to<br>provide an extensive list of limitations; we want to report what we believe are<br>the most important limitations, and suggest possible methods to overcome them.<br><br>Load/Store barrier instructions and serializing operations<br>----------------------------------------------------------<br>Section "Load/Store Unit and Memory Consistency Model" already mentioned how<br>LLVM doesn't know about serializing operations and memory barriers.  Most of it<br>boils down to the fact that class MCInstrDesc (intentionally) doesn't expose<br>those properties.  Instead, both serializing operations and memory barriers<br>"have side-effects" according to MCInstrDesc.  That is because, at least for<br>scheduling purposes, knowing that an instruction has unmodeled side effects is<br>often enough to treat the instruction like a compiler scheduling barrier.<br><br>A performance analysis tool could use the extra knowledge on barriers and<br>serializing operations to generate a more accurate performance report.  One way<br>to improve this is by reserving a couple of bits in field 'Flags' from class<br>MCInstrDesc: one bit for barrier operations, and another bit to mark<br>instructions as serializing operations.<br><br>Lack of support for instruction itineraries<br>-------------------------------------------<br>The current version of the tool doesn't know how to process instruction<br>itineraries.  This is probably one of the most important limitations, since it<br>affects a few out-of-order processors in LLVM.<br><br>As mentioned in section 'Instruction Issue', class Scheduler delegates to an<br>instance of class ResourceManager the handling of processor resources.<br>ResourceManager is where most of the scheduling logic is implemented.<br><br>Adding support for instruction itineraries requires that we teach<br>ResourceManager how to handle functional units and instruction stages.  This<br>development can be a future extension, and it would probably require a few<br>changes to the ResourceManager interface.<br><br>Instructions that affect control flow are not correctly modeled<br>---------------------------------------------------------------<br>Examples of instructions that affect the control flow are: return, indirect<br>branches, calls, etc.  The tool doesn't try to predict/evaluate branch targets.<br>In particular, the tool doesn't model any sort of branch prediction, nor does it<br>attempt to track changes to the program counter.  The tool always assumes that<br>the input assembly sequence is the body of a microbenchmark (a simple loop<br>executed for a number of iterations). The "next" instruction in sequence is<br>always the next instruction to dispatch.<br><br>Call instructions default to an arbitrary high latency of 100cy. A warning is<br>generated if the tool encounters a call instruction in the sequence.  Return<br>instructions are not evaluated, and therefore control flow is not affected.<br>However, the tool still queries the processor scheduling model to obtain latency<br>information for instructions that affect the control flow.<br><br>Possible extensions to the scheduling model<br>-------------------------------------------<br>Section "Instruction Dispatch" explained how the tool doesn't know about the<br>register files, and temporaries available in each register file for register<br>renaming purposes.<br><br>The LLVM scheduling model could be extended to better describe register files.<br>Ideally, scheduling model should be able to define:<br> - The size of each register file<br> - How many temporary registers are available for register renaming<br> - How register classes map to register files<br><br>The scheduling model doesn't specify the retire throughput (i.e. how many<br>instructions can be retired every cycle).  Users can specify flag<br>`-max-retire-per-cycle=<uint>` to limit how many instructions the retire control<br>unit can retire every cycle.  Ideally, every processor should be able to specify<br>the retire throughput (for example, by adding an extra field to the scheduling<br>model tablegen class).<br><br>Known limitations on X86 processors<br>-----------------------------------<br><br>1) Partial register updates versus full register updates.<br><br>On x86-64, a 32-bit GPR write fully updates the super-register. Example:<br>      add %edi %eax    ## eax += edi<br><br>Here, register %eax aliases the lower half of 64-bit register %rax. On x86-64,<br>register %rax is fully updated by the 'add' (the upper half of %rax is zeroed).<br>Essentially, it "kills" any previous definition of (the upper half of) register<br>%rax.<br><br>On the other hand, 8/16 bit register writes only perform a so-called "partial<br>register update". Example:<br>      add %di, %ax     ## ax += di<br><br>Here, register %eax is only partially updated. To be more specific, the lower<br>half of %eax is set, and the upper half is left unchanged. There is also no<br>change in the upper 48 bits of register %rax.<br><br>To get accurate performance analysis, the tool has to know which instructions<br>perform a partial register update, and which instructions fully update the<br>destination's super-register.<br><br>One way to expose this information is (again) via tablegen.  For example, we<br>could add a flag in the tablegen instruction class to tag instructions that<br>perform partial register updates. Something like this: 'bit<br>hasPartialRegisterUpdate = 1'. However, this would force a `let<br>hasPartialRegisterUpdate = 0` on several instruction definitions.<br><br>Another approach is to have a MCSubtargetInfo hook similar to this:<br>    virtual bool updatesSuperRegisters(unsigned short opcode) { return false; }<br><br>Targets will be able to override this method if needed.  Again, this is just an<br>idea. But the plan is to have this fixed as a future development.<br><br>2) Macro Op fusion.<br><br>The tool doesn't know about macro-op fusion. On modern x86 processors, a<br>'cmp/test' followed by a 'jmp' is fused into a single macro operation.  The<br>advantage is that the fused pair only consumes a single slot in the dispatch<br>group. <br><br>As a future development, the tool should be extended to address macro-fusion.<br>Ideally, we could have LLVM generate a table enumerating all the opcode pairs<br>that can be fused together. That table could be exposed to the tool via the<br>MCSubtargetInfo interface.  This is just an idea; there may be better ways to<br>implement this.<br><br>3) Intel processors: mixing legacy SSE with AVX instructions.<br><br>On modern Intel processors with AVX, mixing legacy SSE code with AVX code<br>negatively impacts the performance.  The tool is not aware of this issue, and<br>the performance penalty is not accounted when doing the analysis. This is<br>something that we would like to improve in future.<br><br>4) Zero-latency register moves and Zero-idioms.<br><br>Most modern AMD/Intel processors know how to optimize out register-register<br>moves and zero idioms at register renaming stage. The tool doesn't know<br>about these patterns, and this may negatively impact the performance analysis.<br><br>Known design problems<br>---------------------<br>This section describes two design issues that are currently affecting the tool.<br>The long term plan is to "fix" these issues.<br><br>1) Variant instructions not correctly modeled.<br><br>The tool doesn't know how to analyze instructions with a "variant" scheduling<br>class descriptor. A variant scheduling class needs to be resolved dynamically.<br>The "actual" scheduling class often depends on the subtarget, as well as<br>properties of the specific MachineInstr object.<br><br>Unfortunately, the tool manipulates MCInst, and it doesn't know anything about<br>MachineInstr. As a consequence, the tool cannot use the existing machine<br>subtarget hooks that are normally used to resolve the variant scheduling class.<br>This is a major design issue which mostly affects ARM/AArch64 targets.  It<br>mostly boils down to the fact that the existing scheduling framework was meant<br>to work for MachineInstr.<br><br>When the tool encounters a "variant" instruction, it assumes a generic 1cy<br>latency. However, the tool would not be able to tell which processor resources<br>are effectively consumed by the variant instruction.<br><br>2) MCInst and MCInstrDesc.<br><br>Performance analysis tools require data dependency information to correctly<br>predict the runtime performance of the code. This tool must always be able to<br>obtain the set of implicit/explicit register defs/uses for every instruction of<br>the input assembly sequence.<br><br>In the first section of this document, it was mentioned how the tool takes as<br>input an assembly sequence. That sequence is parsed into a MCInst sequence with<br>the help of assembly parsers available from the targets.<br><br>A MCInst is a very low-level instruction representation. The tool can inspect<br>the MCOperand sequence of an MCInst to identify register operands. However,<br>there is no way to tell register operands that are definitions from register<br>operands that are uses.<br><br>In LLVM, class MCInstrDesc is used to fully describe target instructions and<br>their operands. The opcode of a machine instruction (a MachineInstr object) can<br>be used to query the instruction set through method `MCInstrInfo::get' to obtain<br>the associated MCInstrDesc object.<br><br>However class MCInstrDesc describes properties and operands of MachineInstr<br>objects. Essentially, MCInstrDesc is not meant to be used to describe MCInst<br>objects.  To be more specific, MCInstrDesc objects are automatically generated<br>via tablegen from the instruction set description in the target .td files.  For<br>example, field `MCInstrDesc::NumDefs' is always equal to the cardinality of the<br>`(outs)` set from the tablegen instruction definition.<br><br>By construction, register definitions always appear at the beginning of the<br>MachineOperands list in MachineInstr. Basically, the (outs) are the first<br>operands of a MachineInstr, and the (ins) will come after in the machine operand<br>list. Knowing the number of register definitions is enough to identify<br>all the register operands that are definitions.<br><br>In a normal compilation process, MCInst objects are generated from MachineInstr<br>objects through a lowering step. By default the lowering logic simply iterates<br>over the machine operands of a MachineInstr, and converts/expands them into<br>equivalent MCOperand objects.<br><br>The default lowering strategy has the advantage of preserving all of the<br>above mentioned assumptions on the machine operand sequence. That means, register<br>definitions would still be at the beginning of the MCOperand sequence, and<br>register uses would come after.<br><br>Targets may still define custom lowering routines for specific opcodes. Some of<br>these routines may lower operands in a way that potentially breaks (some of) the<br>assumptions on the machine operand sequence which were valid for MachineInstr.<br>Luckily, this is not the most common form of lowering done by the targets, and<br>the vast majority of the MachineInstr are lowered based on the default strategy<br>which preserves the original machine operand sequence.  This is especially true<br>for x86, where the custom lowering logic always preserves the original (i.e.<br>from the MachineInstr) operand sequence.<br><br>This tool currently works under the strong (and potentially incorrect)<br>assumption that register def/uses in a MCInst can always be identified by<br>querying the machine instruction descriptor for the opcode. This assumption made<br>it possible to develop this tool and get good numbers at least for the<br>processors available in the x86 backend.<br><br>That being said, the analysis is still potentially incorrect for other targets.<br>So we plan (with the help of the community) to find a proper mechanism to map<br>when possible MCOperand indices back to MachineOperand indices of the equivalent<br>MachineInstr.  This would be equivalent to describing changes made by the<br>lowering step which affected the operand sequence. For example, we could have an<br>index for every register MCOperand (or -1, if the operand didn't exist in the<br>original MachineInstr). The mapping could look like this <0,1,3,2>.  Here,<br>MCOperand #2 was obtained from the lowering of MachineOperand #3. etc.<br><br>This information could be automatically generated via tablegen for all the<br>instructions whose custom lowering step breaks assumptions made by the tool on<br>the register operand sequence (In general, these instructions should be the<br>minority of a target's instruction set). Unfortunately, we don't have that<br>information now.  As a consequence, we assume that the number of explicit<br>register definitions is the same number specified in MCInstrDesc.  We also<br>assume that register definitions always come first in the operand sequence.<br><br>In conclusion: these are for now the strong assumptions made by the tool:<br>  * The number of explicit and implicit register definitions in a MCInst<br>    matches the number of explicit and implicit definitions specified by the<br>    MCInstrDesc object.<br>  * Register uses always come after register definitions.<br>  * If an opcode specifies an optional definition, then the optional<br>    definition is always the last register operand in the sequence.<br><br>Note that some of the information accessible from the MCInstrDesc is always<br>valid for MCInst. For example: implicit register defs, implicit register uses<br>and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties still apply to<br>MCInst. The tool knows about this, and uses that information during its<br>analysis.<br><br>What to do next<br>---------------<br>The source code has been uploaded for review on phabricator at this link: <a href="https://reviews.llvm.org/D43951">https://reviews.llvm.org/D43951</a>.<br><br>The review covers two patches:<br>A first (very small) patch that always enables the generation of processor<br>resource names in the SubtargetEmitter. Currently, the processor resource names<br>are only generated for debugging purposes, but are needed by the tool to<br>generate user friendly reports, so we would like to always generate them.<br>A second patch with the actual static analysis tool (in llvm/tools).<br><br>Once these first two patches are committed, the plan is to keep working on the<br>tool with the help of the community to address all of the limitations described<br>by the previous sections, and find good solutions/fixes for the design issues<br>described by section "Known design problems".<br><br>We hope the community will find this tool useful like we have.<br><br>Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who really<br>helped me a lot by suggesting improvements and testing the tool.<br><br>Thanks for your time.<br>-Andrea<br><br></div>