[llvm] r337496 - [llvm-mca][docs] Add Timeline and How MCA works.
Matt Davis via llvm-commits
llvm-commits at lists.llvm.org
Thu Jul 19 13:33:59 PDT 2018
Author: mattd
Date: Thu Jul 19 13:33:59 2018
New Revision: 337496
URL: http://llvm.org/viewvc/llvm-project?rev=337496&view=rev
Log:
[llvm-mca][docs] Add Timeline and How MCA works.
For the most part, these changes were from the RFC. I made a few minor
word/structure changes, but nothing significant. I also regenerated the
example output, and adjusted the text accordingly.
Differential Revision: https://reviews.llvm.org/D49527
Modified:
llvm/trunk/docs/CommandGuide/llvm-mca.rst
Modified: llvm/trunk/docs/CommandGuide/llvm-mca.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CommandGuide/llvm-mca.rst?rev=337496&r1=337495&r2=337496&view=diff
==============================================================================
--- llvm/trunk/docs/CommandGuide/llvm-mca.rst (original)
+++ llvm/trunk/docs/CommandGuide/llvm-mca.rst Thu Jul 19 13:33:59 2018
@@ -21,9 +21,9 @@ The main goal of this tool is not just t
when run on the target, but also help with diagnosing potential performance
issues.
-Given an assembly code sequence, llvm-mca estimates the IPC (Instructions Per
-Cycle), as well as hardware resource pressure. The analysis and reporting style
-were inspired by the IACA tool from Intel.
+Given an assembly code sequence, llvm-mca estimates the IPC, as well as
+hardware resource pressure. The analysis and reporting style were inspired by
+the IACA tool from Intel.
:program:`llvm-mca` allows the usage of special code comments to mark regions of
the assembly code to be analyzed. A comment starting with substring
@@ -207,3 +207,223 @@ EXIT STATUS
:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.
+HOW MCA WORKS
+-------------
+
+MCA takes assembly code as input. The assembly code is parsed into a sequence
+of ``MCInst`` instructions with the help of the existing LLVM target assembly
+parsers. The parsed sequence of instructions is then analyzed by a
+``Pipeline`` module to generate a performance report.
+
+The Pipeline module simulates the execution of the machine code sequence in a
+loop of iterations (default is 100). During this process, the pipeline collects
+a number of execution-related statistics. At the end of this process, the
+pipeline generates and prints a report from the collected statistics.
+
+Here is an example of a performance report generated by MCA for a dot-product
+of two packed float vectors of four elements. The analysis is conducted for
+target x86, cpu btver2. The report below can be produced with the following
+command, using the example located at
+``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
+
+.. code-block:: bash
+
+ $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
+
+.. code-block:: none
+
+ Iterations: 300
+ Instructions: 900
+ Total Cycles: 610
+ Dispatch Width: 2
+ IPC: 1.48
+ Block RThroughput: 2.0
+
+
+ Instruction Info:
+ [1]: #uOps
+ [2]: Latency
+ [3]: RThroughput
+ [4]: MayLoad
+ [5]: MayStore
+ [6]: HasSideEffects (U)
+
+ [1] [2] [3] [4] [5] [6] Instructions:
+ 1 2 1.00 vmulps %xmm0, %xmm1, %xmm2
+ 1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3
+ 1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4
+
+
+ Resources:
+ [0] - JALU0
+ [1] - JALU1
+ [2] - JDiv
+ [3] - JFPA
+ [4] - JFPM
+ [5] - JFPU0
+ [6] - JFPU1
+ [7] - JLAGU
+ [8] - JMul
+ [9] - JSAGU
+ [10] - JSTC
+ [11] - JVALU0
+ [12] - JVALU1
+ [13] - JVIMUL
+
+
+ Resource pressure per iteration:
+ [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
+ - - - 2.00 1.00 2.00 1.00 - - - - - - -
+
+ Resource pressure by instruction:
+ [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
+ - - - - 1.00 - 1.00 - - - - - - - vmulps %xmm0, %xmm1, %xmm2
+ - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm2, %xmm2, %xmm3
+ - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4
+
+According to this report, the dot-product kernel has been executed 300 times,
+for a total of 900 dynamically executed instructions.
+
+The report is structured in three main sections. The first section collects a
+few performance numbers; the goal of this section is to give a quick overview
+of the performance throughput. In this example, the two important performance
+indicators are the predicted total number of cycles and the Instructions Per
+Cycle (IPC). IPC is probably the most important throughput indicator. A big
+delta between the Dispatch Width and the computed IPC is an indicator of
+potential performance issues.
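The first-section figures can be cross-checked by hand. The following minimal
Python sketch (using only numbers copied from the report above, not MCA
itself) reproduces the reported IPC and shows its relationship to the
Dispatch Width:

```python
# Figures taken from the example report above.
iterations = 300
instructions_per_iteration = 3          # vmulps + two vhaddps
total_instructions = iterations * instructions_per_iteration   # 900
total_cycles = 610
dispatch_width = 2

# IPC is simply dynamically executed instructions divided by cycles.
ipc = total_instructions / total_cycles
print(round(ipc, 2))                    # 1.48, matching the reported IPC

# The dispatch width is an upper bound on the achievable IPC; a large
# gap between the two values hints at a performance problem.
print(dispatch_width - ipc)
```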
+
+The second section of the report shows the latency and reciprocal
+throughput of every instruction in the sequence. That section also reports
+extra information related to the number of micro opcodes, and opcode properties
+(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
+
+The third section is the *Resource pressure view*. This view reports the
+average number of resource cycles consumed in every iteration by instructions,
+for every processor resource unit available on the target. Information is
+structured in two tables. The first table reports the number of resource
+cycles spent, on average, in every iteration. The second table correlates the
+resource cycles to the machine instructions in the sequence. For example,
+every iteration of the vmulps instruction always executes on resource unit [6]
+(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
+per iteration. Note that on Jaguar, vector floating-point multiplies can only
+be issued to pipeline JFPU1, while horizontal floating-point additions can
+only be issued to pipeline JFPU0.
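The relationship between the two tables can be sketched directly: summing each
column of the per-instruction table reproduces the per-iteration row. A small
Python sketch, using the resource names and values from the report above (the
dictionary layout is this example's, not MCA's internal representation):

```python
# Resource cycles consumed by each instruction in one iteration, as
# listed in the "Resource pressure by instruction" table above.
pressure_by_instruction = [
    {"JFPM": 1.00, "JFPU1": 1.00},   # vmulps  %xmm0, %xmm1, %xmm2
    {"JFPA": 1.00, "JFPU0": 1.00},   # vhaddps %xmm2, %xmm2, %xmm3
    {"JFPA": 1.00, "JFPU0": 1.00},   # vhaddps %xmm3, %xmm3, %xmm4
]

# Summing per resource unit reproduces "Resource pressure per iteration".
per_iteration = {}
for usage in pressure_by_instruction:
    for unit, cycles in usage.items():
        per_iteration[unit] = per_iteration.get(unit, 0.0) + cycles

print(per_iteration)
# {'JFPM': 1.0, 'JFPU1': 1.0, 'JFPA': 2.0, 'JFPU0': 2.0}
```

Note how the pressure concentrates on JFPA/JFPU0 (2.00 cycles each per
iteration), exactly the situation the next paragraph warns about.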
+
+The resource pressure view helps with identifying bottlenecks caused by high
+usage of specific hardware resources. Situations with resource pressure mainly
+concentrated on a few resources should, in general, be avoided. Ideally,
+pressure should be uniformly distributed between multiple resources.
+
+Timeline View
+^^^^^^^^^^^^^
+MCA's timeline view produces a detailed report of each instruction's state
+transitions through an instruction pipeline. This view is enabled by the
+command line option ``-timeline``. As instructions transition through the
+various stages of the pipeline, their states are depicted in the view report.
+These states are represented by the following characters:
+
+* D : Instruction dispatched.
+* e : Instruction executing.
+* E : Instruction executed.
+* R : Instruction retired.
+* = : Instruction already dispatched, waiting to be executed.
+* \- : Instruction executed, waiting to be retired.
+
+Below is the timeline view for a subset of the dot-product example located in
+``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
+MCA using the following command:
+
+.. code-block:: bash
+
+ $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
+
+.. code-block:: none
+
+ Timeline view:
+ 012345
+ Index 0123456789
+
+ [0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2
+ [0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3
+ [0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4
+ [1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
+ [1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3
+ [1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4
+ [2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
+ [2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3
+ [2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4
+
+
+ Average Wait times (based on the timeline view):
+ [0]: Executions
+ [1]: Average time spent waiting in a scheduler's queue
+ [2]: Average time spent waiting in a scheduler's queue while ready
+ [3]: Average time elapsed from WB until retire stage
+
+ [0] [1] [2] [3]
+ 0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2
+ 1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3
+ 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
+
+The timeline view is interesting because it shows instruction state changes
+during execution. It also gives an idea of how MCA processes instructions
+executed on the target, and how their timing information might be calculated.
+
+The timeline view is structured in two tables. The first table shows
+instructions changing state over time (measured in cycles); the second table
+(named *Average Wait times*) reports useful timing statistics, which should
+help diagnose performance bottlenecks caused by long data dependencies and
+sub-optimal usage of hardware resources.
+
+An instruction in the timeline view is identified by a pair of indices, where
+the first index identifies an iteration, and the second index is the
+instruction index (i.e., where it appears in the code sequence). Since this
+example was generated using 3 iterations (``-iterations=3``), the iteration
+indices range from 0 to 2 inclusive.
+
+Excluding the first and last column, the remaining columns are in cycles.
+Cycles are numbered sequentially starting from 0.
+
+From the example output above, we know the following:
+
+* Instruction [1,0] was dispatched at cycle 1.
+* Instruction [1,0] started executing at cycle 2.
+* Instruction [1,0] reached the write back stage at cycle 4.
+* Instruction [1,0] was retired at cycle 10.
+
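The four facts above can be read off mechanically from the row for [1,0]. A
hypothetical helper (not part of MCA, and only valid for rows without ``=``
stalls) that decodes a timeline row into those cycle numbers:

```python
def decode_row(row):
    """Recover key cycle numbers from a timeline row such as
    '.DeeE-----R', where column 0 is cycle 0."""
    return {
        "dispatched": row.index("D"),          # first 'D'
        "started_executing": row.index("e"),   # first 'e'
        "written_back": row.index("E"),        # 'E' marks write-back
        "retired": row.index("R"),             # 'R' marks retirement
    }

# Row for instruction [1,0] from the example above (leading '.' is padding).
print(decode_row(".DeeE-----R"))
# {'dispatched': 1, 'started_executing': 2, 'written_back': 4, 'retired': 10}
```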
+Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
+scheduler's queue for the operands to become available. By the time vmulps is
+dispatched, operands are already available, and pipeline JFPU1 is ready to
+serve another instruction. So the instruction can be immediately issued on the
+JFPU1 pipeline. That is demonstrated by the fact that the instruction only
+spent 1cy in the scheduler's queue.
+
+There is a gap of 5 cycles between the write-back stage and the retire event.
+That is because instructions must retire in program order, so [1,0] has to wait
+for [0,2] to be retired first (i.e., it has to wait until cycle 10).
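This is also where the 3.3 reported for vmulps in column [3] of the *Average
Wait times* table comes from; each ``-`` in a row is one cycle spent waiting
to retire, and averaging over the three iterations reproduces the figure:

```python
# Cycles each vmulps instance waited between write-back and retirement,
# read off the timeline rows above (one '-' per wait cycle).
wb_to_retire = {
    "[0,0]": 0,   # DeeER      : retired right after write-back
    "[1,0]": 5,   # DeeE-----R : waits for [0,2] to retire first
    "[2,0]": 5,   # DeeE-----R
}

average = sum(wb_to_retire.values()) / len(wb_to_retire)
print(round(average, 1))   # 3.3, matching column [3] for vmulps
```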
+
+In the example, all instructions are in a RAW (Read After Write) dependency
+chain. Register %xmm2 written by vmulps is immediately used by the first
+vhaddps, and register %xmm3 written by the first vhaddps is used by the second
+vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
+Parallelism).
+
+In the dot-product example, there are anti-dependencies introduced by
+instructions from different iterations. However, those dependencies can be
+removed at the register renaming stage (at the cost of allocating register
+aliases, and therefore consuming temporary registers).
+
+The *Average Wait times* table helps diagnose performance issues caused by the
+presence of long latency instructions and potentially long data dependencies
+which may limit the ILP. Note that MCA, by default, assumes at least 1cy
+between the dispatch event and the issue event.
+
+When the performance is limited by data dependencies and/or long latency
+instructions, the number of cycles spent while in the *ready* state is expected
+to be very small when compared with the total number of cycles spent in the
+scheduler's queue. The difference between the two counters is a good indicator
+of how large of an impact data dependencies had on the execution of the
+instructions. When performance is mostly limited by the lack of hardware
+resources, the delta between the two counters is small. However, the number of
+cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
+especially when compared to other low latency instructions.
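Under those assumptions, the diagnostic reduces to comparing columns [1] and
[2] of the table above. A small sketch (instruction labels are this example's
shorthand for the three instructions in the kernel):

```python
# Columns [1] and [2] of the "Average Wait times" table above:
# (avg cycles in the scheduler's queue, avg cycles in the queue while ready).
avg_wait = {
    "vmulps":    (1.0, 1.0),
    "vhaddps#1": (3.3, 0.7),
    "vhaddps#2": (5.7, 0.0),
}

for name, (in_queue, ready) in avg_wait.items():
    # A large delta means the instruction mostly sat waiting for its input
    # operands (data dependencies); a small delta combined with a long
    # queue time would instead point at a shortage of hardware resources.
    delta = in_queue - ready
    print(f"{name}: waited {delta:.1f}cy for operands")
```

For this kernel the deltas grow along the dependency chain (0.0cy, 2.6cy,
5.7cy), confirming that the RAW dependencies, not resource shortage, dominate.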