[llvm] r367853 - [MCA][doc] Add a section for the 'Bottleneck Analysis'.
Andrea Di Biagio via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 5 06:18:37 PDT 2019
Author: adibiagio
Date: Mon Aug 5 06:18:37 2019
New Revision: 367853
URL: http://llvm.org/viewvc/llvm-project?rev=367853&view=rev
Log:
[MCA][doc] Add a section for the 'Bottleneck Analysis'.
Also clarify the meaning of 'Block RThroughput' and 'RThroughput'.
Modified:
llvm/trunk/docs/CommandGuide/llvm-mca.rst
Modified: llvm/trunk/docs/CommandGuide/llvm-mca.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CommandGuide/llvm-mca.rst?rev=367853&r1=367852&r2=367853&view=diff
==============================================================================
--- llvm/trunk/docs/CommandGuide/llvm-mca.rst (original)
+++ llvm/trunk/docs/CommandGuide/llvm-mca.rst Mon Aug 5 06:18:37 2019
@@ -373,17 +373,28 @@ overview of the performance throughput.
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).
+Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
+to the out-of-order backend every simulated cycle.
+
IPC is computed dividing the total number of simulated instructions by the total
-number of cycles. In the absence of loop-carried data dependencies, the
-observed IPC tends to a theoretical maximum which can be computed by dividing
-the number of instructions of a single iteration by the *Block RThroughput*.
+number of cycles.
+
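+As a purely illustrative sketch (the numbers below are hypothetical, not taken
+from the report above), a run that retires 1500 instructions in 1000 simulated
+cycles would report:
+
+.. code-block:: none
+
+  IPC = (simulated instructions) / (total cycles) = 1500 / 1000 = 1.50
+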
+Field *Block RThroughput* is the reciprocal of the block throughput. Block
+throughput is a theoretical quantity computed as the maximum number of blocks
+(i.e. iterations) that can be executed per simulated clock cycle in the absence
+of loop-carried dependencies. Block throughput is bounded from above by the
+dispatch rate and by the availability of hardware resources.
+
+In the absence of loop-carried data dependencies, the observed IPC tends to a
+theoretical maximum which can be computed by dividing the number of instructions
+of a single iteration by the `Block RThroughput`.
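+
+As an illustration, for a hypothetical block of three single micro opcode
+instructions with a Block RThroughput of 2.00 cycles (comparable to the
+dot-product kernel discussed below), the theoretical maximum would be:
+
+.. code-block:: none
+
+  max IPC = (instructions per iteration) / (Block RThroughput) = 3 / 2.00 = 1.50
+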
Field 'uOps Per Cycle' is computed dividing the total number of simulated micro
opcodes by the total number of cycles. A delta between Dispatch Width and this
field is an indicator of a performance issue. In the absence of loop-carried
data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
maximum throughput which can be computed by dividing the number of uOps of a
-single iteration by the *Block RThroughput*.
+single iteration by the `Block RThroughput`.
Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
@@ -392,12 +403,12 @@ availability of hardware resources affec
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
-*Block RTrhoughput*) is an indicator of a performance bottleneck caused by the
+`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.
In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
-are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
@@ -409,6 +420,13 @@ throughput of every instruction in the s
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
+Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
+is computed as the maximum number of instructions of the same type that can be
+executed per clock cycle in the absence of operand dependencies. In this
+example, the reciprocal throughput of a vector float multiply is 1
+cycle/instruction. That is because the FP multiplier JFPM is only available
+from pipeline JFPU1.
+
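+As a rough sketch (not actual tool output), when the underlying resource is
+fully pipelined, the reciprocal throughput of an opcode follows from the number
+of equivalent pipelines that can execute it:
+
+.. code-block:: none
+
+  throughput   =  number of pipelines able to execute the opcode (per cycle)
+  RThroughput  =  1 / throughput
+
+  vector FP multiply:  JFPM on JFPU1 only  ->  throughput 1/cy  ->  RThroughput 1.00
+  (an opcode issued to two equivalent pipelines would have RThroughput 0.50)
+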
The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
@@ -540,6 +558,61 @@ resources, the delta between the two cou
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
+Bottleneck Analysis
+^^^^^^^^^^^^^^^^^^^
+The ``-bottleneck-analysis`` command line option enables the analysis of
+performance bottlenecks.
+
+This analysis is potentially expensive. It attempts to correlate increases in
+backend pressure (caused by pipeline resource pressure and data dependencies)
+with
+dynamic dispatch stalls.
+
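+Assuming the dot-product input used throughout this document (the file name
+below is illustrative), the report can be produced with a command along these
+lines:
+
+.. code-block:: none
+
+  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s
+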
+Below is an example of ``-bottleneck-analysis`` output generated by
+:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
+
+.. code-block:: none
+
+
+ Cycles with backend pressure increase [ 48.07% ]
+ Throughput Bottlenecks:
+ Resource Pressure [ 47.77% ]
+ - JFPA [ 47.77% ]
+ - JFPU0 [ 47.77% ]
+ Data Dependencies: [ 0.30% ]
+ - Register Dependencies [ 0.30% ]
+ - Memory Dependencies [ 0.00% ]
+
+ Critical sequence based on the simulation:
+
+ Instruction Dependency Information
+ +----< 2. vhaddps %xmm3, %xmm3, %xmm4
+ |
+ | < loop carried >
+ |
+ | 0. vmulps %xmm0, %xmm1, %xmm2
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+ +----> 2. vhaddps %xmm3, %xmm3, %xmm4 ## REGISTER dependency: %xmm3
+ |
+ | < loop carried >
+ |
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+
+
+According to the analysis, throughput is limited by resource pressure and not by
+data dependencies. The analysis observed increases in backend pressure during
+48.07% of the simulated run. Almost all those pressure increase events were
+caused by contention on processor resources JFPA/JFPU0.
+
+The `critical sequence` is the most expensive sequence of instructions according
+to the simulation. It is annotated to provide extra information about critical
+register dependencies and resource interferences between instructions.
+
+Instructions from the critical sequence are expected to significantly impact
+performance. By construction, the accuracy of this analysis is strongly
+dependent on the simulation and (as always) on the quality of the processor
+model in llvm.
+
+
Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance