[llvm] r367853 - [MCA][doc] Add a section for the 'Bottleneck Analysis'.
Andrea Di Biagio via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 5 06:18:37 PDT 2019
Author: adibiagio
Date: Mon Aug 5 06:18:37 2019
New Revision: 367853
URL: http://llvm.org/viewvc/llvm-project?rev=367853&view=rev
Log:
[MCA][doc] Add a section for the 'Bottleneck Analysis'.
Also clarify the meaning of 'Block RThroughput' and 'RThroughput'.
Modified:
llvm/trunk/docs/CommandGuide/llvm-mca.rst
Modified: llvm/trunk/docs/CommandGuide/llvm-mca.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CommandGuide/llvm-mca.rst?rev=367853&r1=367852&r2=367853&view=diff
==============================================================================
--- llvm/trunk/docs/CommandGuide/llvm-mca.rst (original)
+++ llvm/trunk/docs/CommandGuide/llvm-mca.rst Mon Aug 5 06:18:37 2019
@@ -373,17 +373,28 @@ overview of the performance throughput.
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).
+Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
+to the out-of-order backend every simulated cycle.
+
IPC is computed dividing the total number of simulated instructions by the total
-number of cycles. In the absence of loop-carried data dependencies, the
-observed IPC tends to a theoretical maximum which can be computed by dividing
-the number of instructions of a single iteration by the *Block RThroughput*.
+number of cycles.
+
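+As a purely illustrative sketch (the numbers below are hypothetical, not taken
+from the report above), a run that retires 1500 instructions in 1000 simulated
+cycles would report:
+
+.. code-block:: none
+
+  IPC = (simulated instructions) / (total cycles) = 1500 / 1000 = 1.50
+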
+Field *Block RThroughput* is the reciprocal of the block throughput. Block
+throughput is a theoretical quantity computed as the maximum number of blocks
+(i.e. iterations) that can be executed per simulated clock cycle in the absence
+of loop-carried dependencies. Block throughput is bounded from above by the
+dispatch rate and by the availability of hardware resources.
+
+In the absence of loop-carried data dependencies, the observed IPC tends to a
+theoretical maximum which can be computed by dividing the number of instructions
+of a single iteration by the `Block RThroughput`.
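+
+As an illustration, for a hypothetical block of three single micro opcode
+instructions with a Block RThroughput of 2.00 cycles (comparable to the
+dot-product kernel discussed below), the theoretical maximum would be:
+
+.. code-block:: none
+
+  max IPC = (instructions per iteration) / (Block RThroughput) = 3 / 2.00 = 1.50
+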
Field 'uOps Per Cycle' is computed dividing the total number of simulated micro
opcodes by the total number of cycles. A delta between Dispatch Width and this
field is an indicator of a performance issue. In the absence of loop-carried
data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
maximum throughput which can be computed by dividing the number of uOps of a
-single iteration by the *Block RThroughput*.
+single iteration by the `Block RThroughput`.
Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
@@ -392,12 +403,12 @@ availability of hardware resources affec
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
-*Block RTrhoughput*) is an indicator of a performance bottleneck caused by the
+`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.
In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
-are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
+are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
@@ -409,6 +420,13 @@ throughput of every instruction in the s
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
+Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
+is computed as the maximum number of instructions of the same type that can be
+executed per clock cycle in the absence of operand dependencies. In this
+example, the reciprocal throughput of a vector float multiply is 1
+cycle/instruction. That is because the FP multiplier JFPM is only available
+from pipeline JFPU1.
+
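+As a rough sketch (not actual tool output), when the underlying resource is
+fully pipelined, the reciprocal throughput of an opcode follows from the number
+of equivalent pipelines that can execute it:
+
+.. code-block:: none
+
+  throughput   =  number of pipelines able to execute the opcode (per cycle)
+  RThroughput  =  1 / throughput
+
+  vector FP multiply:  JFPM on JFPU1 only  ->  throughput 1/cy  ->  RThroughput 1.00
+  (an opcode issued to two equivalent pipelines would have RThroughput 0.50)
+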
The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
@@ -540,6 +558,61 @@ resources, the delta between the two cou
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
+Bottleneck Analysis
+^^^^^^^^^^^^^^^^^^^
+The ``-bottleneck-analysis`` command line option enables the analysis of
+performance bottlenecks.
+
+This analysis is potentially expensive. It attempts to correlate increases in
+backend pressure (caused by pipeline resource pressure and data dependencies)
+with
+dynamic dispatch stalls.
+
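+Assuming the dot-product input used throughout this document (the file name
+below is illustrative), the report can be produced with a command along these
+lines:
+
+.. code-block:: none
+
+  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s
+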
+Below is an example of ``-bottleneck-analysis`` output generated by
+:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
+
+.. code-block:: none
+
+
+ Cycles with backend pressure increase [ 48.07% ]
+ Throughput Bottlenecks:
+ Resource Pressure [ 47.77% ]
+ - JFPA [ 47.77% ]
+ - JFPU0 [ 47.77% ]
+ Data Dependencies: [ 0.30% ]
+ - Register Dependencies [ 0.30% ]
+ - Memory Dependencies [ 0.00% ]
+
+ Critical sequence based on the simulation:
+
+ Instruction Dependency Information
+ +----< 2. vhaddps %xmm3, %xmm3, %xmm4
+ |
+ | < loop carried >
+ |
+ | 0. vmulps %xmm0, %xmm1, %xmm2
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+ +----> 2. vhaddps %xmm3, %xmm3, %xmm4 ## REGISTER dependency: %xmm3
+ |
+ | < loop carried >
+ |
+ +----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+
+
+According to the analysis, throughput is limited by resource pressure and not by
+data dependencies. The analysis observed increases in backend pressure during
+48.07% of the simulated run. Almost all those pressure increase events were
+caused by contention on processor resources JFPA/JFPU0.
+
+The `critical sequence` is the most expensive sequence of instructions according
+to the simulation. It is annotated to provide extra information about critical
+register dependencies and resource interferences between instructions.
+
+Instructions from the critical sequence are expected to significantly impact
+performance. By construction, the accuracy of this analysis is strongly
+dependent on the simulation and (as always) on the quality of the processor
+model in llvm.
+
+
Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance