[llvm] r337648 - [llvm-mca][docs] Add documentation for the statistic outputs from mca. NFC
Matt Davis via llvm-commits
llvm-commits at lists.llvm.org
Sat Jul 21 11:32:47 PDT 2018
Author: mattd
Date: Sat Jul 21 11:32:47 2018
New Revision: 337648
URL: http://llvm.org/viewvc/llvm-project?rev=337648&view=rev
Log:
[llvm-mca][docs] Add documentation for the statistic outputs from mca. NFC
Summary: The original text was lifted from the MCA README. I re-ran the dot-product example and updated the output seen in the docs. I also added a few paragraphs discussing the instruction issued and retired histograms, as well as discussing the register file stats.
Reviewers: andreadb, RKSimon, courbet, gbedwell, filcab
Reviewed By: andreadb
Subscribers: tschuett
Differential Revision: https://reviews.llvm.org/D49614
Modified:
llvm/trunk/docs/CommandGuide/llvm-mca.rst
Modified: llvm/trunk/docs/CommandGuide/llvm-mca.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CommandGuide/llvm-mca.rst?rev=337648&r1=337647&r2=337648&view=diff
==============================================================================
--- llvm/trunk/docs/CommandGuide/llvm-mca.rst (original)
+++ llvm/trunk/docs/CommandGuide/llvm-mca.rst Sat Jul 21 11:32:47 2018
@@ -305,9 +305,9 @@ spent on average every iteration. The se
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
-per iteration. Note that on Jaguar, vector floating-point multiply can only be
-issued to pipeline JFPU1, while horizontal floating-point additions can only be
-issued to pipeline JFPU0.
+per iteration. Note that on AMD Jaguar, vector floating-point multiply can
+only be issued to pipeline JFPU1, while horizontal floating-point additions can
+only be issued to pipeline JFPU0.
The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
@@ -427,3 +427,125 @@ instructions. When performance is mostl
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
+
+Extra Statistics to Further Diagnose Performance Issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The ``-all-stats`` command line option enables extra statistics and performance
+counters for the dispatch logic, the reorder buffer, the retire control unit,
+and the register file.
+
+Below is an example of ``-all-stats`` output generated by MCA for the
+dot-product example discussed in the previous sections.
+
+.. code-block:: none
+
+ Dynamic Dispatch Stall Cycles:
+ RAT - Register unavailable: 0
+ RCU - Retire tokens unavailable: 0
+ SCHEDQ - Scheduler full: 272
+ LQ - Load queue full: 0
+ SQ - Store queue full: 0
+ GROUP - Static restrictions on the dispatch group: 0
+
+
+ Dispatch Logic - number of cycles where we saw N instructions dispatched:
+ [# dispatched], [# cycles]
+ 0, 24 (3.9%)
+ 1, 272 (44.6%)
+ 2, 314 (51.5%)
+
+
+ Schedulers - number of cycles where we saw N instructions issued:
+ [# issued], [# cycles]
+ 0, 7 (1.1%)
+ 1, 306 (50.2%)
+ 2, 297 (48.7%)
+
+
+ Scheduler's queue usage:
+ JALU01, 0/20
+ JFPU01, 18/18
+ JLSAGU, 0/12
+
+
+ Retire Control Unit - number of cycles where we saw N instructions retired:
+ [# retired], [# cycles]
+ 0, 109 (17.9%)
+ 1, 102 (16.7%)
+ 2, 399 (65.4%)
+
+
+ Register File statistics:
+ Total number of mappings created: 900
+ Max number of mappings used: 35
+
+ * Register File #1 -- JFpuPRF:
+ Number of physical registers: 72
+ Total number of mappings created: 900
+ Max number of mappings used: 35
+
+ * Register File #2 -- JIntegerPRF:
+ Number of physical registers: 64
+ Total number of mappings created: 0
+ Max number of mappings used: 0
+
+If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
+SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
+logic is unable to dispatch a group of two instructions because the scheduler's
+queue is full.
+
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able
+to dispatch two instructions 51.5% of the time. The dispatch group was limited
+to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
+dispatch statistics are displayed by using either the command option
+``-all-stats`` or ``-dispatch-stats``.
+
+The next table, *Schedulers*, presents a histogram showing, for each number of
+instructions issued in a cycle, how many cycles saw that issue count. In this
+case, of the 610 simulated cycles, a single instruction was issued in 306
+cycles (50.2%), two instructions were issued in 297 cycles (48.7%), and there
+were 7 cycles where no instructions were issued.
+
+The *Scheduler's queue usage* table shows the maximum number of buffer
+entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
+reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
+three schedulers:
+
+* JALU01 - A scheduler for ALU instructions.
+* JFPU01 - A scheduler for floating point operations.
+* JLSAGU - A scheduler for address generation.
+
+The dot-product is a kernel of three floating point instructions (a vector
+multiply followed by two horizontal adds). That explains why only the floating
+point scheduler appears to be used.
+
+A full scheduler queue is either caused by data dependency chains or by a
+sub-optimal usage of hardware resources. Sometimes, resource pressure can be
+mitigated by rewriting the kernel using different instructions that consume
+different scheduler resources. Schedulers with a small queue are less resilient
+to bottlenecks caused by the presence of long data dependencies.
+The scheduler statistics are displayed by
+using the command option ``-all-stats`` or ``-scheduler-stats``.
+
+The next table, *Retire Control Unit*, presents a histogram showing, for each
+number of instructions retired in a cycle, how many cycles saw that retire
+count. In this case, of the 610 simulated cycles, two instructions were
+retired during the same cycle 399 times (65.4%) and there were 109 cycles
+where no instructions were retired. The retire statistics are displayed by
+using the command option ``-all-stats`` or ``-retire-stats``.
+
+The last table presented is *Register File statistics*. Each physical register
+file (PRF) used by the pipeline is presented in this table. In the case of AMD
+Jaguar, there are two register files, one for floating-point registers
+(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
+the 900 instructions processed, there were 900 mappings created. Since this
+dot-product example utilized only floating point registers, the JFpuPRF was
+responsible for creating the 900 mappings. However, we see that the pipeline
+only used a maximum of 35 of 72 available register slots at any given time. We
+can conclude that the floating point PRF was the only register file used for
+the example, and that it was never resource constrained. The register file
+statistics are displayed by using the command option ``-all-stats`` or
+``-register-file-stats``.
+
+In this example, we can conclude that the IPC is mostly limited by data
+dependencies, and not by resource pressure.
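
The histograms in the ``-all-stats`` output above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the commit or of llvm-mca itself: the names ``DISPATCHED``, ``ISSUED``, ``RETIRED`` and ``summarize`` are hypothetical, and the numbers are copied by hand from the example output rather than parsed from the tool.

```python
# Histograms copied by hand from the -all-stats example output:
# {instructions handled in a cycle: number of cycles}.
# All names here are illustrative; llvm-mca does not expose this API.
DISPATCHED = {0: 24, 1: 272, 2: 314}
ISSUED = {0: 7, 1: 306, 2: 297}
RETIRED = {0: 109, 1: 102, 2: 399}


def summarize(histogram):
    """Return total cycles, total instructions and per-bucket percentages."""
    cycles = sum(histogram.values())
    instructions = sum(n * count for n, count in histogram.items())
    percentages = {
        n: round(100.0 * count / cycles, 1) for n, count in histogram.items()
    }
    return {
        "cycles": cycles,
        "instructions": instructions,
        "percentages": percentages,
    }


if __name__ == "__main__":
    for name, hist in (
        ("dispatched", DISPATCHED),
        ("issued", ISSUED),
        ("retired", RETIRED),
    ):
        stats = summarize(hist)
        print(name, stats["cycles"], stats["instructions"], stats["percentages"])
```

Each of the three histograms sums to the same 610 simulated cycles and accounts for the same 900 instructions, which matches the figures quoted in the text. The per-bucket percentages produced this way agree with the ones printed next to each row.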