[llvm] r337648 - [llvm-mca][docs] Add documentation for the statistic outputs from mca. NFC

Sat Jul 21 11:32:47 PDT 2018

Author: mattd
Date: Sat Jul 21 11:32:47 2018
New Revision: 337648

URL: http://llvm.org/viewvc/llvm-project?rev=337648&view=rev
Log:
[llvm-mca][docs] Add documentation for the statistic outputs from mca. NFC

Summary: The original text was lifted from the MCA README.  I re-ran the dot-product example and updated the output seen in the docs.  I also added a few paragraphs discussing the instruction issued and retired histograms, as well as discussing the register file stats.

Reviewers: andreadb, RKSimon, courbet, gbedwell, filcab

Reviewed By: andreadb

Subscribers: tschuett

Differential Revision: https://reviews.llvm.org/D49614

Modified:
    llvm/trunk/docs/CommandGuide/llvm-mca.rst

Modified: llvm/trunk/docs/CommandGuide/llvm-mca.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CommandGuide/llvm-mca.rst?rev=337648&r1=337647&r2=337648&view=diff
==============================================================================

--- llvm/trunk/docs/CommandGuide/llvm-mca.rst (original)
+++ llvm/trunk/docs/CommandGuide/llvm-mca.rst Sat Jul 21 11:32:47 2018
@@ -305,9 +305,9 @@ spent on average every iteration. The se
 cycles to the machine instruction in the sequence. For example, every iteration
 of the instruction vmulps always executes on resource unit [6]
 (JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
-per iteration.  Note that on Jaguar, vector floating-point multiply can only be
-issued to pipeline JFPU1, while horizontal floating-point additions can only be
-issued to pipeline JFPU0.
+per iteration.  Note that on AMD Jaguar, vector floating-point multiply can
+only be issued to pipeline JFPU1, while horizontal floating-point additions can
+only be issued to pipeline JFPU0.
 
 The resource pressure view helps with identifying bottlenecks caused by high
 usage of specific hardware resources.  Situations with resource pressure mainly
@@ -427,3 +427,125 @@ instructions.  When performance is mostl
 resources, the delta between the two counters is small.  However, the number of
 cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
 especially when compared to other low latency instructions.
+
+Extra Statistics to Further Diagnose Performance Issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The ``-all-stats`` command line option enables extra statistics and performance
+counters for the dispatch logic, the reorder buffer, the retire control unit,
+and the register file.
+
+Below is an example of ``-all-stats`` output generated by MCA for the
+dot-product example discussed in the previous sections.
+
+.. code-block:: none
+
+  Dynamic Dispatch Stall Cycles:
+  RAT     - Register unavailable:                      0
+  RCU     - Retire tokens unavailable:                 0
+  SCHEDQ  - Scheduler full:                            272
+  LQ      - Load queue full:                           0
+  SQ      - Store queue full:                          0
+  GROUP   - Static restrictions on the dispatch group: 0
+
+
+  Dispatch Logic - number of cycles where we saw N instructions dispatched:
+  [# dispatched], [# cycles]
+   0,              24  (3.9%)
+   1,              272  (44.6%)
+   2,              314  (51.5%)
+
+
+  Schedulers - number of cycles where we saw N instructions issued:
+  [# issued], [# cycles]
+   0,          7  (1.1%)
+   1,          306  (50.2%)
+   2,          297  (48.7%)
+
+
+  Scheduler's queue usage:
+  JALU01,  0/20
+  JFPU01,  18/18
+  JLSAGU,  0/12
+
+
+  Retire Control Unit - number of cycles where we saw N instructions retired:
+  [# retired], [# cycles]
+   0,           109  (17.9%)
+   1,           102  (16.7%)
+   2,           399  (65.4%)
+
+
+  Register File statistics:
+  Total number of mappings created:    900
+  Max number of mappings used:         35
+
+  *  Register File #1 -- JFpuPRF:
+     Number of physical registers:     72
+     Total number of mappings created: 900
+     Max number of mappings used:      35
+
+  *  Register File #2 -- JIntegerPRF:
+     Number of physical registers:     64
+     Total number of mappings created: 0
+     Max number of mappings used:      0
+
+If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
+SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
+logic is unable to dispatch a group of two instructions because the scheduler's
+queue is full.
+
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able
+to dispatch two instructions 51.5% of the time.  The dispatch group was limited
+to one instruction 44.6% of the cycles, which corresponds to 272 cycles.  The
+dispatch statistics are displayed by either using the command option
+``-all-stats`` or ``-dispatch-stats``.
+
+The next table, *Schedulers*, presents a histogram displaying a count,
+representing the number of instructions issued on some number of cycles.  In
+this case, of the 610 simulated cycles, single
+instructions were issued 306 times (50.2%) and there were 7 cycles where
+no instructions were issued.
+
+The *Scheduler's queue usage* table shows that the maximum number of buffer
+entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
+reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
+three schedulers:
+
+* JALU01 - A scheduler for ALU instructions.
+* JFPU01 - A scheduler floating point operations.
+* JLSAGU - A scheduler for address generation.
+
+The dot-product is a kernel of three floating point instructions (a vector
+multiply followed by two horizontal adds).  That explains why only the floating
+point scheduler appears to be used.
+
+A full scheduler queue is either caused by data dependency chains or by a
+sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
+mitigated by rewriting the kernel using different instructions that consume
+different scheduler resources.  Schedulers with a small queue are less resilient
+to bottlenecks caused by the presence of long data dependencies.
+The scheduler statistics are displayed by
+using the command option ``-all-stats`` or ``-scheduler-stats``.
+
+The next table, *Retire Control Unit*, presents a histogram displaying a count,
+representing the number of instructions retired on some number of cycles.  In
+this case, of the 610 simulated cycles, two instructions were retired during
+the same cycle 399 times (65.4%) and there were 109 cycles where no
+instructions were retired.  The retire statistics are displayed by using the
+command option ``-all-stats`` or ``-retire-stats``.
+
+The last table presented is *Register File statistics*.  Each physical register
+file (PRF) used by the pipeline is presented in this table.  In the case of AMD
+Jaguar, there are two register files, one for floating-point registers
+(JFpuPRF) and one for integer registers (JIntegerPRF).  The table shows that of
+the 900 instructions processed, there were 900 mappings created.  Since this
+dot-product example utilized only floating point registers, the JFPuPRF was
+responsible for creating the 900 mappings.  However, we see that the pipeline
+only used a maximum of 35 of 72 available register slots at any given time. We
+can conclude that the floating point PRF was the only register file used for
+the example, and that it was never resource constrained.  The register file
+statistics are displayed by using the command option ``-all-stats`` or
+``-register-file-stats``.
+
+In this example, we can conclude that the IPC is mostly limited by data
+dependencies, and not by resource pressure.