[llvm] r338394 - [llvm-mca][docs] Always use `llvm-mca` in place of `MCA`.

Andrea Di Biagio via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 31 08:29:10 PDT 2018


Author: adibiagio
Date: Tue Jul 31 08:29:10 2018
New Revision: 338394

URL: http://llvm.org/viewvc/llvm-project?rev=338394&view=rev
Log:
[llvm-mca][docs] Always use `llvm-mca` in place of `MCA`.

Modified:
    llvm/trunk/docs/CommandGuide/llvm-mca.rst

Modified: llvm/trunk/docs/CommandGuide/llvm-mca.rst
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/docs/CommandGuide/llvm-mca.rst?rev=338394&r1=338393&r2=338394&view=diff
==============================================================================
--- llvm/trunk/docs/CommandGuide/llvm-mca.rst (original)
+++ llvm/trunk/docs/CommandGuide/llvm-mca.rst Tue Jul 31 08:29:10 2018
@@ -207,23 +207,23 @@ EXIT STATUS
 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
 to standard error, and the tool returns 1.
 
-HOW MCA WORKS
--------------
+HOW LLVM-MCA WORKS
+------------------
 
-MCA takes assembly code as input. The assembly code is parsed into a sequence
-of MCInst with the help of the existing LLVM target assembly parsers. The
-parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
-a performance report.
+:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
+into a sequence of MCInst with the help of the existing LLVM target assembly
+parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
+to generate a performance report.
 
 The Pipeline module simulates the execution of the machine code sequence in a
 loop of iterations (default is 100). During this process, the pipeline collects
 a number of execution related statistics. At the end of this process, the
 pipeline generates and prints a report from the collected statistics.
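
The number of simulated iterations can be changed on the command line.  Below is
a minimal sketch of such an invocation, assuming the ``-iterations`` option
described in the OPTIONS section of this document (the input file name follows
the dot-product example used below):

.. code-block:: bash

  # Simulate 300 iterations of the input assembly sequence instead of the
  # default 100.
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s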
 
-Here is an example of a performance report generated by MCA for a dot-product
-of two packed float vectors of four elements. The analysis is conducted for
-target x86, cpu btver2.  The following result can be produced via the following
-command using the example located at
+Here is an example of a performance report generated by the tool for a
+dot-product of two packed float vectors of four elements. The analysis is
+conducted for target x86, cpu btver2.  This result can be produced with the
+following command, using the example located at
 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
 
 .. code-block:: bash
@@ -316,7 +316,7 @@ pressure should be uniformly distributed
 
 Timeline View
 ^^^^^^^^^^^^^
-MCA's timeline view produces a detailed report of each instruction's state
+The timeline view produces a detailed report of each instruction's state
 transitions through an instruction pipeline.  This view is enabled by the
 command line option ``-timeline``.  As instructions transition through the
 various stages of the pipeline, their states are depicted in the view report.
@@ -331,7 +331,7 @@ These states are represented by the foll
 
 Below is the timeline view for a subset of the dot-product example located in
 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
-MCA using the following command:
+:program:`llvm-mca` using the following command:
 
 .. code-block:: bash
 
@@ -366,7 +366,7 @@ MCA using the following command:
   2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4
 
 The timeline view is interesting because it shows instruction state changes
-during execution.  It also gives an idea of how MCA processes instructions
+during execution.  It also gives an idea of how the tool processes instructions
 executed on the target, and how their timing information might be calculated.
 
 The timeline view is structured in two tables.  The first table shows
@@ -415,8 +415,8 @@ and therefore consuming temporary regist
 
 Table *Average Wait times* helps diagnose performance issues that are caused by
 the presence of long latency instructions and potentially long data dependencies
-which may limit the ILP.  Note that MCA, by default, assumes at least 1cy
-between the dispatch event and the issue event.
+which may limit the ILP.  Note that :program:`llvm-mca`, by default, assumes at
+least 1cy between the dispatch event and the issue event.
 
 When the performance is limited by data dependencies and/or long latency
 instructions, the number of cycles spent while in the *ready* state is expected
@@ -602,9 +602,9 @@ entries in the reorder buffer defaults t
 the target scheduling model.
 
 Instructions that are dispatched to the schedulers consume scheduler buffer
-entries.  MCA queries the scheduling model to determine the set of
-buffered resources consumed by an instruction.  Buffered resources are treated
-like scheduler resources.
+entries. :program:`llvm-mca` queries the scheduling model to determine the set
+of buffered resources consumed by an instruction.  Buffered resources are
+treated like scheduler resources.
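
Stalls caused by a full scheduler buffer (or a full reorder buffer) can be
inspected directly.  A minimal sketch, assuming the ``-dispatch-stats`` view
described in the OPTIONS section of this document:

.. code-block:: bash

  # Print the dispatch statistics view, which counts dispatch stall cycles
  # and groups them by cause (for example, no scheduler or reorder buffer
  # entries available).
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -dispatch-stats dot-product.s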
 
 Instruction Issue
 """""""""""""""""
@@ -612,22 +612,21 @@ Each processor scheduler implements a bu
 has to wait in the scheduler's buffer until input register operands become
 available.  Only at that point does the instruction become eligible for
 execution and may be issued (potentially out-of-order) for execution.
-Instruction latencies are computed by MCA with the help of the scheduling
-model.
+Instruction latencies are computed by :program:`llvm-mca` with the help of the
+scheduling model.
 
-MCA's scheduler is designed to simulate multiple processor schedulers.  The
-scheduler is responsible for tracking data dependencies, and dynamically
-selecting which processor resources are consumed by instructions.
-
-The scheduler delegates the management of processor resource units and resource
-groups to a resource manager.  The resource manager is responsible for
-selecting resource units that are consumed by instructions.  For example, if an
-instruction consumes 1cy of a resource group, the resource manager selects one
-of the available units from the group; by default, the resource manager uses a
+:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
+schedulers.  The scheduler is responsible for tracking data dependencies, and
+dynamically selecting which processor resources are consumed by instructions.
+It delegates the management of processor resource units and resource groups to a
+resource manager.  The resource manager is responsible for selecting resource
+units that are consumed by instructions.  For example, if an instruction
+consumes 1cy of a resource group, the resource manager selects one of the
+available units from the group; by default, the resource manager uses a
 round-robin selector to guarantee that resource usage is uniformly distributed
 between all units of a group.
 
-MCA's scheduler implements three instruction queues:
+:program:`llvm-mca`'s scheduler implements three instruction queues:
 
 * WaitQueue: a queue of instructions whose operands are not ready.
 * ReadyQueue: a queue of instructions ready to execute.
@@ -638,8 +637,8 @@ scheduler are either placed into the Wai
 
 Every cycle, the scheduler checks if instructions can be moved from the
 WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
-issued.  The algorithm prioritizes older instructions over younger
-instructions.
+issued to the underlying pipelines. The algorithm prioritizes older instructions
+over younger instructions.
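
Scheduler queue usage during the simulation can be examined with the scheduler
statistics view.  A minimal sketch, assuming the ``-scheduler-stats`` flag
described in the OPTIONS section of this document:

.. code-block:: bash

  # Print scheduler statistics, including how many instructions were issued
  # per cycle and the usage of the simulated scheduler queues.
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -scheduler-stats dot-product.s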
 
 Write-Back and Retire Stage
 """""""""""""""""""""""""""
@@ -656,15 +655,13 @@ for the instruction during the register
 
 Load/Store Unit and Memory Consistency Model
 """"""""""""""""""""""""""""""""""""""""""""
-To simulate an out-of-order execution of memory operations, MCA utilizes a
-simulated load/store unit (LSUnit) to simulate the speculative execution of
-loads and stores.
-
-Each load (or store) consumes an entry in the load (or store) queue.  The
-number of slots in the load/store queues is unknown by MCA, since there is no
-mention of it in the scheduling model.  In practice, users can specify flags
-``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
-store queues respectively.  The queues are unbounded by default.
+To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
+utilizes a simulated load/store unit (LSUnit) to simulate the speculative
+execution of loads and stores.
+
+Each load (or store) consumes an entry in the load (or store) queue. Users can
+specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
+load and store queues respectively. The queues are unbounded by default.
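
For example, a run that bounds both queues might look like the following (a
minimal sketch; the queue sizes chosen here are arbitrary):

.. code-block:: bash

  # Limit the simulated load queue to 16 entries and the store queue to 8
  # entries instead of leaving them unbounded.
  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -lqueue=16 -squeue=8 dot-product.s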
 
 The LSUnit implements a relaxed consistency model for memory loads and stores.
 The rules are:
@@ -701,15 +698,15 @@ cache.  It only knows if an instruction
 loads, the scheduling model provides an "optimistic" load-to-use latency (which
 usually matches the load-to-use latency for when there is a hit in the L1D).
 
-MCA does not know about serializing operations or memory-barrier like
-instructions.  The LSUnit conservatively assumes that an instruction which has
-both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
-That means, it serializes loads without forcing a flush of the load queue.
-Similarly, instructions that "MayStore" and have unmodeled side effects are
-treated like store barriers.  A full memory barrier is a "MayLoad" and
-"MayStore" instruction with unmodeled side effects.  This is inaccurate, but it
-is the best that we can do at the moment with the current information available
-in LLVM.
+:program:`llvm-mca` does not know about serializing operations or
+memory-barrier-like instructions.  The LSUnit conservatively assumes that an
+instruction which has both "MayLoad" and unmodeled side effects behaves like a
+"soft" load-barrier.  That means it serializes loads without forcing a flush of
+the load queue.  Similarly, instructions that "MayStore" and have unmodeled side
+effects are treated like store barriers.  A full memory barrier is a "MayLoad"
+and "MayStore" instruction with unmodeled side effects.  This is inaccurate, but
+it is the best that we can do at the moment with the current information
+available in LLVM.
 
 A load/store barrier consumes one entry of the load/store queue.  A load/store
 barrier enforces ordering of loads/stores.  A younger load cannot pass a load



