This looks absolutely awesome. Thank you for sharing your work; I can see many
ways this will be useful.

Philip

On 03/01/2018 09:22 AM, Andrea Di Biagio via llvm-dev wrote:
<div dir="ltr">Hi all,<br>
<br>
At Sony we developed an LLVM based performance analysis tool
named llvm-mca. We<br>
currently use it internally to statically measure the
performance of code, and<br>
to help triage potential problems with target scheduling
models. We decided to<br>
post this RFC because we are interested in the feedback from the
community, and<br>
we also believe that other people might be interested in a tool
like this.<br>

llvm-mca uses information which is already available in LLVM (e.g. scheduling
models) to statically measure the performance of machine code on a specific
CPU. Performance is measured in terms of throughput as well as processor
resource consumption. The tool currently works for processors with an
out-of-order backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help diagnose potential performance issues.

Given an assembly code sequence, llvm-mca estimates the IPC (instructions per
cycle), as well as hardware resource pressure. The analysis and reporting style
were inspired by the IACA tool from Intel.

The presence of long data dependency chains, as well as poor usage of hardware
resources, may lead to bottlenecks in the back-end. The tool is able to
generate a detailed report which should help with identifying and analyzing
sources of bottlenecks.

Scheduling models are mostly used to compute instruction latencies, to obtain
read-advance information, and to understand how processor resources are used
by instructions. By design, the quality of the performance analysis conducted
by the tool is inevitably affected by the quality of the target scheduling
models.

However, scheduling models intentionally do not describe all processor
details, since the goal is just to enable the scheduling of machine
instructions during compilation. That means there are processor details which
are not important for the purpose of scheduling instructions (and therefore
not described by the scheduling model), but are very important for this tool.

A few examples of details that are missing in scheduling models are:
- Maximum number of instructions retired per cycle.
- Actual dispatch width (it often differs from the issue width).
- Number of temporary registers available for renaming.
- Number of read/write ports in the register file(s).
- Length of the load/store queue in the LSUnit.

It is also very difficult to find a "good" abstract model to describe the
behavior of out-of-order processors. So, we have to keep in mind that all of
these aspects are going to affect the quality of the static analysis performed
by the tool.

An extensive list of known limitations is reported in one of the last sections
of this document. There is also a section related to design problems which
must be addressed (hopefully with the help of the community). At the moment,
the tool has been mostly tested for x86 targets, but there are still several
limitations, some of which could be overcome by integrating extra information
into the scheduling models.

As mentioned before, this tool has been (and is still being) used internally
at Sony to debug/triage issues in the btver2 scheduling model. We have also
tested it on other targets to check how generic the tool is. In our
experience, the tool makes it easy to identify simple mistakes like "wrong
number of micro opcodes specified for an instruction", or "wrong set of
hardware resources". Some of these mistakes are quite common (sometimes even
in mature models), and often difficult to catch. Reports generated by this
tool are simple to analyze, and contain enough details to help triage most
performance problems.

1. How the tool works
---------------------

The tool takes assembly code as input. Assembly code is parsed into a sequence
of MCInst with the help of the existing LLVM target assembly parsers. The
parsed sequence of MCInst is then analyzed by a 'Backend' module to generate a
performance report.

The Backend module internally emulates the execution of the machine code
sequence in a loop of iterations (which by default is 70). At the end of this
process, the backend collects a number of statistics which are then printed
out in the form of a report.
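
As a concrete usage sketch: assuming the three-instruction kernel shown below
is saved in a file named 'dot-product.s', the report that follows might be
produced by an invocation like the one here (the file name and the exact flag
spellings are assumptions made for illustration, not documented interface):

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s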

Here is an example of the performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2:

///////////////////

Iterations:     300
Instructions:   900
Total Cycles:   610
Dispatch Width: 2
IPC:            1.48


Resources:
[0] - JALU0
[1] - JALU1
[2] - JDiv
[3] - JFPM
[4] - JFPU0
[5] - JFPU1
[6] - JLAGU
[7] - JSAGU
[8] - JSTC
[9] - JVIMUL


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -      -      -     2.00   1.00    -      -      -      -

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -      -      -     1.00    -      -      -      -     vmulps %xmm0, %xmm1, %xmm2
 -      -      -      -     1.00    -      -      -      -      -     vhaddps %xmm2, %xmm2, %xmm3
 -      -      -      -     1.00    -      -      -      -      -     vhaddps %xmm3, %xmm3, %xmm4


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     1.00                        vmulps %xmm0, %xmm1, %xmm2
 1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3
 1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4

///////////////////

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 instructions dynamically executed.

The report is structured in three main sections. A first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. In this example, the two important
performance indicators are a) the predicted total number of cycles, and b) the
IPC. IPC is probably the most important throughput indicator. A big delta
between the Dispatch Width and the computed IPC is an indicator of potential
performance issues.
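
To make the arithmetic explicit: 900 dynamically executed instructions over
610 predicted cycles gives IPC = 900 / 610 = ~1.48. Since the dispatch width
is 2, the ideal IPC would be 2.0, so this kernel sustains roughly 74% of the
ideal dispatch throughput.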

The second section is the so-called "resource pressure view". This view
reports the average number of resource cycles consumed every iteration by
instructions, for every processor resource unit available on the target.
Information is structured in two tables. The first table reports the number of
resource cycles spent on average every iteration. The second table correlates
the resource cycles to the machine instructions in the sequence. For example,
on every iteration of the dot-product, instruction 'vmulps' always executes on
resource unit [5] (JFPU1 - floating point pipeline #1), consuming an average
of 1 resource cycle per iteration. Note that on Jaguar, vector FP multiply can
only be issued to pipeline JFPU1, while horizontal FP adds can only be issued
to pipeline JFPU0.

The third (and last) section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode
properties (i.e. 'MayLoad', 'MayStore' and 'UnmodeledSideEffects').

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
-------------

A detailed report of each instruction's state transitions over time can be
enabled using the command line flag '-timeline'. This prints an extra section
in the report which contains the so-called "timeline view". Below is the
timeline view for the dot-product example from the previous section.

///////////////
Timeline view:
                012345
Index 0123456789

[0,0] DeeER.    .    .   vmulps %xmm0, %xmm1, %xmm2
[0,1] D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3
[0,2] .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4

[1,0] .DeeE-----R    .   vmulps %xmm0, %xmm1, %xmm2
[1,1] . D=eeeE---R   .   vhaddps %xmm2, %xmm2, %xmm3
[1,2] . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4

[2,0] .  DeeE-----R  .   vmulps %xmm0, %xmm1, %xmm2
[2,1] .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3
[2,2] .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4


Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     3     1.0    1.0    3.3   vmulps %xmm0, %xmm1, %xmm2
1.     3     3.3    0.7    1.0   vhaddps %xmm2, %xmm2, %xmm3
2.     3     5.7    0.0    0.0   vhaddps %xmm3, %xmm3, %xmm4
///////////////

The timeline view is very interesting because it shows how instructions change
state during execution. It also gives an idea of how the tool "sees"
instructions executed on the target.

The timeline view is structured in two tables. The first table shows how
instructions change state over time (measured in cycles); the second table
(named "Average Wait times") reports useful timing statistics which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the 'first' index identifies an iteration, and the 'second' index is the
actual instruction index (i.e. where it appears in the code sequence).

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0. The following characters are
used to describe the state of an instruction:

D : Instruction dispatched.
e : Instruction executing.
E : Instruction executed.
R : Instruction retired.
= : Instruction already dispatched, waiting to be executed.
- : Instruction executed, waiting to be retired.

Based on the timeline view from the example, we know that:
- Instruction [1, 0] was dispatched at cycle 1.
- Instruction [1, 0] started executing at cycle 2.
- Instruction [1, 0] reached the write back stage at cycle 4.
- Instruction [1, 0] was retired at cycle 10.

Instruction [1, 0] (i.e. the vmulps from iteration #1) doesn't have to wait in
the scheduler's queue for its operands to become available. By the time the
vmulps is dispatched, its operands are already available, and pipeline JFPU1
is ready to serve another instruction. So the instruction can be immediately
issued on the JFPU1 pipeline. That is demonstrated by the fact that the
instruction only spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1, 0] has to
wait for [0, 2] to be retired first (i.e. it has to wait until cycle 10).

In the dot-product example, all instructions are in a RAW (Read After Write)
dependency chain. Register %xmm2 written by the vmulps is immediately used by
the first vhaddps, and register %xmm3 written by the first vhaddps is used by
the second vhaddps. Long data dependencies negatively affect the ILP
(Instruction Level Parallelism).

In the dot-product example, there are also anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming temporary registers).

Table "Average Wait times" helps diagnose performance issues that are caused
by the presence of long latency instructions and potentially long data
dependencies which may limit the ILP. Note that the tool by default assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the "ready" state is
expected to be very small when compared with the total number of cycles spent
in the scheduler's queue. So the difference between the two counters is a good
indicator of how big of an impact data dependencies had on the execution of
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be bigger (i.e. more than 1-3cy),
especially when compared with other low latency instructions.
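
The dot-product numbers illustrate this: the second vhaddps spends on average
5.7 cycles in the scheduler's queue (column [1]) but is ready for only 0.0 of
them (column [2]), so almost all of its queue time is spent waiting for
operands. That is the signature of a dependency-limited kernel rather than a
resource-limited one.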

Extra statistics to further diagnose performance issues
-------------------------------------------------------

Flag '-verbose' enables extra statistics and performance counters for the
dispatch logic, the reorder buffer, the retire control unit and the register
file.

Below is an example of the verbose output generated by the tool for the
dot-product example discussed in the previous sections.

///////////////////
Iterations:     300
Instructions:   900
Total Cycles:   610
Dispatch Width: 2
IPC:            1.48


Dynamic Dispatch Stall Cycles:
RAT    - Register unavailable:                      0
RCU    - Retire tokens unavailable:                 0
SCHEDQ - Scheduler full:                            272
LQ     - Load queue full:                           0
SQ     - Store queue full:                          0
GROUP  - Static restrictions on the dispatch group: 0


Register Alias Table:
Total number of mappings created: 900
Max number of mappings used:      35


Dispatch Logic - number of cycles where we saw N instructions dispatched:
[# dispatched], [# cycles]
 0,              24  (3.9%)
 1,              272 (44.6%)
 2,              314 (51.5%)


Schedulers - number of cycles where we saw N instructions issued:
[# issued], [# cycles]
 0,          7   (1.1%)
 1,          306 (50.2%)
 2,          297 (48.7%)


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           109 (17.9%)
 1,           102 (16.7%)
 2,           399 (65.4%)


Scheduler's queue usage:
JALU01,  0/20
JFPU01,  18/18
JLSAGU,  0/12
///////////////////

Based on the verbose report, the backend was only able to dispatch two
instructions 51.5% of the time. The dispatch group was limited to one
instruction 44.6% of the cycles, which corresponds to 272 cycles.

If we look at section "Dynamic Dispatch Stall Cycles", we can see how counter
SCHEDQ reports 272 cycles. Counter SCHEDQ is incremented every time the
dispatch logic is unable to dispatch a full group of two instructions because
the scheduler's queue is full.

Section "Scheduler's queue usage" shows that the number of buffer entries
(i.e. scheduler's queue entries) used at runtime by resource JFPU01 reached
its maximum (18 of 18 entries). Note that AMD Jaguar implements three
schedulers:
* JALU01 - A scheduler for ALU instructions.
* JLSAGU - A scheduler for address generation.
* JFPU01 - A scheduler for floating point operations.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used according to section "Scheduler's queue
usage".

A full scheduler's queue is either caused by data dependency chains, or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less
resilient to bottlenecks caused by the presence of long data dependencies.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

LLVM-MCA instruction flow
-------------------------

This section describes the instruction flow through the out-of-order backend,
as well as the functional units involved in the process.

An instruction goes through a default sequence of stages:
- Dispatch (Instruction is dispatched to the schedulers).
- Issue (Instruction is issued to the processor pipelines).
- Write Back (Instruction is executed, and results are written back).
- Retire (Instruction is retired; writes are architecturally committed).

The tool only models the out-of-order portion of a processor. Therefore, the
instruction fetch and decode stages are not modeled. Performance bottlenecks
in the frontend are not diagnosed by this tool. The tool assumes that
instructions have all been decoded and placed in a queue. Also, the tool
doesn't know anything about branch prediction.

Instruction Dispatch
--------------------

During the Dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
hardware schedulers. The dispatch logic is implemented by class DispatchUnit
in file Dispatch.h.

The size of a dispatch group depends on the availability of hardware
resources, and it cannot exceed the value of field 'DispatchWidth' in class
DispatchUnit. Note that field DispatchWidth defaults to the value of field
'IssueWidth' from the scheduling model.

Users can override the DispatchWidth value with flag "-dispatch=<N>" (where
'N' is an unsigned quantity).

An instruction can be dispatched if all of the following conditions hold (a
minimal sketch of these checks follows the list):
- The size of the dispatch group is smaller than DispatchWidth.
- There are enough entries in the reorder buffer.
- There are enough temporary registers to do register renaming.
- Schedulers are not full.
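
The snippet below is a simplified, illustrative model of those four checks; it
is not the actual DispatchUnit code, and all the names in it are assumptions
made for the example:

  // Hypothetical snapshot of the dispatch-relevant state for one cycle.
  struct DispatchState {
    unsigned UsedGroupSlots;   // micro opcodes already dispatched this cycle
    unsigned DispatchWidth;    // maximum size of a dispatch group
    unsigned FreeROBEntries;   // available reorder buffer entries
    unsigned FreeTemporaries;  // registers available for renaming
    bool SchedulerQueueFull;   // is the target scheduler's queue full?
  };

  // An instruction can be dispatched only if all four conditions hold.
  bool canDispatch(const DispatchState &S, unsigned NumMicroOps,
                   unsigned NumRegisterWrites) {
    return S.UsedGroupSlots + NumMicroOps <= S.DispatchWidth &&
           S.FreeROBEntries >= NumMicroOps &&
           S.FreeTemporaries >= NumRegisterWrites &&
           !S.SchedulerQueueFull;
  }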

Scheduling models don't describe register files, and therefore the tool
doesn't know if there is more than one register file, and how many temporaries
are available for register renaming.

By default, the tool (optimistically) assumes a single register file with an
unbounded number of temporary registers. Users can limit the number of
temporary registers available for register renaming using flag
`-register-file-size=<N>`, where N is the number of temporaries. A value of
zero for N means 'unbounded'. Knowing how many temporaries are available for
register renaming, the tool can predict dispatch stalls caused by the lack of
temporaries.

The number of reorder buffer entries consumed by an instruction depends on the
number of micro opcodes it specifies in the target scheduling model (see field
'NumMicroOpcodes' of tablegen class ProcWriteResources and its derived
classes; TargetSchedule.td).
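
For reference, this is roughly how the micro opcode count can be read back
from the MC layer (a sketch based on existing LLVM MC APIs; error handling and
variant scheduling classes are deliberately ignored here):

  #include "llvm/MC/MCInstrDesc.h"
  #include "llvm/MC/MCSchedule.h"
  #include "llvm/MC/MCSubtargetInfo.h"

  using namespace llvm;

  // Number of micro opcodes (and, in the model described above, the number
  // of reorder buffer entries) consumed by an opcode.
  unsigned getNumMicroOps(const MCSubtargetInfo &STI,
                          const MCInstrDesc &MCDesc) {
    const MCSchedModel &SM = STI.getSchedModel();
    const MCSchedClassDesc &SCDesc =
        *SM.getSchedClassDesc(MCDesc.getSchedClass());
    return SCDesc.NumMicroOps;
  }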

The reorder buffer is implemented by class RetireControlUnit (see Dispatch.h).
Its goal is to track the progress of instructions that are "in-flight", and to
retire instructions in program order. The number of entries in the reorder
buffer defaults to the value of field 'MicroOpBufferSize' from the target
scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. The tool queries the scheduling model to figure out the set of
buffered resources consumed by an instruction. Buffered resources are treated
like "scheduler" resources, and the field 'BufferSize' (from the processor
resource tablegen definition) defines the size of the scheduler's queue.

Zero latency instructions (for example NOP instructions) don't consume
scheduler resources. However, those instructions still reserve a number of
slots in the reorder buffer.

Instruction Issue
-----------------

As mentioned in the previous section, each scheduler resource implements a
queue of instructions. An instruction has to wait in the scheduler's queue
until its input register operands become available. Only at that point does
the instruction become eligible for execution, and it may be issued
(potentially out-of-order) to a pipeline for execution.

Instruction latencies can be computed by the tool with the help of the
scheduling model; latency values are defined by the scheduling model through
ProcWriteResources objects.
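
For example, a per-instruction latency can be derived from the write latency
entries of the instruction's scheduling class. The sketch below uses existing
MC APIs and simply takes the maximum latency across all register writes
(invalid, i.e. negative, entries are not handled here):

  #include "llvm/MC/MCSchedule.h"
  #include "llvm/MC/MCSubtargetInfo.h"
  #include <algorithm>

  using namespace llvm;

  // Max write latency across the register definitions of a scheduling class.
  int computeLatency(const MCSubtargetInfo &STI,
                     const MCSchedClassDesc &SCDesc) {
    int Latency = 0;
    for (unsigned I = 0, E = SCDesc.NumWriteLatencyEntries; I != E; ++I) {
      const MCWriteLatencyEntry *WLE = STI.getWriteLatencyEntry(&SCDesc, I);
      Latency = std::max(Latency, static_cast<int>(WLE->Cycles));
    }
    return Latency;
  }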

Class Scheduler (see file Scheduler.h) knows how to emulate multiple processor
schedulers. A Scheduler is responsible for tracking data dependencies, and for
dynamically selecting which processor resources are consumed/used by
instructions.

Internally, the Scheduler class delegates the management of processor resource
units and resource groups to the ResourceManager class. ResourceManager is
also responsible for selecting resource units that are effectively consumed by
instructions. For example, if an instruction consumes 1cy of a resource group,
the ResourceManager object selects one of the available units from the group;
by default, it uses a round-robin selector to guarantee that resource usage is
uniformly distributed between all units of a group.
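
A round-robin selection over the units of a group could look like the sketch
below. This only illustrates the policy; the bitmask encoding and the class
name are assumptions, not the tool's actual data layout:

  #include <cassert>
  #include <cstdint>

  // Picks units from a resource group in round-robin order, so that usage
  // is uniformly distributed between all units of the group.
  class RoundRobinSelector {
    uint64_t GroupUnits; // one bit per unit in the group
    uint64_t NextUnit;   // unit to try first on the next selection

  public:
    explicit RoundRobinSelector(uint64_t Units)
        : GroupUnits(Units), NextUnit(Units & ~(Units - 1)) {
      assert(Units && "empty resource group!");
    }

    // Returns the mask of the selected unit, then advances the round.
    uint64_t select(uint64_t AvailableUnits) {
      assert((AvailableUnits & GroupUnits) && "no unit available!");
      uint64_t Candidate = NextUnit;
      while (!(Candidate & AvailableUnits & GroupUnits)) {
        Candidate <<= 1;
        if (!Candidate)
          Candidate = 1; // wrap around to the lowest unit
      }
      NextUnit = (Candidate << 1) ? (Candidate << 1) : 1;
      return Candidate;
    }
  };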

Internally, class Scheduler implements three instruction queues:
- WaitQueue: a queue of instructions whose operands are not ready yet.
- ReadyQueue: a queue of instructions ready to execute.
- IssuedQueue: a queue of instructions executing.

Depending on the availability of their operands, instructions that are
dispatched to the Scheduler are either placed into the WaitQueue or into the
ReadyQueue.

Every cycle, class Scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
issued to the underlying pipelines. The algorithm prioritizes older
instructions over younger instructions.
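
In pseudo-C++ terms, the per-cycle bookkeeping might look like the sketch
below (the Instruction type and its fields are illustrative placeholders, not
the tool's actual classes):

  #include <algorithm>
  #include <vector>

  struct Instruction {
    unsigned SourceIndex;    // program order; smaller means older
    bool OperandsReady;      // have all register inputs been written?
    bool PipelineAvailable;  // can a pipeline accept this instruction?
  };

  void cycleEvent(std::vector<Instruction *> &WaitQueue,
                  std::vector<Instruction *> &ReadyQueue,
                  std::vector<Instruction *> &IssuedQueue) {
    // Move instructions whose operands became available to the ReadyQueue.
    for (auto It = WaitQueue.begin(); It != WaitQueue.end();) {
      if ((*It)->OperandsReady) {
        ReadyQueue.push_back(*It);
        It = WaitQueue.erase(It);
      } else {
        ++It;
      }
    }
    // Issue ready instructions to the pipelines, prioritizing older ones.
    std::sort(ReadyQueue.begin(), ReadyQueue.end(),
              [](const Instruction *A, const Instruction *B) {
                return A->SourceIndex < B->SourceIndex;
              });
    for (auto It = ReadyQueue.begin(); It != ReadyQueue.end();) {
      if ((*It)->PipelineAvailable) {
        IssuedQueue.push_back(*It);
        It = ReadyQueue.erase(It);
      } else {
        ++It;
      }
    }
  }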

Objects of class ResourceState (see Scheduler.h) describe processor resources.
There is an instance of class ResourceState for each single processor resource
specified by the scheduling model. A ResourceState object for a processor
resource with multiple units dynamically tracks the availability of every
single unit. For example, the ResourceState of a resource group tracks the
availability of every resource in that group. Internally, ResourceState
implements a round-robin selector to dynamically pick the next unit to use
from the group.

Write-Back and Retire Stage
---------------------------

Issued instructions are moved from the ReadyQueue to the IssuedQueue. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.

On the event of "instruction executed", the retire control unit flags the
instruction as "ready to retire".

Instructions are retired in program order; an "instruction retired" event is
sent to the register file, which frees the temporary registers allocated for
the instruction at the register renaming stage.

Load/Store Unit and Memory Consistency Model
--------------------------------------------

The tool attempts to emulate out-of-order execution of memory operations.
Class LSUnit (see file LSUnit.h) emulates a load/store unit implementing
queues for speculative execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. The
number of slots in the load/store queues is unknown by the tool, since there
is no mention of it in the scheduling model. In practice, users can specify
flag `-lqueue=N` (respectively `-squeue=N`) to limit the number of entries in
the load (store) queue to exactly N (an unsigned value). If N is zero, then
the tool assumes an unbounded queue (this is the default).

LSUnit implements a relaxed consistency model for memory loads and stores. The
rules are:
1) A younger load is allowed to pass an older load only if there is no
   intervening store in between the two loads.
2) A younger store is not allowed to pass an older store.
3) A younger store is not allowed to pass an older load.
4) A younger load is allowed to pass an older store provided that the load
   does not alias with the store.

By default, this class conservatively (i.e. pessimistically) assumes that
loads may always alias store operations. Essentially, this LSUnit doesn't
perform any sort of alias analysis to rule out cases where loads and stores
don't overlap with each other. The downside of this approach, however, is that
younger loads are never allowed to pass older stores. To make it possible for
a younger load to pass an older store, users can use the command line flag
-noalias. Under 'noalias', a younger load is always allowed to pass an older
store.

Note that, in the case of write-combining memory, rule 2 could be relaxed a
bit to allow reordering of non-aliasing store operations. That being said, at
the moment, there is no way to further relax the memory model (flag -noalias
is the only option). Essentially, there is no option to specify a different
memory type (for example: write-back, write-combining, write-through, etc.)
and consequently to weaken or strengthen the memory model.

Other limitations are:
* LSUnit doesn't know when store-to-load forwarding may occur.
* LSUnit doesn't know anything about the cache hierarchy and memory types.
* LSUnit doesn't know how to identify serializing operations and memory
  fences.

No assumption is made on the store buffer size. As mentioned before, LSUnit
conservatively assumes a may-alias relation between loads and stores, and it
doesn't attempt to identify cases where store-to-load forwarding would occur
in practice.

LSUnit doesn't attempt to predict whether a load or store hits or misses the
L1 cache. It only knows if an instruction "MayLoad" and/or "MayStore". For
loads, the scheduling model provides an "optimistic" load-to-use latency
(which usually matches the load-to-use latency for when there is a hit in the
L1D).

Class MCInstrDesc in LLVM doesn't know about serializing operations, nor about
memory-barrier like instructions. LSUnit conservatively assumes that an
instruction which has both 'MayLoad' and 'UnmodeledSideEffects' behaves like a
"soft" load-barrier. That means, it serializes loads without forcing a flush
of the load queue. Similarly, instructions flagged with both 'MayStore' and
'UnmodeledSideEffects' are treated like store barriers. A full memory barrier
is a 'MayLoad' and 'MayStore' instruction with 'UnmodeledSideEffects'. This is
inaccurate, but it is the best that we can do at the moment with the current
information available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load has
to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means that, by construction, all the older loads/stores have been
executed.

In conclusion, the full set of rules is (a small sketch of how the load rules
could be checked follows the list):
1. A store may not pass a previous store.
2. A load may not pass a previous store unless flag 'NoAlias' is set.
3. A load may pass a previous load.
4. A store may not pass a previous load (regardless of flag 'NoAlias').
5. A load has to wait until an older load barrier is fully executed.
6. A store has to wait until an older store barrier is fully executed.
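
Applied to a load about to issue, the rules boil down to a simple scan of the
older in-flight memory operations. The sketch below illustrates rules 2, 3 and
5 only, with made-up types (it is not the actual LSUnit code):

  #include <vector>

  struct MemOp {
    bool IsStore;        // a 'MayStore' operation
    bool IsLoadBarrier;  // serializes loads (see the discussion above)
  };

  // Returns true if a newly dispatched load may issue ahead of all the
  // older memory operations still in flight.
  bool loadCanPass(const std::vector<MemOp> &OlderOps, bool NoAlias) {
    for (const MemOp &Op : OlderOps) {
      if (Op.IsLoadBarrier)
        return false;            // rule 5: wait until the barrier executes
      if (Op.IsStore && !NoAlias)
        return false;            // rule 2: may-alias stores block the load
    }
    return true;                 // rule 3: loads may pass older loads
  }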

Known limitations
-----------------
Previous sections described cases where the tool is missing information to
give an accurate report. For example, the first sections of this document
explained how the lack of knowledge about the processor negatively affects the
performance analysis. The lack of knowledge is often a consequence of how
scheduling models are defined; as mentioned before, scheduling models
intentionally don't describe processors in fine detail.

The accuracy of the performance analysis is also affected by assumptions made
by the processor model used by the tool.

Most recent Intel and AMD processors implement dedicated LoopBuffer/OpCache
structures in the hardware frontend to speed up throughput in the presence of
tight loops. The presence of these buffers complicates the decoding logic, and
requires knowledge of the branch predictor too. Class 'SchedMachineModel' in
tablegen provides a field named 'LoopMicroOpBufferSize' which is used to
describe loop buffers. However, the purpose of that field is to enable loop
unrolling of tight loops; essentially, it affects the cost model used by pass
loop-unroll.

By design, the tool only cares about the out-of-order portion of a processor,
and consequently doesn't try to predict the frontend throughput. Processors
may implement complex decoding schemes; statically predicting the frontend
throughput is in general beyond the scope of this tool. For the same reasons,
this tool intentionally doesn't model branch prediction. That being said, this
tool could definitely be extended in the future to also account for the
hardware frontend when doing performance analysis. This would inevitably
require extra (extensive) processor knowledge related to all the available
decoding paths in the hardware frontend.

When computing the IPC, the tool assumes a zero-latency "perfect" fetch&decode
stage; the full sequence of decoded instructions is immediately visible to the
dispatch logic from the start.

The tool doesn't know about simultaneous multithreading. According to the
tool, processor resources are not statically/dynamically partitioned;
processor resources are fully available to the hardware thread executing the
microbenchmark.

The execution model implemented by this tool assumes that instructions are
first dispatched in groups to hardware schedulers, and then issued to
pipelines for execution. The model assumes dynamic scheduling of instructions.
Instructions are placed in a queue and potentially executed out-of-order
(based on operand availability). The dispatch stage is distinct from the issue
stage.

This model doesn't correctly describe processors where dispatch/issue is a
single stage. This is what happens, for example, in VLIW processors, where
instructions are packaged and statically scheduled at compile time; it is up
to the compiler to predict the latency of instructions and package issue
groups accordingly. For such targets, there is no dynamic scheduling done by
the hardware.

Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted to
support processors with a single dispatch/issue stage. The execution flow
would require some changes in the way existing components (i.e. DispatchUnit,
Scheduler, etc.) interact. This can be a future development.

The following sections describe other known limitations. The goal is not to
provide an exhaustive list; we want to report what we believe are the most
important limitations, and to suggest possible methods to overcome them.

Load/Store barrier instructions and serializing operations
----------------------------------------------------------
Section "Load/Store Unit and Memory Consistency Model" already mentioned how
LLVM doesn't know about serializing operations and memory barriers. Most of it
boils down to the fact that class MCInstrDesc (intentionally) doesn't expose
those properties. Instead, both serializing operations and memory barriers
"have side-effects" according to MCInstrDesc. That is because, at least for
scheduling purposes, knowing that an instruction has unmodeled side effects is
often enough to treat the instruction like a compiler scheduling barrier.

A performance analysis tool could use the extra knowledge on barriers and
serializing operations to generate a more accurate performance report. One way
to improve this is by reserving a couple of bits in field 'Flags' from class
MCInstrDesc: one bit for barrier operations, and another bit to mark
instructions as serializing operations.

Lack of support for instruction itineraries
-------------------------------------------
The current version of the tool doesn't know how to process instruction
itineraries. This is probably one of the most important limitations, since it
affects a few out-of-order processors in LLVM.

As mentioned in section 'Instruction Issue', class Scheduler delegates to an
instance of class ResourceManager the handling of processor resources.
ResourceManager is where most of the scheduling logic is implemented.

Adding support for instruction itineraries requires that we teach
ResourceManager how to handle functional units and instruction stages. This
development can be a future extension, and it would probably require a few
changes to the ResourceManager interface.

Instructions that affect control flow are not correctly modeled
---------------------------------------------------------------
Examples of instructions that affect the control flow are: returns, indirect
branches, calls, etc. The tool doesn't try to predict/evaluate branch targets.
In particular, the tool doesn't model any sort of branch prediction, nor does
it attempt to track changes to the program counter. The tool always assumes
that the input assembly sequence is the body of a microbenchmark (a simple
loop executed for a number of iterations). The "next" instruction in the
sequence is always the next instruction to dispatch.

Call instructions default to an arbitrary high latency of 100cy. A warning is
generated if the tool encounters a call instruction in the sequence. Return
instructions are not evaluated, and therefore control flow is not affected.
However, the tool still queries the processor scheduling model to obtain
latency information for instructions that affect the control flow.

Possible extensions to the scheduling model
-------------------------------------------
Section "Instruction Dispatch" explained how the tool doesn't know about the
register files, and the number of temporaries available in each register file
for register renaming purposes.

The LLVM scheduling model could be extended to better describe register files.
Ideally, the scheduling model should be able to define:
- The size of each register file.
- How many temporary registers are available for register renaming.
- How register classes map to register files.

The scheduling model doesn't specify the retire throughput (i.e. how many
instructions can be retired every cycle). Users can specify flag
`-max-retire-per-cycle=<uint>` to limit how many instructions the retire
control unit can retire every cycle. Ideally, every processor should be able
to specify the retire throughput (for example, by adding an extra field to the
scheduling model tablegen class).

Known limitations on X86 processors
-----------------------------------

1) Partial register updates versus full register updates.

On x86-64, a 32-bit GPR write fully updates the super-register. Example:

  add %edi, %eax    ## eax += edi

Here, register %eax aliases the lower half of 64-bit register %rax. On x86-64,
register %rax is fully updated by the 'add' (the upper half of %rax is
zeroed). Essentially, it "kills" any previous definition of (the upper half
of) register %rax.

On the other hand, 8/16-bit register writes only perform a so-called "partial
register update". Example:

  add %di, %ax      ## ax += di

Here, register %eax is only partially updated. To be more specific, the lower
half of %eax is set, and the upper half is left unchanged. There is also no
change in the upper 48 bits of register %rax.

To get an accurate performance analysis, the tool has to know which
instructions perform a partial register update, and which instructions fully
update the destination's super-register.

One way to expose this information is (again) via tablegen. For example, we
could add a flag in the tablegen instruction class to tag instructions that
perform partial register updates. Something like this: 'bit
hasPartialRegisterUpdate = 1'. However, this would force a `let
hasPartialRegisterUpdate = 0` on several instruction definitions.

Another approach is to have a MCSubtargetInfo hook similar to this:

  virtual bool updatesSuperRegisters(unsigned short opcode) { return false; }

Targets will be able to override this method if needed. Again, this is just an
idea. But the plan is to have this fixed as a future development.

2) Macro Op fusion.

The tool doesn't know about macro-op fusion. On modern x86 processors, a
'cmp/test' followed by a 'jmp' is fused into a single macro operation. The
advantage is that the fused pair only consumes a single slot in the dispatch
group.

As a future development, the tool should be extended to address macro-fusion.
Ideally, we could have LLVM generate a table enumerating all the opcode pairs
that can be fused together. That table could be exposed to the tool via the
MCSubtargetInfo interface. This is just an idea; there may be better ways to
implement this.
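
For illustration only, such a table could be as simple as a sorted list of
opcode pairs searched when the tool builds dispatch groups. Everything below
(the table, the opcode placeholders, the function) is a made-up sketch of the
idea, not an existing LLVM interface:

  #include <algorithm>
  #include <iterator>
  #include <utility>

  using OpcodePair = std::pair<unsigned, unsigned>;

  // Hypothetical, tablegen-generated list of fusable opcode pairs,
  // sorted to allow binary search. Opcode values are placeholders.
  static const OpcodePair FusablePairs[] = {
      {/*CMP32rr*/ 100, /*JCC*/ 500},
      {/*TEST32rr*/ 200, /*JCC*/ 500},
  };

  // True if the two opcodes can be fused into a single macro operation.
  bool canFuse(unsigned FirstOpc, unsigned SecondOpc) {
    return std::binary_search(std::begin(FusablePairs),
                              std::end(FusablePairs),
                              OpcodePair(FirstOpc, SecondOpc));
  }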

3) Intel processors: mixing legacy SSE with AVX instructions.

On modern Intel processors with AVX, mixing legacy SSE code with AVX code
negatively impacts the performance. The tool is not aware of this issue, and
the performance penalty is not accounted for when doing the analysis. This is
something that we would like to improve in the future.

4) Zero-latency register moves and zero-idioms.

Most modern AMD/Intel processors know how to optimize out register-register
moves and zero idioms (for example, 'xorps %xmm0, %xmm0') at the register
renaming stage. The tool doesn't know about these patterns, and this may
negatively impact the performance analysis.

Known design problems
---------------------
This section describes two design issues that are currently affecting the
tool. The long term plan is to "fix" these issues.

1) Variant instructions are not correctly modeled.

The tool doesn't know how to analyze instructions with a "variant" scheduling
class descriptor. A variant scheduling class needs to be resolved dynamically:
the "actual" scheduling class often depends on the subtarget, as well as on
properties of the specific MachineInstr object.

Unfortunately, the tool manipulates MCInst, and it doesn't know anything about
MachineInstr. As a consequence, the tool cannot use the existing machine
subtarget hooks that are normally used to resolve the variant scheduling
class. This is a major design issue which mostly affects ARM/AArch64 targets.
It mostly boils down to the fact that the existing scheduling framework was
meant to work for MachineInstr.

When the tool encounters a "variant" instruction, it assumes a generic 1cy
latency. However, the tool is not able to tell which processor resources are
effectively consumed by the variant instruction.

2) MCInst and MCInstrDesc.

Performance analysis tools require data dependency information to correctly
predict the runtime performance of the code. This tool must always be able to
obtain the set of implicit/explicit register defs/uses for every instruction
of the input assembly sequence.

In the first section of this document, it was mentioned how the tool takes as
input an assembly sequence. That sequence is parsed into a MCInst sequence
with the help of assembly parsers available from the targets.

A MCInst is a very low-level instruction representation. The tool can inspect
the MCOperand sequence of an MCInst to identify register operands. However,
there is no way to distinguish register operands that are definitions from
register operands that are uses.

In LLVM, class MCInstrDesc is used to fully describe target instructions and
their operands. The opcode of a machine instruction (a MachineInstr object)
can be used to query the instruction set through method `MCInstrInfo::get` to
obtain the associated MCInstrDesc object.

However, class MCInstrDesc describes properties and operands of MachineInstr
objects. Essentially, MCInstrDesc is not meant to be used to describe MCInst
objects. To be more specific, MCInstrDesc objects are automatically generated
via tablegen from the instruction set description in the target .td files. For
example, field `MCInstrDesc::NumDefs` is always equal to the cardinality of
the `(outs)` set from the tablegen instruction definition.

By construction, register definitions always appear at the beginning of the
MachineOperands list in MachineInstr. Basically, the (outs) are the first
operands of a MachineInstr, and the (ins) come after in the machine operand
list. Knowing the number of register definitions is enough to identify all the
register operands that are definitions.
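
Under that assumption, classifying the register operands of a MCInst is a
simple scan. The sketch below uses existing MC APIs, but note that it encodes
exactly the (potentially incorrect) assumption discussed in this section:

  #include "llvm/MC/MCInst.h"
  #include "llvm/MC/MCInstrDesc.h"
  #include "llvm/MC/MCInstrInfo.h"
  #include <vector>

  using namespace llvm;

  // Split the explicit register operands of a MCInst into defs and uses,
  // trusting MCInstrDesc: defs are assumed to come first, and their count
  // is assumed to match MCInstrDesc::getNumDefs().
  void collectRegisterOperands(const MCInst &MCI, const MCInstrInfo &MCII,
                               std::vector<unsigned> &Defs,
                               std::vector<unsigned> &Uses) {
    const MCInstrDesc &MCDesc = MCII.get(MCI.getOpcode());
    unsigned NumDefs = MCDesc.getNumDefs();
    for (unsigned I = 0, E = MCI.getNumOperands(); I != E; ++I) {
      const MCOperand &Op = MCI.getOperand(I);
      if (!Op.isReg())
        continue;
      if (I < NumDefs)
        Defs.push_back(Op.getReg());
      else
        Uses.push_back(Op.getReg());
    }
  }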

In a normal compilation process, MCInst objects are generated from
MachineInstr objects through a lowering step. By default, the lowering logic
simply iterates over the machine operands of a MachineInstr, and
converts/expands them into equivalent MCOperand objects.

The default lowering strategy has the advantage of preserving all of the
above-mentioned assumptions on the machine operand sequence. That means
register definitions would still be at the beginning of the MCOperand
sequence, and register uses would come after.

Targets may still define custom lowering routines for specific opcodes. Some
of these routines may lower operands in a way that potentially breaks (some
of) the assumptions on the machine operand sequence which were valid for the
MachineInstr. Luckily, this is not the most common form of lowering done by
the targets, and the vast majority of MachineInstr are lowered based on the
default strategy which preserves the original machine operand sequence. This
is especially true for x86, where the custom lowering logic always preserves
the original (i.e. from the MachineInstr) operand sequence.

This tool currently works under the strong (and potentially incorrect)
assumption that register defs/uses in a MCInst can always be identified by
querying the machine instruction descriptor for the opcode. This assumption
made it possible to develop this tool and get good numbers at least for the
processors available in the x86 backend.

That being said, the analysis is still potentially incorrect for other
targets. So we plan (with the help of the community) to find a proper
mechanism to map, when possible, MCOperand indices back to MachineOperand
indices of the equivalent MachineInstr. This would be equivalent to describing
the changes made by the lowering step which affected the operand sequence. For
example, we could have an index for every register MCOperand (or -1, if the
operand didn't exist in the original MachineInstr). The mapping could look
like this: <0,1,3,2>. Here, MCOperand #2 was obtained from the lowering of
MachineOperand #3, and so on.

This information could be automatically generated via tablegen for all the
instructions whose custom lowering step breaks assumptions made by the tool on
the register operand sequence (in general, these instructions should be a
minority of a target's instruction set). Unfortunately, we don't have that
information now. As a consequence, we assume that the number of explicit
register definitions is the same number specified in MCInstrDesc. We also
assume that register definitions always come first in the operand sequence.

In conclusion, these are the strong assumptions currently made by the tool:
* The number of explicit and implicit register definitions in a MCInst
  matches the number of explicit and implicit definitions specified by the
  MCInstrDesc object.
* Register uses always come after register definitions.
* If an opcode specifies an optional definition, then the optional
  definition is always the last register operand in the sequence.

Note that some of the information accessible from the MCInstrDesc is always
valid for MCInst. For example: implicit register defs, implicit register uses,
and the 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties still
apply to MCInst. The tool knows about this, and uses that information during
its analysis.

What to do next
---------------
The source code has been uploaded for review on Phabricator at this link:
https://reviews.llvm.org/D43951.

The review covers two patches:
- A first (very small) patch that always enables the generation of processor
  resource names in the SubtargetEmitter. Currently, the processor resource
  names are only generated for debugging purposes, but they are needed by the
  tool to generate user-friendly reports, so we would like to always generate
  them.
- A second patch with the actual static analysis tool (in llvm/tools).

Once these first two patches are committed, the plan is to keep working on the
tool with the help of the community to address all of the limitations
described in the previous sections, and to find good solutions/fixes for the
design issues described in section "Known design problems".

We hope the community will find this tool as useful as we have.

Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell, who
really helped me a lot by suggesting improvements and testing the tool.

Thanks for your time.
-Andrea