<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - [SchedModel] Inconsistent use of WriteIMulH in the scheduling models."
href="https://bugs.llvm.org/show_bug.cgi?id=51495">51495</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[SchedModel] Inconsistent use of WriteIMulH in the scheduling models.
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Windows NT
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>andrea.dibiagio@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, pengfei.wang@intel.com, spatel+llvm@rotateright.com
</td>
</tr></table>
<p>
<div>
<pre>On X86, WriteIMulH provides scheduling information for the second register write
of a MUL/MULX.
However, it turns out that MULX is the only instruction that uses it. For all
other multiply instructions, we only provide a single "SchedWrite", even though
the number of register definitions for most variants is 3 (i.e. RAX, RDX,
EFLAGS).
Example:
Instructions MUL16r, MUL32r and MUL64r implicitly write two registers (AX/DX,
EAX/EDX and RAX/RDX respectively). For MUL32r/m, register EAX contains the LOW
part of the result, while register EDX contains the HIGH part of the result.
Example:
```
mulq %rdi
```
Example: -mcpu=haswell -debug-only=llvm-mca
```
[Def][I] OpIdx=0, PhysReg=RAX, Latency=4, WriteResourceID=0
[Def][I] OpIdx=1, PhysReg=RDX, Latency=4, WriteResourceID=0
[Def][I] OpIdx=2, PhysReg=EFLAGS, Latency=4, WriteResourceID=0
[Use] OpIdx=0, UseIndex=0
[Use][I] OpIdx=0, UseIndex=1, RegisterID=RAX
MaxLatency=4
NumMicroOps=2
```
By default, the definition of MUL64r only declares a single SchedWrite of
latency 4cy for the implicit write to register RAX.
There is no SchedWrite associated with RDX and EFLAGS. In the absence of
scheduling information, llvm-mca conservatively assumes a default latency of 4
cycles too (i.e. the max instruction latency for MUL64r), and no extra resource
consumption.
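The fallback described above can be sketched as follows. This is only an
illustration of the behaviour, not llvm-mca code: the helper name
def_latencies and its arguments are hypothetical.

```python
# Hypothetical helper, not llvm-mca code: models the fallback described above.
def def_latencies(num_defs, sched_write_latencies):
    """Return one latency per register definition.

    Definitions without a matching SchedWrite fall back to the
    maximum latency declared by the instruction.
    """
    max_latency = max(sched_write_latencies)
    known = list(sched_write_latencies[:num_defs])
    missing = num_defs - len(known)
    return known + [max_latency] * missing

# MUL64r on Haswell: three defs (RAX, RDX, EFLAGS), one 4cy SchedWrite.
print(def_latencies(3, [4]))
```

For MUL64r this yields 4cy for all three definitions, which matches the debug
output shown earlier.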
I strongly believe that the intention was to use WriteIMulH to model perf info
for the HIGH part. Currently, WriteIMulH is unused by "normal" scalar integer
multiply.
It turns out that MULX is the only user of WriteIMulH.
Example:
```
mulxq %rdi, %rax, %rcx
```
<span class="quote">> llvm-mca -mcpu=haswell -debug-only=llvm-mca</span >
```
Opcode Name= MULX64rr
SchedClassID=955
[Def] OpIdx=0, Latency=4, WriteResourceID=0
[Def] OpIdx=1, Latency=3, WriteResourceID=0
[Use] OpIdx=2, UseIndex=0
[Use][I] OpIdx=0, UseIndex=1, RegisterID=RDX
MaxLatency=4
NumMicroOps=3
```
Note the difference in latency for the two register writes.
Here, the definition of RAX is associated with WriteIMulH. Also, WriteIMulH is
3cy for Haswell.
It also means that this timeline is valid:
```
Timeline view:
Index 01234567
[0,0] DeeeeER. mulxq %rdi, %rax, %rcx
[0,1] D===eER. addq %rax, %rax
```
RAX is available one cycle before the MULX is fully executed.
Here is the problem:
WriteIMulH is also used by the RM variants of MULX. However, the latency of
WriteIMulH (for all the x86 models) is not "load aware". So, a wrong number of
cycles is used for the RM variants (see below).
Example:
```
mulxq (%rsp), %rax, %rcx
addq %rax, %rax
```
<span class="quote">> llvm-mca -mcpu=haswell -debug-only=llvm-mca</span >
```
Opcode Name= MULX64rm
SchedClassID=956
[Def] OpIdx=0, Latency=9, WriteResourceID=0
[Def] OpIdx=1, Latency=3, WriteResourceID=0 <== !!
[Use] OpIdx=2, UseIndex=0
[Use] OpIdx=4, UseIndex=2
[Use] OpIdx=6, UseIndex=4
[Use][I] OpIdx=0, UseIndex=5, RegisterID=RDX
MaxLatency=9
NumMicroOps=4
```
Notice how the HIGH part is written in just 3cy (i.e. it is written BEFORE the
loaded operand is even available)!
That is because WriteIMulH declares a number of cycles which (on all upstream
targets at least) only makes sense for the RR variants.
If we run that example for 1 iteration, we get this impossible profile:
```
Timeline view:
01
Index 0123456789
[0,0] DeeeeeeeeeER mulxq (%rsp), %rax, %rcx
[0,1] .D==eE-----R addq %rax, %rax
```
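To make the arithmetic behind that timeline explicit, here is a minimal sketch.
It is not llvm-mca code: ready_cycle is a hypothetical helper, and the 5cy load
latency is an assumption inferred from the difference between the 9cy
(MULX64rm) and 4cy (MULX64rr) latencies reported for OpIdx=0.

```python
# Hypothetical model of the timeline above; not llvm-mca code.
def ready_cycle(issue_cycle, write_latency):
    """Cycle at which a register write becomes visible to its users."""
    return issue_cycle + write_latency

ISSUE = 1          # mulxq starts executing at cycle 1 in the timeline
IMULH_RR = 3       # WriteIMulH latency on Haswell, from the debug output
LOAD_LATENCY = 5   # assumption: 9cy (rm) minus 4cy (rr) for OpIdx=0

modelled = ready_cycle(ISSUE, IMULH_RR)
load_aware = ready_cycle(ISSUE, IMULH_RR + LOAD_LATENCY)
print(modelled, load_aware)
```

With the current model the addq can start at cycle 4, as in the timeline. Under
the stated assumption, a load-aware latency would delay it to cycle 9.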
In conclusion, I believe that there are two problems here:
1) I believe that WriteIMulH should also be used by normal multiply
instructions, and not just by MULX. We shouldn't rely on the "default" llvm-mca
behaviour for the absence of writes.
2) WriteIMulH should not be used by both RR and RM multiply variants. We need a
specialised version of WriteIMulH for the case where one of the inputs is a
memory operand. The latency of the HIGH part should take into account the extra
load latency.</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>