[llvm-bugs] [Bug 51495] New: [SchedModel] Inconsistent use of WriteIMulH in the scheduling models.
via llvm-bugs
llvm-bugs at lists.llvm.org
Mon Aug 16 10:28:17 PDT 2021
https://bugs.llvm.org/show_bug.cgi?id=51495
Bug ID: 51495
Summary: [SchedModel] Inconsistent use of WriteIMulH in the
scheduling models.
Product: libraries
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: andrea.dibiagio at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, pengfei.wang at intel.com,
spatel+llvm at rotateright.com
On X86, WriteIMulH provides scheduling information for the second register write
of a MUL/MULX.
However, it turns out that MULX is the only instruction that uses it. For all
other multiply instructions, we only provide a single "SchedWrite", even though
the number of register definitions for most variants is 3 (i.e. RAX, RDX,
EFLAGS).
Example:
Instructions MUL16r, MUL32r and MUL64r implicitly write two registers: AX/DX,
EAX/EDX and RAX/RDX respectively. For MUL32r/m, register EAX contains the LOW
part of the result, while register EDX contains the HIGH part of the result.
Example:
```
mulq %rdi
```
> llvm-mca -mcpu=haswell -debug-only=llvm-mca
```
[Def][I] OpIdx=0, PhysReg=RAX, Latency=4, WriteResourceID=0
[Def][I] OpIdx=1, PhysReg=RDX, Latency=4, WriteResourceID=0
[Def][I] OpIdx=2, PhysReg=EFLAGS, Latency=4, WriteResourceID=0
[Use] OpIdx=0, UseIndex=0
[Use][I] OpIdx=0, UseIndex=1, RegisterID=RAX
MaxLatency=4
NumMicroOps=2
```
By default, the definition of MUL64r only declares a single SchedWrite of
latency 4cy for the implicit write to register RAX.
There is no SchedWrite associated with RDX and EFLAGS. In the absence of
scheduling information, llvm-mca conservatively assumes a default latency of 4
cycles too (i.e. the max instruction latency for MUL64r), and no extra resource
consumption.
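The fallback described above can be illustrated with a small sketch. This is a simplified model of the behaviour reported here, not llvm-mca's actual code; the function name `def_latencies` and its parameters are hypothetical.

```python
def def_latencies(num_defs, write_latencies, max_latency):
    """Latency per register definition, mimicking the fallback described above.

    write_latencies holds the latencies of the SchedWrites that the
    instruction declares, in definition order. Any definition beyond that
    list has no scheduling information and falls back to max_latency.
    """
    return [
        write_latencies[i] if i < len(write_latencies) else max_latency
        for i in range(num_defs)
    ]


# MUL64r on Haswell: three defs (RAX, RDX, EFLAGS) but a single 4cy SchedWrite.
mul64r = def_latencies(3, [4], max_latency=4)

# MULX64rr on Haswell: two defs and two SchedWrites (4cy and 3cy).
mulx64rr = def_latencies(2, [4, 3], max_latency=4)

print(mul64r, mulx64rr)
```

Under this model, RDX and EFLAGS for MUL64r only get a latency because of the fallback, whereas both MULX64rr definitions get theirs from an explicit write.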
I strongly believe that the intention was to use WriteIMulH to model perf info
for the HIGH part. Currently, WriteIMulH is unused by "normal" scalar integer
multiplies; MULX is its only user.
Example:
```
mulxq %rdi, %rax, %rcx
```
> llvm-mca -mcpu=haswell -debug-only=llvm-mca
```
Opcode Name= MULX64rr
SchedClassID=955
[Def] OpIdx=0, Latency=4, WriteResourceID=0
[Def] OpIdx=1, Latency=3, WriteResourceID=0
[Use] OpIdx=2, UseIndex=0
[Use][I] OpIdx=0, UseIndex=1, RegisterID=RDX
MaxLatency=4
NumMicroOps=3
```
Note the difference in latency for the two register writes.
Here, the definition of RAX is associated with WriteIMulH. Also, WriteIMulH is
3cy for Haswell.
It also means that this timeline is valid:
```
Timeline view:
Index 01234567
[0,0] DeeeeER. mulxq %rdi, %rax, %rcx
[0,1] D===eER. addq %rax, %rax
```
RAX is available one cycle before the MULX is fully executed.
Here is the problem:
WriteIMulH is also used by the RM variants of MULX. However, the latency of
WriteIMulH (for all the x86 models) is not "load aware". So, a wrong number of
cycles is used for the RM variants (see below).
Example:
```
mulxq (%rsp), %rax, %rcx
addq %rax, %rax
```
> llvm-mca -mcpu=haswell -debug-only=llvm-mca
```
Opcode Name= MULX64rm
SchedClassID=956
[Def] OpIdx=0, Latency=9, WriteResourceID=0
[Def] OpIdx=1, Latency=3, WriteResourceID=0 <== !!
[Use] OpIdx=2, UseIndex=0
[Use] OpIdx=4, UseIndex=2
[Use] OpIdx=6, UseIndex=4
[Use][I] OpIdx=0, UseIndex=5, RegisterID=RDX
MaxLatency=9
NumMicroOps=4
```
Notice how the HIGH part is written in just 3cy (i.e. it is written BEFORE the
memory operand has even been loaded)!
That is because WriteIMulH declares a number of cycles which (on all upstream
targets at least) only makes sense for the RR variants.
If we run that example for 1 iteration, we get this impossible profile:
```
Timeline view:
                01
Index 0123456789
[0,0] DeeeeeeeeeER mulxq (%rsp), %rax, %rcx
[0,1] .D==eE-----R addq %rax, %rax
```
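The arithmetic behind this timeline can be checked with a rough sketch. It assumes a load-to-use latency of 5 cycles, inferred only from the difference between the two MaxLatency values in the dumps above (9 for MULX64rm, 4 for MULX64rr), and a simplified rule that an instruction can start executing no earlier than the cycle after it is dispatched. The helper name is hypothetical.

```python
def first_execute_cycle(producer_issue, def_latency, consumer_dispatch):
    """Cycle at which a dependent instruction can start executing.

    The consumer waits until the producer's result is ready
    (producer_issue + def_latency) and cannot start before the cycle
    that follows its own dispatch.
    """
    return max(producer_issue + def_latency, consumer_dispatch + 1)


MULX_ISSUE = 1          # mulxq is dispatched at cycle 0 and starts at cycle 1
RAX_DEF_LATENCY = 3     # latency reported for the second def of MULX64rm
LOAD_LATENCY = 9 - 4    # assumption: MaxLatency(rm) - MaxLatency(rr)

# addq is dispatched at cycle 1 in the timeline above.
add_start = first_execute_cycle(MULX_ISSUE, RAX_DEF_LATENCY, consumer_dispatch=1)
load_done = MULX_ISSUE + LOAD_LATENCY

print(add_start, load_done)
```

With these numbers the dependent addq starts at cycle 4, which matches the timeline, while the loaded value would not be available until cycle 6.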
In conclusion, I believe that there are two problems here:
1) I believe that WriteIMulH should also be used by normal multiply
instructions, and not just by MULX. We shouldn't rely on the "default" llvm-mca
behaviour for the absence of writes.
2) WriteIMulH should not be used by both RR and RM multiply variants. We need a
specialised version of WriteIMulH for the case where one of the inputs is a
memory operand. The latency of the HIGH part should take into account the extra
load latency.
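As a rough illustration of what a load-aware latency would look like, the sketch below simply adds a load latency to every register-variant latency. The additive rule and the 5-cycle load latency are assumptions, suggested by the first definition going from 4cy (MULX64rr) to 9cy (MULX64rm) in the dumps above; `fold_load` is a hypothetical helper, not an LLVM API.

```python
def fold_load(rr_latencies, load_latency):
    """Per-definition latencies for a memory-operand variant.

    Assumes that folding a load delays every register definition by the
    same load latency.
    """
    return [latency + load_latency for latency in rr_latencies]


rr_latencies = [4, 3]      # MULX64rr definitions, from the dump above
load_latency = 9 - 4       # assumption: inferred from the two MaxLatency values

rm_latencies = fold_load(rr_latencies, load_latency)
print(rm_latencies)
```

Under these assumptions the second definition of MULX64rm would be 8cy rather than 3cy, so no definition would become available before the load completes.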