[llvm-bugs] [Bug 51495] New: [SchedModel] Inconsistent use of WriteIMulH in the scheduling models.

Mon Aug 16 10:28:17 PDT 2021

https://bugs.llvm.org/show_bug.cgi?id=51495

            Bug ID: 51495
           Summary: [SchedModel] Inconsistent use of WriteIMulH in the
                    scheduling models.
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: andrea.dibiagio at gmail.com
                CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
                    llvm-dev at redking.me.uk, pengfei.wang at intel.com,
                    spatel+llvm at rotateright.com

On X86, WriteIMulH provides scheduling information for the second register write
of a MUL/MULX.

However, it turns out that MULX is the only instruction that uses it. For all
other multiply instructions, we only provide a single "SchedWrite", even though
the number of register definitions for most variants is 3 (i.e. RAX, RDX,
EFLAGS).

Example:
Instructions MUL16r, MUL32r and MUL64r implicitly write the A and D registers
(AX/DX, EAX/EDX and RAX/RDX respectively). For MUL32r/m, register EAX contains
the LOW part of the result, while register EDX contains the HIGH part of the
result.

Example:

```
mulq    %rdi
```

> llvm-mca -mcpu=haswell -debug-only=llvm-mca

```
                [Def][I] OpIdx=0, PhysReg=RAX, Latency=4, WriteResourceID=0
                [Def][I] OpIdx=1, PhysReg=RDX, Latency=4, WriteResourceID=0
                [Def][I] OpIdx=2, PhysReg=EFLAGS, Latency=4, WriteResourceID=0
                [Use]    OpIdx=0, UseIndex=0
                [Use][I] OpIdx=0, UseIndex=1, RegisterID=RAX
                MaxLatency=4
                NumMicroOps=2
```

By default, the definition of MUL64r only declares a single SchedWrite of
latency 4cy for the implicit write to register RAX.

There is no SchedWrite associated with RDX and EFLAGS. In the absence of
scheduling information, llvm-mca conservatively assumes a default latency of 4
cycles too (i.e. the max instruction latency for MUL64r), and no extra resource
consumption.
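
The fallback described above can be sketched as follows. This is a minimal
Python model of the behaviour as described in this report, not llvm-mca's
actual code; the function name `assign_def_latencies` and its parameters are
hypothetical.

```python
def assign_def_latencies(num_defs, write_latencies, max_latency):
    """Return one latency per register definition.

    Definitions covered by a SchedWrite use that write's latency; the
    remaining definitions fall back to the instruction's maximum latency.
    """
    latencies = []
    for idx in range(num_defs):
        if idx < len(write_latencies):
            latencies.append(write_latencies[idx])
        else:
            # No scheduling info for this def: be conservative.
            latencies.append(max_latency)
    return latencies


# MUL64r on Haswell: three defs (RAX, RDX, EFLAGS), one SchedWrite of 4cy.
mul64r = assign_def_latencies(num_defs=3, write_latencies=[4], max_latency=4)
print(mul64r)
```

With a second SchedWrite of 3cy (as MULX has), the same model would give the
second definition its own latency instead of the conservative default.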

I strongly believe that the intention was to use WriteIMulH to model perf info
for the HIGH part. Currently, WriteIMulH is unused by the "normal" scalar
integer multiplies; MULX is its only user.

Example:

```
mulxq %rdi, %rax, %rcx
```

> llvm-mca -mcpu=haswell -debug-only=llvm-mca

```
                Opcode Name= MULX64rr
                SchedClassID=955
                [Def]    OpIdx=0, Latency=4, WriteResourceID=0
                [Def]    OpIdx=1, Latency=3, WriteResourceID=0
                [Use]    OpIdx=2, UseIndex=0
                [Use][I] OpIdx=0, UseIndex=1, RegisterID=RDX
                MaxLatency=4
                NumMicroOps=3
```

Note the difference in latency for the two register writes.
Here, the definition of RAX is associated with WriteIMulH, which has a latency
of 3cy on Haswell.

It also means that this timeline is valid:

```
Timeline view:
Index     01234567

[0,0]     DeeeeER.   mulxq      %rdi, %rax, %rcx
[0,1]     D===eER.   addq       %rax, %rax
```

RAX is available one cycle before the MULX is fully executed.

Here is the problem:

WriteIMulH is also used by the RM variants of MULX. However, the latency of
WriteIMulH (for all the x86 models) is not "load aware". As a result, the wrong
number of cycles is used for the RM variants (see below).

Example:
```
mulxq  (%rsp), %rax, %rcx
addq   %rax, %rax
```

> llvm-mca -mcpu=haswell -debug-only=llvm-mca

```
                Opcode Name= MULX64rm
                SchedClassID=956
                [Def]    OpIdx=0, Latency=9, WriteResourceID=0
                [Def]    OpIdx=1, Latency=3, WriteResourceID=0  <== !!
                [Use]    OpIdx=2, UseIndex=0
                [Use]    OpIdx=4, UseIndex=2
                [Use]    OpIdx=6, UseIndex=4
                [Use][I] OpIdx=0, UseIndex=5, RegisterID=RDX
                MaxLatency=9
                NumMicroOps=4
```

Notice how the HIGH part is written in just 3cy (i.e. it is written BEFORE the
load operand is even available)!

That is because WriteIMulH declares a number of cycles which (on all upstream
targets at least) only makes sense for the RR variants.

If we run that example for 1 iteration, we get this impossible profile:

```
Timeline view:
                    01
Index     0123456789

[0,0]     DeeeeeeeeeER   mulxq  (%rsp), %rax, %rcx
[0,1]     .D==eE-----R   addq   %rax, %rax
```
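
A quick sanity check makes the contradiction explicit. The sketch below uses
the numbers from the two debug dumps above and assumes that the 5-cycle gap
between the RM and RR maximum latencies (9 vs 4) is the load latency; the
helper name is hypothetical.

```python
def def_ready_before_load(def_latency, rr_max_latency, rm_max_latency):
    """Return True if a def of the RM form is marked ready before the
    folded load could have delivered its data.

    Assumption: load latency = rm_max_latency - rr_max_latency.
    """
    load_latency = rm_max_latency - rr_max_latency
    return def_latency < load_latency


# MULX64rm on Haswell, values taken from the debug output above.
first_def_too_early = def_ready_before_load(9, 4, 9)   # OpIdx=0
second_def_too_early = def_ready_before_load(3, 4, 9)  # OpIdx=1 (WriteIMulH)
print(first_def_too_early, second_def_too_early)
```

Only the definition modelled by WriteIMulH is flagged, which matches the
`addq` starting to execute while the `mulxq` is still waiting on its load.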

In conclusion, I believe that there are two problems here:

1) I believe that WriteIMulH should also be used by normal multiply
instructions, and not just by MULX. We shouldn't rely on the "default" llvm-mca
behaviour for the absence of writes.

2) WriteIMulH should not be used by both RR and RM multiply variants. We need a
specialised version of WriteIMulH for the case where one of the inputs is a
memory operand. The latency of the HIGH part should take into account the extra
load latency.
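
As an illustration of the point about RM variants, a load-aware latency for the
HIGH part could be derived by adding the load latency to the RR latency. This
is only a sketch, not a proposed patch: the names are hypothetical, and the
5-cycle load latency is inferred from the Haswell numbers above (9cy for the RM
form minus 4cy for the RR form).

```python
def load_aware_latency(rr_latency, load_latency):
    """Latency of a register write when one source operand comes from memory."""
    return rr_latency + load_latency


# Inferred: MULX64rm MaxLatency minus MULX64rr MaxLatency.
HASWELL_LOAD_LATENCY = 9 - 4
# WriteIMulH latency reported for Haswell in this report.
WRITE_IMULH_RR = 3

write_imulh_rm = load_aware_latency(WRITE_IMULH_RR, HASWELL_LOAD_LATENCY)
print(write_imulh_rm)
```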
