<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [SchedModel] Inconsistent use of WriteIMulH in the scheduling models."

   href="https://bugs.llvm.org/show_bug.cgi?id=51495">51495</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[SchedModel] Inconsistent use of WriteIMulH in the scheduling models.

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>andrea.dibiagio@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, pengfei.wang@intel.com, spatel+llvm@rotateright.com

          </td>

        </tr></table>

      <p>

        <div>

        <pre>On X86, WriteIMulH provide scheduling information for the second register write

of a MUL/MULX.

However, it turns out that MULX is the only instruction that uses it. For all

other multiply instructions, we only provide a single "SchedWrite", even though

the number of register definitions for most variants is 3 (i.e. RAX, RDX,

EFLAGS).

Example:

Instructions MUL16r, MUL32r and MUL64r implicitly write AX/DX, EAX/EDX and

RAX/RDX respectively. For MUL32r/m, register EAX contains the LOW part of the

result, while register EDX contains the HIGH part of the result.

Example:

```

mulq    %rdi

```

Example: -mcpu=haswell -debug-only=llvm-mca

```

                [Def][I] OpIdx=0, PhysReg=RAX, Latency=4, WriteResourceID=0

                [Def][I] OpIdx=1, PhysReg=RDX, Latency=4, WriteResourceID=0

                [Def][I] OpIdx=2, PhysReg=EFLAGS, Latency=4, WriteResourceID=0

                [Use]    OpIdx=0, UseIndex=0

                [Use][I] OpIdx=0, UseIndex=1, RegisterID=RAX

                MaxLatency=4

                NumMicroOps=2

```

By default, the definition of MUL64r only declares a single SchedWrite of

latency 4cy for the implicit write to register RAX.

There is no SchedWrite associated with RDX and EFLAGS. In the absence of

scheduling information, llvm-mca conservatively assumes a default latency of 4

cycles too (i.e. the max instruction latency for MUL64r), and no extra resource

consumption.
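
As a rough sketch of the behaviour described above (a simplified model, not
llvm-mca's actual code): definitions covered by a SchedWrite take that write's
latency, and the remaining ones fall back to the maximum declared latency. The
function name and data layout are hypothetical.

```python
def assign_def_latencies(defs, write_latencies):
    """Map each register definition to a latency (simplified model).

    defs            -- register definitions, in operand order
    write_latencies -- latencies of the declared SchedWrites, in the same
                       order; assumed to be no longer than defs
    """
    max_latency = max(write_latencies)
    # Definitions without a SchedWrite get the maximum declared latency.
    missing = len(defs) - len(write_latencies)
    padded = list(write_latencies) + [max_latency] * missing
    return dict(zip(defs, padded))


# MUL64r: one SchedWrite of 4cy, three register definitions.
mul64r = assign_def_latencies(["RAX", "RDX", "EFLAGS"], [4])
print(mul64r)
```

All three definitions end up with a latency of 4cy, which matches the debug
output shown for MUL64r.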

I strongly believe that the intention was to use WriteIMulH to model perf info

for the HIGH part. Currently, WriteIMulH is unused by "normal" scalar integer

multiply.

It turns out that MULX is the only user of WriteIMulH.

Example:

```

mulxq %rdi, %rax, %rcx

```

<span class="quote">> llvm-mca -mcpu=haswell -debug-only=llvm-mca</span >

```

                Opcode Name= MULX64rr

                SchedClassID=955

                [Def]    OpIdx=0, Latency=4, WriteResourceID=0

                [Def]    OpIdx=1, Latency=3, WriteResourceID=0

                [Use]    OpIdx=2, UseIndex=0

                [Use][I] OpIdx=0, UseIndex=1, RegisterID=RDX

                MaxLatency=4

                NumMicroOps=3

```

Note the difference in latency for the two register writes.

Here, the definition of RAX is associated with WriteIMulH. Also, WriteIMulH is

3cy for Haswell.

It also means that this timeline is valid:

```

Timeline view:

Index     01234567

[0,0]     DeeeeER.   mulxq      %rdi, %rax, %rcx

[0,1]     D===eER.   addq       %rax, %rax

```

RAX is available one cycle before the MULX is fully executed.
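
The arithmetic behind that timeline can be sketched as follows. This is a
minimal model with a hypothetical helper name; it ignores resource pressure
and assumes an instruction starts executing no earlier than the cycle after
its dispatch.

```python
def first_execute_cycle(producer_start, write_latency, consumer_dispatch):
    """Earliest cycle at which a dependent instruction can start executing."""
    # The operand is ready once the write latency has elapsed, counted from
    # the cycle in which the producer started executing.
    operand_ready = producer_start + write_latency
    # Assumption: execution starts no earlier than the cycle after dispatch.
    return max(operand_ready, consumer_dispatch + 1)


# mulxq starts executing at cycle 1; the write to RAX has a latency of 3cy.
add_start = first_execute_cycle(producer_start=1, write_latency=3,
                                consumer_dispatch=0)
print(add_start)  # 4, the column of the first 'e' in the addq row
```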

Here is the problem:

WriteIMulH is also used by the RM variants of MULX. However, the latency of

WriteIMulH (for all the x86 models) is not "load aware". So, a wrong number of

cycles is used for the RM variants (see below).

Example:

```

mulxq  (%rsp), %rax, %rcx

addq   %rax, %rax

```

<span class="quote">> llvm-mca -mcpu=haswell -debug-only=llvm-mca</span >

```

                Opcode Name= MULX64rm

                SchedClassID=956

                [Def]    OpIdx=0, Latency=9, WriteResourceID=0

                [Def]    OpIdx=1, Latency=3, WriteResourceID=0  <== !!

                [Use]    OpIdx=2, UseIndex=0

                [Use]    OpIdx=4, UseIndex=2

                [Use]    OpIdx=6, UseIndex=4

                [Use][I] OpIdx=0, UseIndex=5, RegisterID=RDX

                MaxLatency=9

                NumMicroOps=4

```

Notice how the HIGH part is written in just 3cy (i.e. it is written BEFORE the

loaded operand is even available)!

That is because WriteIMulH declares a number of cycles which (on all upstream

targets at least) only makes sense for the RR variants.

If we run that example for 1 iteration, we get this impossible profile:

```

Timeline view:

                    01

Index     0123456789

[0,0]     DeeeeeeeeeER   mulxq  (%rsp), %rax, %rcx

[0,1]     .D==eE-----R   addq   %rax, %rax

```
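
For comparison, a load-aware model would add the load latency to every
register write of the RM form. The sketch below assumes a load latency of 5cy
on Haswell, inferred from the difference between the 9cy (MULX64rm) and 4cy
(MULX64rr) latencies in the dumps above; the function name is hypothetical.

```python
def fold_load(rr_write_latencies, load_latency):
    """Derive RM write latencies by adding the load latency to each RR write."""
    return [latency + load_latency for latency in rr_write_latencies]


rr_latencies = [4, 3]   # MULX64rr: OpIdx=0 and OpIdx=1
load_latency = 9 - 4    # assumption: inferred from MULX64rm vs MULX64rr
expected_rm = fold_load(rr_latencies, load_latency)
print(expected_rm)      # [9, 8]; the current model reports [9, 3]
```

Under this model the second write would have a latency of 8cy, so the
dependent addq could not start executing before cycle 9 instead of cycle 4.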

In conclusion, I believe that there are two problems here:

1) WriteIMulH should not be used by both RR and RM multiply variants. We need a

specialised version of WriteIMulH for the case where one of the inputs is a

memory operand. The latency of the HIGH part should take into account the extra

load latency.

2) I believe that WriteIMulH should also be used by normal multiply

instructions, and not just by MULX. We shouldn't rely on the "default" llvm-mca

behaviour for the absence of writes.</pre>

        </div>

      </p>


    </body>

</html>