[llvm-bugs] [Bug 51495] [SchedModel] Inconsistent use of WriteIMulH in the scheduling models.

Wed Aug 25 12:04:13 PDT 2021

https://bugs.llvm.org/show_bug.cgi?id=51495

Andrea Di Biagio <andrea.dibiagio at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #7 from Andrea Di Biagio <andrea.dibiagio at gmail.com> ---
(In reply to Roman Lebedev from comment #6)
> (In reply to Andrea Di Biagio from comment #5)
> > (In reply to Roman Lebedev from comment #4)
> > > (In reply to Andrea Di Biagio from comment #3)
> > > > Fixed by commit 5f848b311f16
> > > 
> > > I think we have at least one more problem - are we actually modelling write
> > > of high part correctly?
> > > Usually the low part of multiplication is available 1cy before the high
> > > part.
> > > 
> > > So given `mulxl %eax, %eax, %ecx`, its latency is 3,
> > > since the high part (%ecx) is unused,
> > > but `mulxl	%eax, %eax, %eax` should have latency of 4.
> > > 
> > > But look how e.g. haswell models this: https://godbolt.org/z/qjcTc4zY1 ?!?! 
> > > Is it really correct that in both of the cases the multiplication can start
> > > while the high part hasn't been written yet?
> > 
> > I was also talking about this issue with Simon before.
> > We also think that those numbers look incorrect.
> > 
> > I believe that Haswell model has a bug; the latency of the HI part is
> > strangely lower than the latency of the LO part. In theory, it should be the
> > other way round.
> The reason I'm bringing this up is that if I fix it,
> then it still doesn't work, and then they all take 4cy.
> 

I think I understand what you mean by "they all take 4cy".

I tried to fix the MULX writes in Haswell by inverting the latencies for the LO
and HI parts. At that point, I was able to reproduce the issue (i.e. writes all
take 4cy).

It turns out that all this time we were wrongly assuming that WriteIMulH
describes the HI register write of MULX.

In fact, the writes are assigned in the wrong order: what we call WriteIMulH is
actually the LO register write, and WriteMULX is currently in the "HI"
position.

It becomes more obvious if we look at the llvm-mc output:

> llvm-mc mulx.s -show-inst
```
        .text
        mulxl   %eax, %eax, %ecx                # <MCInst #1967 MULX32rr
                                        #  <MCOperand Reg:25>
                                        #  <MCOperand Reg:22>
                                        #  <MCOperand Reg:22>>
```

In this example, the HI part of the result is in register ECX (i.e. the operand
at index #0), while the LO part of the result is in register EAX (i.e. the
operand at index #1).

NOTE: EAX is Reg:22; ECX is Reg:25.

WriteMULX32 was therefore assigned to the write at index #0, while WriteIMulH
was assigned to the register write at index #1.

That explains why the definition of WriteIMulH always had a lower latency for
Intel models.
We can easily fix this by inverting the order of writes in the tablegen
definition of MULX.
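
To make the ordering problem concrete, here is a minimal sketch of the
positional pairing. It is purely illustrative: the helper `bind_writes` and the
"hi"/"lo" labels are hypothetical names, and this is not how tablegen or mca
actually represent writes. It only assumes that the N-th scheduling write is
bound to the N-th definition operand.

```python
def bind_writes(defs, sched_writes):
    """Pair each definition operand with the scheduling write at the same position."""
    return dict(zip(defs, sched_writes))

# Definition operands of MULX32rr in MCInst order, as in the -show-inst dump:
# operand #0 holds the HI part (ECX), operand #1 holds the LO part (EAX).
MULX_DEFS = ("hi", "lo")

# Current order of writes: WriteIMulH ends up describing the LO write.
current = bind_writes(MULX_DEFS, ("WriteMULX32", "WriteIMulH"))

# Inverted order: WriteIMulH now describes the HI write, as its name suggests.
inverted = bind_writes(MULX_DEFS, ("WriteIMulH", "WriteMULX32"))

print(current)
print(inverted)
```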

Unfortunately, that's not enough to fix the problem that you are seeing.
There is another bug in mca: if an instruction performs two writes to the same
register, then the last write overrides the first write. That's why you are
still seeing the same latency for the case where all the destination registers
of MULX are EAX.
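
A toy model of that second bug, again only a sketch under stated assumptions:
the function names are hypothetical, the latencies (LO after 3cy, HI after 4cy)
are the illustrative numbers from this thread, and the write order is assumed
to follow the operand order. It does not reproduce mca's real bookkeeping or
the numbers mca reports. `keep_slowest` is just one plausible alternative, not
necessarily the fix that will be adopted.

```python
def last_write_wins(writes):
    """Buggy bookkeeping: a later write to a register replaces an earlier one."""
    ready = {}
    for reg, latency in writes:
        ready[reg] = latency
    return ready

def keep_slowest(writes):
    """Alternative: a register is ready only once its slowest write completes."""
    ready = {}
    for reg, latency in writes:
        ready[reg] = max(latency, ready.get(reg, 0))
    return ready

HI_LATENCY, LO_LATENCY = 4, 3  # assumed values, for illustration only

# mulxl %eax, %eax, %ecx : HI goes to ECX, LO goes to EAX
distinct = [("ECX", HI_LATENCY), ("EAX", LO_LATENCY)]
# mulxl %eax, %eax, %eax : both writes target EAX
same = [("EAX", HI_LATENCY), ("EAX", LO_LATENCY)]

print(last_write_wins(distinct))
print(last_write_wins(same))   # the HI write to EAX is lost
print(keep_slowest(same))      # the HI latency is preserved
```

In this model the two forms of the instruction become indistinguishable on EAX
under `last_write_wins`, which matches the symptom of both cases showing the
same latency.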
