[PATCH] D60441: [X86] Make _Int instructions the preferred instructon for the assembly parser and disassembly parser to remove inconsistencies between VEX and EVEX.

Tue Apr 9 09:48:13 PDT 2019

andreadb added a comment.

The reason why there is a regression is because this patch adds an extra input operand to the following instructions:

  cvtsi2ssl  %ecx, %xmm0
  cvtsi2sdl  %ecx, %xmm0

For example:

  cvtsi2ssl       %ecx, %xmm0     # <MCInst #775 CVTSI2SSrr_Int
                                  #  <MCOperand Reg:142>
                                  #  <MCOperand Reg:142>
                                  #  <MCOperand Reg:25>>

XMM0 is now an in/out operand. Before this patch, XMM0 was only an output register.
This is equivalent to introducing a false dependency on the output register. That is what causes the extra latency from those two mca tests.

There is a way to workaround the problem introduced by the presence of that extra register read.
We can force that extra read to always have zero-latency by introducing a special "ReadAdvance" definition.

  --- X86Schedule.td      (revision 357997)
  +++ X86Schedule.td      (working copy)
  @@ -22,6 +22,7 @@
   // This SchedRead describes a bypass delay caused by data being moved from the
   // integer unit to the floating point unit.
   def ReadInt2Fpu : SchedRead;
  +def ReadAfterInt2Fpu : SchedRead;

For BdVer2 `ReadAfterInt2Fpu ` would be defined as:

  def : ReadAdvance<ReadAfterInt2Fpu, 13>;

For BtVer2 it would be defined as follows:

  def : ReadAdvance<ReadAfterInt2Fpu, 7>;

The last step is to add `ReadAfterInt2Fpu` to the schedule read/write list of SSE CVTSI*_Int definitions.
In my experimental change, I had to change multiclass `sse12_cvt_sint_3addr` by adding an extra param (see below):

  Index: X86InstrSSE.td
  ===================================================================
  --- X86InstrSSE.td      (revision 357997)
  +++ X86InstrSSE.td      (working copy)
  @@ -1026,6 +989,7 @@
   multiclass sse12_cvt_sint_3addr<bits<8> opc, RegisterClass SrcRC,
                       RegisterClass DstRC, X86MemOperand x86memop,
                       string asm, X86FoldableSchedWrite sched,
  +                    SchedRead ReadAdv = ReadDefault,
                       bit Is2Addr = 1> {

That required the following changes too:

  +let Predicates = [UseAVX] in {
  +defm VCVTSI2SS : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
  +          i32mem, "cvtsi2ss{l}", WriteCvtI2SS, ReadDefault, 0>, XS, VEX_4V;
  +defm VCVTSI642SS : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
  +          i64mem, "cvtsi2ss{q}", WriteCvtI2SS, ReadDefault, 0>, XS, VEX_4V, VEX_W;
  +defm VCVTSI2SD : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
  +          i32mem, "cvtsi2sd{l}", WriteCvtI2SD, ReadDefault, 0>, XD, VEX_4V;
  +defm VCVTSI642SD : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
  +          i64mem, "cvtsi2sd{q}", WriteCvtI2SD, ReadDefault, 0>, XD, VEX_4V, VEX_W;
  +}
  +let Constraints = "$src1 = $dst" in {
  +  defm CVTSI2SS : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
  +                        i32mem, "cvtsi2ss{l}", WriteCvtI2SS, ReadAfterInt2Fpu>, XS;
  +  defm CVTSI642SS : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
  +                        i64mem, "cvtsi2ss{q}", WriteCvtI2SS, ReadAfterInt2Fpu>, XS, REX_W;
  +  defm CVTSI2SD : sse12_cvt_sint_3addr<0x2A, GR32, VR128,
  +                        i32mem, "cvtsi2sd{l}", WriteCvtI2SD, ReadAfterInt2Fpu>, XD;
  +  defm CVTSI642SD : sse12_cvt_sint_3addr<0x2A, GR64, VR128,
  +                        i64mem, "cvtsi2sd{q}", WriteCvtI2SD, ReadAfterInt2Fpu>, XD, REX_W;
  +}

This was enough to avoid the regression on BtVer2.

There is one last change required for BdVer2:

  ===================================================================
  --- X86ScheduleBdVer2.td        (revision 357997)
  +++ X86ScheduleBdVer2.td        (working copy)
  @@ -898,7 +899,8 @@
     let Latency = 13;
     let NumMicroOps = 2;
   }
  -def : InstRW<[PdWriteCVTSI642SDrr_CVTSI642SSrr_CVTSI2SDr_CVTSI2SSrr], (instrs CVTSI642SDrr, CVTSI642SSrr, CVTSI2SDrr, CVTSI2SSrr)>;
  +def : InstRW<[PdWriteCVTSI642SDrr_CVTSI642SSrr_CVTSI2SDr_CVTSI2SSrr, ReadAfterInt2Fpu], (instrs CVTSI642SDrr, CVTSI642SSrr, CVTSI2SDrr, CVTSI2SSrr,
  +                                                                              CVTSI642SDrr_Int, CVTSI642SSrr_Int, CVTSI2SDrr_Int, CVTSI2SSrr_Int)>;

Basically, we need to explicitly pass `ReadAfterInt2Fpu` to that InstRW. Otherwise the regression would not go away.

Tbh. I don't know if there is another way to fix that regression.
If we want to fix it, then this may be a way to do it.

I hope it helps.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D60441/new/

https://reviews.llvm.org/D60441