[PATCH] D33099: AMD Jaguar scheduler doesn't correctly model 256-bit AVX instructions

Wed Jun 7 07:04:39 PDT 2017

RKSimon added inline comments.

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:22
+  // in-flight in the 64-macro-op in-flight window that the integer retire control unit provides.
+  let MicroOpBufferSize = 64; // Integer Retire Control Unit
   let LoadLatency = 5; // FPU latency (worse case cf Integer 3 cycle latency)
----------------
It is still the Retire Control Unit; it's just that the FPU can only touch 44 of the entries.
```
let MicroOpBufferSize = 64; // Retire Control Unit
```

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:25
   let PostRAScheduler = 1;
-
   // FIXME: SSE4/AVX is unimplemented. This flag is set to allow
----------------
Don't remove whitespace.

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:94
                           int Lat> {
+
   // Register variant is using a single cycle on ExePort.
----------------
Undo this whitespace change.

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:176
 defm : JWriteResFpuPair<WriteFShuffle256, JFPU01, 1>;
-
 def : WriteRes<WriteFSqrt, [JFPU1, JLAGU, JFPM]> {
----------------
Don't remove whitespace.

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:370
+
+def WriteVMULPD: SchedWriteRes<[JFPU1]> {
+  let Latency = 4;
----------------
WriteVMULYPD

For all these defs, can you please include the 'Y' to make it clear that it's just the 256-bit case?
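
A rough sketch of the rename, using only the lines quoted above (the rest of the def body isn't shown in this comment, so it is left as a placeholder comment rather than guessed):
```
// 'Y' suffix marks the 256-bit (YMM) variant.
def WriteVMULYPD: SchedWriteRes<[JFPU1]> {
  let Latency = 4;
  // remaining fields unchanged from the patch
}
```
Any InstRW entries that reference the old name would need the same rename.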

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:442
+
+// FIXME: We don't need 'Ld' version for AVX11 because deafult ResourceCycles == 1
+// TODO: How to use ResourceCycles from non-folding version like we do it for Latency?
----------------
What is AVX11?

Spelling: deafult -> default

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:450
+                  "VBROADCASTF128", "VBROADCASTSSrr", "VINSERTF128rr",
+                  "VMOVAP(D|S)rm", "VMOVDDUPYrr", "VMOVS(H|L)DUPYrr", "VMOVUP(D|S)Yrm",
+                  "VORP(S|D)Yrr", "VPERMILP(D|S)Yri", "VSHUFP(D|S)Yrri", "VUNPCK(H|L)P(D|S)rr",
----------------
"VMOVAP(D|S)rm" etc. are memory loads - they should be in the Ld version

================
Comment at: test/CodeGen/X86/avx-vzeroupper.ll:163
+; NO-VZ-NEXT:    popq %rbx
+; NO-VZ-NEXT:    retq
 entry:
----------------
What is causing this?

================
Comment at: test/CodeGen/X86/recip-fastmath.ll:344
 ; BTVER2-NEXT:    vrcpps %xmm0, %xmm1 # sched: [2:1.00]
+; BTVER2-NEXT:    vmovaps {{.*#+}} xmm2 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] sched: [1:1.00]
 ; BTVER2-NEXT:    vmulps %xmm1, %xmm0, %xmm0 # sched: [2:1.00]
----------------
This is a load, so the latency should be 5 cycles.

https://reviews.llvm.org/D33099