[PATCH] D33099: AMD Jaguar scheduler doesn't correctly model 256-bit AVX instructions

Wed May 17 08:47:19 PDT 2017

avt77 added inline comments.

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:71
+def JFPIntCluster : ProcResGroup<[JVALU0, JVALU1, JSTC]>;
+
 // Integer loads are 3 cycles, so ReadAfterLd registers needn't be available until 3
----------------
RKSimon wrote:
> I don't think adding these Cluster groups is necessary. TBH most of the ProcResource defs appear to be superfluous - most aren't used at all - we're just using the JFPU0/JFPU0/JFPU01 defs, with a few others for the longer op chain instructions.
Several nstructions below could be executed on FPA or on FPM that's why we need a possibility to say it and that's why I created JFPFltCluster. Is it OK?
And I created JFPIntCluster cluster just in case: should I remove it?

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:349
+
+def WriteFAddYY: SchedWriteRes<[JFPA]> {
+  let Latency = 3;
----------------
RKSimon wrote:
> Better off using JFPU0 as that's what is actually bound to the buffer. Same for the others below.
Are you sure it's a good idea? FP0 includes VALU0, VIMUL and FPA. I'm using FPA because this instruction uses exactly FPA.
At the same time if we use FPU0 then througput will be 2/3 = 0.666 and that's wrong.
Or you mean that our instruction is FP and it should not deal with VALU0, VIMUL? In this case we should change the algorithm again.

================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:357
+  let Latency = 8;
+  let ResourceCycles = [2];
+}
----------------
RKSimon wrote:
> Shouldn't this def be something like the below, to show it will consume the AGU for a cycle? Same for the other loads.
> ```
> def WriteFAddYMLd: SchedWriteRes<[JLAGU,JFPU0]>  {
>   let Latency = 8;
>   let ResourceCycles = [1,2];
> }
> ```
> 
I thought about but Software Optimization Guide does not show it (I mean it says about AGU but it does not include the additional cycle in its tables).  Should I update the model?

================
Comment at: test/CodeGen/X86/slow-unaligned-mem.ll:89
 ; FAST:       # BB#0:
-; FAST-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; FAST:         movl {{[0-9]+}}(%esp), %eax
 ; FAST-NOT:     movl
----------------
RKSimon wrote:
> ????
This test was written by hand that's why it's difficult to compare the results but the new version generates:

# BB#0:
        vxorps  %ymm0, %ymm0, %ymm0
        movl    4(%esp), %eax
        vmovups %ymm0, 32(%eax)
        vmovups %ymm0, (%eax)
        retl

As you see we have vxorps between # BB#0: and 'movl'. I decided it's acceptable. Am I wrong?

https://reviews.llvm.org/D33099