[PATCH] D33099: AMD Jaguar scheduler doesn't correctly model 256-bit AVX instructions
Andrew V. Tischenko via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed May 17 08:47:19 PDT 2017
avt77 added inline comments.
================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:71
+def JFPIntCluster : ProcResGroup<[JVALU0, JVALU1, JSTC]>;
+
// Integer loads are 3 cycles, so ReadAfterLd registers needn't be available until 3
----------------
RKSimon wrote:
> I don't think adding these Cluster groups is necessary. TBH most of the ProcResource defs appear to be superfluous - most aren't used at all - we're just using the JFPU0/JFPU0/JFPU01 defs, with a few others for the longer op chain instructions.
Several nstructions below could be executed on FPA or on FPM that's why we need a possibility to say it and that's why I created JFPFltCluster. Is it OK?
And I created JFPIntCluster cluster just in case: should I remove it?
================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:349
+
+def WriteFAddYY: SchedWriteRes<[JFPA]> {
+ let Latency = 3;
----------------
RKSimon wrote:
> Better off using JFPU0 as that's what is actually bound to the buffer. Same for the others below.
Are you sure it's a good idea? FP0 includes VALU0, VIMUL and FPA. I'm using FPA because this instruction uses exactly FPA.
At the same time if we use FPU0 then througput will be 2/3 = 0.666 and that's wrong.
Or you mean that our instruction is FP and it should not deal with VALU0, VIMUL? In this case we should change the algorithm again.
================
Comment at: lib/Target/X86/X86ScheduleBtVer2.td:357
+ let Latency = 8;
+ let ResourceCycles = [2];
+}
----------------
RKSimon wrote:
> Shouldn't this def be something like the below, to show it will consume the AGU for a cycle? Same for the other loads.
> ```
> def WriteFAddYMLd: SchedWriteRes<[JLAGU,JFPU0]> {
> let Latency = 8;
> let ResourceCycles = [1,2];
> }
> ```
>
I thought about but Software Optimization Guide does not show it (I mean it says about AGU but it does not include the additional cycle in its tables). Should I update the model?
================
Comment at: test/CodeGen/X86/slow-unaligned-mem.ll:89
; FAST: # BB#0:
-; FAST-NEXT: movl {{[0-9]+}}(%esp), %eax
+; FAST: movl {{[0-9]+}}(%esp), %eax
; FAST-NOT: movl
----------------
RKSimon wrote:
> ????
This test was written by hand that's why it's difficult to compare the results but the new version generates:
# BB#0:
vxorps %ymm0, %ymm0, %ymm0
movl 4(%esp), %eax
vmovups %ymm0, 32(%eax)
vmovups %ymm0, (%eax)
retl
As you see we have vxorps between # BB#0: and 'movl'. I decided it's acceptable. Am I wrong?
https://reviews.llvm.org/D33099
More information about the llvm-commits
mailing list