[PATCH] D52779: AMD BdVer2 (Piledriver) Initial Scheduler model

Fri Oct 5 04:07:47 PDT 2018

andreadb added inline comments.

================
Comment at: lib/Target/X86/X86ScheduleBdVer2.td:1067
+
+// VXORPSYrr, VXORPDYrr, VANDNPSYrr, VANDNPDYrr "zero-idioms" have latency of 1.
+
----------------
lebedev.ri wrote:
> andreadb wrote:
> > Do you plan to add these too?
> > I noticed that you have marked those as dep-breaking. However, if I read correctly, those still map to WriteFLogicY, which declares 2 resource cycles. Presumably these zero-idioms should only consume 1 resource cycle (to execute the zero-move to the upper half of YMM).
> Resource cycles is something i haven't touched at all.. Like completely at all.
> I do not even really understand how they are calculated.
> 
> I'm not fully sure what/how this should be, so i'd leave this as-is for now..
That comment is unexpected from a person that just wrote an entire scheduling model... How were you able to write all of this without knowing what "resource cycles" actually means? :-)

Anyway... TargetSchedule.td has a nice description of resource cycles. They are used to model the consumption of resources, i.e. for how many cycles a write keeps each processor resource busy.

It would be really strange for a zero-idiom XOR to consume the same resource cycles as a normal (i.e. non-zero-idiom) XOR.
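
A minimal sketch of what that could look like, following the variant-write pattern other X86 models use for zero-idioms. The names `PdWriteZeroIdiomFLogicY` and `PdWriteFLogicYZeroIdiom` and the `PdFPU01` pipe are placeholders, not taken from the patch; the resource list should mirror whatever WriteFLogicY actually consumes. `ZeroIdiomPredicate` is assumed to be the existing same-register check from X86SchedPredicates.td.

```tablegen
// Hypothetical write for the 256-bit logic zero-idioms: latency 1 (as the
// comment in the patch states) and a single resource cycle instead of two.
// PdFPU01 is a placeholder for the pipes WriteFLogicY really uses.
def PdWriteZeroIdiomFLogicY : SchedWriteRes<[PdFPU01]> {
  let Latency = 1;
  let ResourceCycles = [1];
}

// Select the cheaper write only when both sources are the same register;
// everything else keeps the normal WriteFLogicY behavior.
def PdWriteFLogicYZeroIdiom : SchedWriteVariant<[
  SchedVar<MCSchedPredicate<ZeroIdiomPredicate>, [PdWriteZeroIdiomFLogicY]>,
  SchedVar<MCSchedPredicate<TruePred>,           [WriteFLogicY]>
]>;

def : InstRW<[PdWriteFLogicYZeroIdiom],
             (instrs VXORPSYrr, VXORPDYrr, VANDNPSYrr, VANDNPDYrr)>;
```

With something like this, the non-idiom forms still pay the 2 resource cycles of WriteFLogicY, and only the dependency-breaking forms get the reduced cost.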

================
Comment at: lib/Target/X86/X86ScheduleBdVer2.td:55-56
+
+// Two AGLU pipes, identical.
+def PdAGLU01 : ProcResource<2>; // AGU, Integer Pipe[23]
+
----------------
This may not work as you expect.
The document here: https://www.realworldtech.com/bulldozer/8/
suggests that the L1D cannot sustain more than one store per cycle.

It is true that the two AGEN units are identical.
However, (correct me if I am wrong) I don't think that you can issue two stores per cycle.

Also, it would be interesting to see if you can actually issue two independent loads per cycle. As the 'realworldtech' document suggests, the extra load port in the L1D probably exists so that load bandwidth is not halved when executing AVX 256-bit loads (which are 2 COPs).

Ideally, you should check if two independent loads can be issued in the same cycle (i.e. not just the LO/HI parts of the same AVX 256b load).
Also, it doesn't look like the L1D has two ports for store operations.

The easiest way to work around this issue is to define separate units for the load and store AGUs, and let the "writes" in tablegen select which AGEN they actually consume.
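
Roughly what I have in mind, as a sketch only. `PdAGULoad` and `PdAGUStore` are hypothetical names, and the unit counts encode assumptions that still need to be checked on hardware:

```tablegen
// Hypothetical split of the AGU by operation kind.
def PdAGULoad  : ProcResource<2>; // assumes two independent loads per cycle
def PdAGUStore : ProcResource<1>; // at most one store per cycle into the L1D

// Each write picks the AGEN it consumes. Latencies and any other
// resources a load/store needs are intentionally left out of this sketch.
def : WriteRes<WriteLoad,  [PdAGULoad]>;
def : WriteRes<WriteStore, [PdAGUStore]>;
```

If it turns out that two independent loads cannot issue in the same cycle, the load unit count would have to drop to 1, and the 256-bit load case would need separate handling.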

================
Comment at: lib/Target/X86/X86ScheduleBdVer2.td:130-143
+// Load-Store Units
+//
+
+// FIXME: does this even make sense?
+
+def PdLoad  : ProcResGroup<[PdAGLU01]> {
+  // For Piledriver, the load queue is 40 entries deep.
----------------
To answer your FIXME: I don't think it hurts to have those definitions.
You are essentially limiting the number of store operations to 24 (which is probably what you wanted to achieve here?).

A load would consume PdLoad buffer entries, and it would also consume entries in the PdEX unified scheduler.
Also, a load would be issued to PdAGLU01 (i.e. one of the two AGEN units).

A store behaves pretty much the same. The only difference is the size of the PdStore buffer (which matches the store queue).
It would also consume one of the two AGU pipelines, which means that you allow two stores to execute per cycle.

Repository:
  rL LLVM

https://reviews.llvm.org/D52779