[PATCH] D52779: AMD BdVer2 (Piledriver) Initial Scheduler model

Sat Oct 6 03:31:02 PDT 2018

lebedev.ri added inline comments.

================
Comment at: lib/Target/X86/X86ScheduleBdVer2.td:55-56
+
+// Two AGLU pipes, identical.
+def PdAGLU01 : ProcResource<2>; // AGU, Integer Pipe[23]
+
----------------
andreadb wrote:
> This may not work as you expect.
> The document here: https://www.realworldtech.com/bulldozer/8/
> suggests that the L1D cannot sustain more than one store per cycle.
> 
> It is true that the two AGEN units are identical.
> However, (correct me if I am wrong) I don't think that you can issue two stores per cycle.
> 
> Also, it would be interesting to see if you can actually issue two independent loads per cycle. As the 'realworldtech' document suggests, the extra load port in the L1D is probably used to avoid that the load bandwidth is halved when executing AVX 256-bit loads (which are 2 COPs).
> 
> Ideally, you should check if two independent loads can be issued in the same cycle (i.e. not just the LO/HI parts of a same AVX 256b load).
> Also, It doesn't look like the L1D has two ports for store operations.
> 
> The easier way to workaround this issue is to define separate units for the load/store AGU, and let "writes" in tablegen select which AGEN they effectively consume.
Hmm, good point.
I agree regarding one store per cycle.
Some quotes:
* https://www.agner.org/optimize/microarchitecture.pdf
  * `Memory read instructions use AGLU0 and AGLU1.`
  * `Most execution units are doubled, as table 19.2 and 19.3 show, so that the throughput is
     two 128-bit operations or one 256-bit operation per clock cycle. <...>
     The store unit is not doubled, and 256-bit stores always take more than one clock cycle. `
  * `The data cache has two 128-bit ports which can be used for either read or write.
     This means that it can do two reads or one read and one write in the same clock cycle. `
  * `The measured throughput is two reads or one read and one write per clock cycle
     when only one thread is active.`
* https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
  * `The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache
     with two 128-bit ports. This is a write-through cache that supports up to
     two 128 Byte loads per cycle.`
    * `Only one load can be performed from a given bank of the L1 cache in a single cycle.` 
  * `There is an FPU load-store unit which supports up to two 128-bit loads and one 128-bit store per cycle.`
  *  `The LS unit supports two 128-bit loads/cycle and one 128-bit store/cycle.`

So either one store, or one store and one load, or two loads, at least that is how i read it.

================
Comment at: lib/Target/X86/X86ScheduleBdVer2.td:130-143
+// Load-Store Units
+//
+
+// FIXME: does this even make sense?
+
+def PdLoad  : ProcResGroup<[PdAGLU01]> {
+  // For Piledriver, the load queue is 40 entries deep.
----------------
andreadb wrote:
> To answer to your FIXME: I don't think it hurts to have those definitions.
> You are essentially limiting the number of store operations to 24 (which is probably what you wanted to achieve here?).
> 
> A load would consume PdLoad buffer entries, and it would also consume entries in the PdEX unified scheduler.
> Also, a load would be issued to PdAGLU01 (i.e. one of the two AGEN units).
> 
> A store behaves pretty much the same. The only difference is the size of the PdStore buffer (which matches the store queue).
> It would consume one of the two AGU pipelines; it means, you allow the execution of two stores per cycles.
> You are essentially limiting the number of store operations to 24
> (which is probably what you wanted to achieve here?).
Yes.

> it means, you allow the execution of two stores per cycles.
Yeah, that is a bug.

Repository:
  rL LLVM

https://reviews.llvm.org/D52779