[PATCH] D39350: AMDGPU: Add CPUCoherentL2 feature

Sat Oct 28 10:41:30 PDT 2017

t-tye requested changes to this revision.
t-tye added a comment.
This revision now requires changes to proceed.

In https://reviews.llvm.org/D39350#909638, @jvesely wrote:

> In https://reviews.llvm.org/D39350#909481, @t-tye wrote:
>
> > In https://reviews.llvm.org/D39350#909426, @jvesely wrote:
> >
> > > In https://reviews.llvm.org/D39350#908854, @t-tye wrote:
> > >
> > > > Are we sure using SLC is the way to achieve this? IIRC SLC can be used for streaming, but does not ensure L2 bypass. On an APU the MTYPE=CC specifies the memory policy that supports coherence.
> > >
> > >
> > > The CI and GCN3 ISA specs for SLC say: "System Level Coherent. When set, accesses are forced to miss in level 2 texture cache and are coherent with system memory."
> > >  Has this been changed?
> > >  I only found MTYPE references for image and buffer rsrc; is there a way to set it for flat ops?
> > >  The ISA specs also don't mention what values are allowed in those 3 bits.
> >
> >
> > I do not think using SLC will achieve what you are looking for. The default memory policies for SLC are to enable STREAMING mode which leaves cache lines in the L2 cache and so will not achieve coherence. What you need is for the L2 cache to be kept coherent with the memory fabric which is what the MTYPE and IOMMUv2 can provide on APUs. The runtime can configure the hardware to do this. Buffer instructions use V# that can specify the MTYPE, and there are configuration registers that can be set for each aperture with the MTYPE to use.
> >
> > What runtime are you intending to use to load and execute the code produced?
> >
> > How were you thinking of controlling when to enable this? You would not want to affect the existing generated code, as adding SLC will make code execute more slowly.
>
>
> For my specific use case (system calls) I use HCC, but I'd expect this to generally apply to system scope atomics.
>  The bug has been reported here [0], and I wrote simple atomic tests [1].
>
> Configuring the "System Level Coherent" flag to not guarantee system level coherence is a somewhat counterintuitive decision.
>  I'm not sure I follow the performance or MTYPE argument. Only system scope atomic operations need this, I'd expect them to be slow, and they should be used sparingly.
>  Can you point me to where MTYPE values are set (and documented)? I only found rsrc descriptor setup for scratch memory in ROCR.
>
> [0] https://github.com/RadeonOpenCompute/hcc/issues/410
>  [1] https://github.com/jvesely/hcc-atomic-test

The cache policy is determined by the address_mode the GPU is configured in (GPUVM32, GPUVM64, HSA32 or HSA64), the memory path used (GPUVM, or ATC/IOMMU, which is only available on APUs), the MTYPE of the access (UC: uncached; NC: non-coherent; NC_NV: non-coherent, non-volatile; CC: cache coherent, which is only supported in the HSA* address_mode on the ATC/IOMMU memory path), the instruction kind (vector/scalar load/store/atomic), and the GLC/SLC bits in the instruction.

For instructions that use a buffer descriptor, the memory path and MTYPE are specified in the descriptor. For other instructions (FLAT*, scalar without a V#), the MTYPE is determined by the aperture that the virtual address falls in. Depending on the address_mode there are up to 5 apertures (gpuvm, APE1, shared, private, default). Each aperture is configured with a base/limit virtual address, an MTYPE and a memory path.
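
As a rough illustration of the aperture lookup described above, here is a minimal C++ sketch. It is a conceptual model only: the type names, the `find_aperture` helper, the inclusive treatment of the limit, and the address ranges are all hypothetical and do not correspond to any real driver or hardware interface.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the aperture configuration; not a real driver API.
enum class MType { UC, NC, NC_NV, CC };
enum class MemoryPath { GPUVM, ATC };

struct Aperture {
    std::uint64_t base;   // first virtual address covered
    std::uint64_t limit;  // last virtual address covered (assumed inclusive)
    MType mtype;          // MTYPE applied to accesses in this range
    MemoryPath path;      // memory path used for accesses in this range
};

// Returns the first aperture whose [base, limit] range contains addr,
// or nullptr if no configured aperture matches.
inline const Aperture* find_aperture(const std::vector<Aperture>& apertures,
                                     std::uint64_t addr) {
    for (const Aperture& ap : apertures) {
        if (addr >= ap.base && addr <= ap.limit) {
            return &ap;
        }
    }
    return nullptr;
}

// Example configuration with made-up address ranges.
inline std::vector<Aperture> example_apertures() {
    return {
        {0x1000, 0x1FFF, MType::NC, MemoryPath::GPUVM},
        {0x2000, 0x2FFF, MType::CC, MemoryPath::ATC},
    };
}
```

In this model, a FLAT access to address 0x2010 would pick up MTYPE CC on the ATC path purely because of where the address falls, with no change to the instruction itself.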

To get the L2 to behave coherently, either the UC or CC MTYPEs can be used to ensure fine-grained coherence (at instruction granularity), or the L2 can be written back/invalidated explicitly at dispatch boundaries. The L1 caches also need to be managed explicitly (which is done for the LLVM atomics as mentioned below).

The GLC and SLC bits jointly determine the cache hit policy (MISS [L1 only] or HIT) and the cache retention policy (LRU, EVICT [L1 only] or STREAM [L2 only]) for L1 and L2. For example, the AMDGCN backend sets SLC to implement LLVM's nontemporal attribute on non-atomic memory operations, which causes the L2 STREAM policy to be used (note that STREAM is not the same as bypass).
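
For reference, this is roughly how nontemporal accesses are requested at the source level. The sketch below uses Clang's `__builtin_nontemporal_load` and `__builtin_nontemporal_store`, which attach nontemporal metadata to the generated load and store; the `streaming_copy` helper is a hypothetical name, and the fallback branch is only there so the code also builds with compilers that lack the builtins. The result is the same either way; only the cache retention hint differs.

```cpp
#include <cassert>

#ifndef __has_builtin
#define __has_builtin(x) 0  // compilers without __has_builtin take the fallback path
#endif

// Hypothetical helper: copies n ints, marking the accesses nontemporal where
// the compiler supports it. Nontemporal is a retention hint, not a cache bypass.
inline void streaming_copy(const int* src, int* dst, int n) {
    for (int i = 0; i < n; ++i) {
#if __has_builtin(__builtin_nontemporal_load) && __has_builtin(__builtin_nontemporal_store)
        int v = __builtin_nontemporal_load(&src[i]);
        __builtin_nontemporal_store(v, &dst[i]);
#else
        dst[i] = src[i];  // plain copy when the builtins are unavailable
#endif
    }
}
```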

Making some instructions bypass L2 only gets you the C++ relaxed atomic semantics, as it does not ensure that the other memory operations are made visible as required by the C++ acquire/release/seq_cst memory orderings. The AMDGCN backend implements the LLVM memory model by setting the GLC bit, using the L1 cache invalidate, and inserting waitcnt instructions appropriately (see [0] for more information). It relies on the runtime/driver to manage the L2 cache by setting the address_mode and apertures so the appropriate MTYPE/memory_path is used, or by explicit writeback/invalidate at dispatch boundaries.
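
To make the relaxed versus acquire/release distinction concrete, here is a small host-side C++ sketch of the usual message-passing pattern. It is standard C++ only and says nothing about how the GPU caches are managed; `publish_and_consume` is a hypothetical name chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns the payload value the consumer observes
// after it has seen the flag set by the producer.
inline int publish_and_consume() {
    int payload = 0;            // ordinary, non-atomic data
    std::atomic<int> flag{0};   // synchronization variable
    int observed = -1;

    std::thread producer([&] {
        payload = 42;                              // plain store
        flag.store(1, std::memory_order_release);  // publishes the payload store
    });
    std::thread consumer([&] {
        // The acquire load pairs with the release store above.
        while (flag.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        observed = payload;  // ordered after the flag load
    });

    producer.join();
    consumer.join();
    return observed;
}
```

With memory_order_relaxed on both the store and the load, only the flag itself would be handled atomically and the read of payload would be a data race. That is the situation you end up in if only the atomic instruction is made coherent while the surrounding non-atomic accesses are not made visible.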

The runtime/driver may choose to provide memory allocators that return virtual addresses falling in the different apertures that it has configured with different MTYPEs or memory paths. This allows some allocations to be coherent and others not. There may be a trade-off between coherence and performance. For example, accesses that use an MTYPE that bypasses the L2 may perform worse than those that use the L2.

There are some other details, but hopefully the above is helpful and explains why using SLC will not achieve the goal of making CPU and GPU memory coherent.

[0] https://llvm.org/docs/AMDGPUUsage.html#memory-model

Repository:
  rL LLVM

https://reviews.llvm.org/D39350