[PATCH] D39350: AMDGPU: Add CPUCoherentL2 feature

Jan Vesely via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Oct 28 20:08:46 PDT 2017


jvesely added a comment.

In https://reviews.llvm.org/D39350#910078, @t-tye wrote:

> The cache policy is determined from the address_mode the GPU is configured in (GPUVM32, GPUVM64, HSA32 or HSA64), the memory path used (GPUVM, or ATC/IOMMU, which is only available on APUs), the MTYPE of the access (UC: uncached, NC: non-coherent, NC_NV: non-coherent non-volatile, and CC: cache coherent [only supported in HSA* address_mode on the ATC/IOMMU memory path]), the instruction kind (vector/scalar load/store/atomic), and the GLC/SLC bits in the instruction.
>
> For instructions that use a buffer descriptor, the memory path and MTYPE are specified in the descriptor. For other instructions (FLAT*, scalar without a V#) the MTYPE is determined by the aperture that the virtual address falls in. Depending on the address_mode there are up to 5 apertures (gpuvm, APE1, shared, private, default). Each aperture is configured with a base/limit virtual address, an MTYPE and a memory path.
>
> To get the L2 to behave in a coherent manner, either the UC or CC MTYPEs can be used to ensure fine-grain coherence (at instruction granularity), or the L2 can be written back/invalidated explicitly at dispatch boundaries. The L1 caches also need to be managed explicitly (which is done for the LLVM atomics as mentioned below).
>
> The GLC and SLC bits jointly determine the cache hit policy (MISS [L1 only] / HIT) and the cache retention policy (LRU / EVICT [L1 only] / STREAM [L2 only]) for L1 and L2. For example, the AMDGCN backend sets SLC to implement LLVM's nontemporal attribute on non-atomic memory operations, which causes the L2 STREAM policy to be used (note that STREAM is not the same as bypass).
>
> Making some instructions bypass L2 only gets you the C++ relaxed atomic semantics as it does not ensure that the other memory operations are made visible as required by the C++ acquire/release/seq_cst memory orderings.


Which is exactly what I aim to achieve. The second step after this patch is to modify SIMemoryLegalizer to add the SLC bit to system-scope atomic ops. I don't care about other memory accesses, only that the value of the atomic variable is coherent between CPU and GPU.
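A minimal Python sketch of the two SLC rules discussed in this thread, for illustration only: the existing one (nontemporal, non-atomic accesses get SLC so L2 uses the STREAM policy) and the proposed follow-up (SLC on system-scope atomics). `MemOp`, `set_slc` and the `propose_system_atomic_slc` switch are hypothetical names, not SIMemoryLegalizer code. As the quoted reply points out, STREAM is not the same as bypass, so the second rule on its own is not expected to make L2 coherent with the CPU.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    kind: str                  # "load", "store" or "atomic" (hypothetical classification)
    scope: str = "agent"       # synchronization scope, e.g. "workgroup", "agent", "system"
    nontemporal: bool = False  # LLVM !nontemporal metadata present
    glc: bool = False
    slc: bool = False


def set_slc(op: MemOp, propose_system_atomic_slc: bool = False) -> MemOp:
    # Existing behaviour described in the review: nontemporal non-atomic
    # accesses get SLC=1 so that L2 uses the STREAM retention policy.
    if op.nontemporal and op.kind != "atomic":
        op.slc = True
    # Proposed follow-up (hypothetical, not upstream behaviour):
    # also set SLC=1 on system-scope atomics.
    if propose_system_atomic_slc and op.kind == "atomic" and op.scope == "system":
        op.slc = True
    return op
```

With the proposal disabled, a system-scope atomic keeps SLC=0; enabling it only flips the bit and does not add any of the waits or invalidates that acquire/release ordering needs.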

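To illustrate the quoted description of how FLAT and scalar accesses without a V# get their MTYPE, here is a minimal Python sketch of an aperture lookup. `Aperture`, `resolve_aperture`, the inclusive `limit`, and the fallback to a separate default aperture are assumptions made for illustration; on real hardware the apertures are programmed by the runtime/driver, not looked up in software like this.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Aperture:
    name: str   # e.g. "gpuvm", "APE1", "shared", "private", "default"
    base: int   # first virtual address in the aperture (inclusive, assumed)
    limit: int  # last virtual address in the aperture (inclusive, assumed)
    mtype: str  # "UC", "NC", "NC_NV" or "CC"
    path: str   # memory path: "GPUVM" or "ATC"


def resolve_aperture(apertures: Sequence[Aperture], default: Aperture, va: int) -> Aperture:
    """Return the aperture whose [base, limit] range contains va, else the default one."""
    for ap in apertures:
        if ap.base <= va <= ap.limit:
            return ap
    return default
```

For example, a runtime that wants one allocation to be fine-grain coherent could hand out addresses inside an aperture configured with the UC MTYPE, at the performance cost mentioned in the quoted reply, while other allocations fall through to a cached MTYPE.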
> The AMDGCN backend implements the LLVM memory model by setting the GLC bit, using the L1 cache invalidate and inserting waitcnt instructions appropriately (see [0] for more information). It relies on the runtime/driver to manage the L2 cache, either by setting the address_mode and apertures so the appropriate MTYPE/memory path is used, or by explicit writeback/invalidate at dispatch boundaries.
> 
> The runtime/driver may choose to provide memory allocators that return virtual addresses that fall in the different apertures it has configured to use different MTYPEs or memory paths. This can allow some allocations to be coherent and others not. There may be a trade-off between coherence and performance. For example, accesses that result in using an MTYPE that bypasses the L2 may result in lower performance than those that use the L2.
> 
> There are some other details but hopefully the above is helpful and explains why using SLC will not achieve the goal of making CPU and GPU memory coherent.
> 
> [0] https://llvm.org/docs/AMDGPUUsage.html#memory-model

Thank you, it was helpful.
If I understood correctly, since there are no ISA-level L2 maintenance ops, the only way to achieve any coherence between the CPU and a dGPU is to bypass L2 on all memory accesses.
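The quoted memory-model summary (GLC bit, L1 invalidate, waitcnt) can be illustrated with the instruction sequence for an agent-scope acquire load. The Python sketch below only returns mnemonics as strings; the helper name is hypothetical, and the exact sequence is an assumption for a GFX7/GFX8-style target recalled from the memory-model section referenced as [0], so it should be checked there for any given target. Note that none of the instructions touches L2, which is why CPU coherence depends on the MTYPE/memory path or on dispatch-boundary writeback/invalidate.

```python
from typing import List


def agent_acquire_load(dst: str, addr: str) -> List[str]:
    """Hypothetical helper: mnemonics for an agent-scope load-acquire (assumed GFX7/GFX8)."""
    return [
        f"flat_load_dword {dst}, {addr} glc",  # GLC=1: force an L1 miss so the value comes from L2
        "s_waitcnt vmcnt(0)",                  # wait until the load has returned
        "buffer_wbinvl1_vol",                  # invalidate L1 so later loads cannot hit stale lines
    ]
```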


Repository:
  rL LLVM

https://reviews.llvm.org/D39350