[PATCH] D39350: AMDGPU: Add CPUCoherentL2 feature

Sun Oct 29 08:57:22 PDT 2017

t-tye added a comment.

In https://reviews.llvm.org/D39350#910115, @jvesely wrote:

> In https://reviews.llvm.org/D39350#910078, @t-tye wrote:
>
> > The cache policy is determined from the address_mode the GPU is configured in (GPUVM32, GPUVM64, HSA32 or HSA64), the memory path used (GPUVM, or ATC/IOMMU, which is only available on APUs), the MTYPE of the access (UC: uncached; NC: non-coherent; NC_NV: non-coherent, non-volatile; and CC: cache coherent [only supported in HSA* address_mode on the ATC/IOMMU memory path]), the instruction kind (vector/scalar load/store/atomic), and the GLC/SLC bits in the instruction.
> >
> > For instructions that use a buffer descriptor, the memory path and MTYPE are specified in the descriptor. For other instructions (FLAT*, scalar without a V#) the MTYPE is determined by the aperture that the virtual address falls in. Depending on the address_mode there are up to 5 apertures (gpuvm, APE1, shared, private, default). Each aperture is configured with a base/limit virtual address, an MTYPE and a memory path.
> >
> > To get the L2 to behave in a coherent manner, either the UC or CC MTYPEs can be used to ensure fine grain coherence (at instruction granularity), or the L2 can be written back/invalidated explicitly at dispatch boundaries. The L1 caches also need to be managed explicitly (which is done for the LLVM atomics as mentioned below).
> >
> > The GLC and SLC bits jointly determine the cache hit policy MISS-(L1 only)/HIT and cache retention policy LRU/EVICT-(L1 only)/STREAM-(L2 only) for L1 and L2. For example, the AMDGCN backend sets SLC to implement LLVM's nontemporal attribute on non-atomic memory operations, which causes the L2 STREAM policy to be used (note that STREAM is not the same as bypass).
> >
> > Making some instructions bypass L2 only gets you the C++ relaxed atomic semantics as it does not ensure that the other memory operations are made visible as required by the C++ acquire/release/seq_cst memory orderings.
>
>
> That is exactly what I aim to achieve. The second step after this patch is to modify SIMemoryLegalizer and add the SLC bit to system scope atomic ops. I don't care about other memory accesses, just that the value of the atomic variable is coherent between CPU and GPU.

Setting the SLC bit will not achieve this. It results in the STREAM cache policy, which still hits on lines already in the L2. That is not the same as BYPASS.

In order to implement system coherency you also need to consider the other memory operations and how you will make them coherent. The C++ memory model requires that synchronizing through an atomic will also make all the non-atomic memory operations visible, and from the ISA there is no way to write back or invalidate the L2 cache as there is for the L1 cache.
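
The visibility requirement can be illustrated on the CPU side alone. This is a minimal sketch in standard C++ with hypothetical names (`payload`, `ready`, `run_handoff`). It says nothing about GPU caches; it only shows what the memory model promises once an acquire load observes a release store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Plain (non-atomic) data that is published through the atomic flag below.
static int payload = 0;
static std::atomic<bool> ready{false};

// Producer: the release store orders the earlier plain write before the flag.
void producer() {
  payload = 42;                                  // non-atomic write
  ready.store(true, std::memory_order_release);  // publish
}

// Consumer: once the acquire load sees the flag, the plain write must be
// visible as well. A relaxed load here would not give that guarantee.
int consumer() {
  while (!ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  return payload;
}

// Runs one producer/consumer handoff and returns what the consumer observed.
int run_handoff() {
  payload = 0;
  ready.store(false, std::memory_order_relaxed);
  int observed = -1;
  std::thread c([&] { observed = consumer(); });
  std::thread p(producer);
  p.join();
  c.join();
  assert(observed == 42);
  return observed;
}
```

Making only the `ready` access coherent would cover the flag but not `payload`, which is why the non-atomic operations have to be considered too.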

You can use the nontemporal attribute to cause the current SIMemoryLegalizer to put the SLC bit on non-atomic memory operations. This lets those memory operations go through the caches while minimizing pollution of the L1 and L2.
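
One way to get the nontemporal attribute onto loads and stores from source code is Clang's `__builtin_nontemporal_load` and `__builtin_nontemporal_store`. The sketch below assumes a Clang-based frontend; `streaming_copy` and `HAVE_NONTEMPORAL_BUILTINS` are hypothetical names, and on compilers without the builtins it falls back to plain accesses. Whether SLC actually ends up on the instruction depends on the target and backend version.

```cpp
#include <cstddef>

// Detect the Clang nontemporal builtins without breaking other compilers.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_load) && \
      __has_builtin(__builtin_nontemporal_store)
#    define HAVE_NONTEMPORAL_BUILTINS 1
#  endif
#endif

// Copies n ints from src to dst. Where the builtins exist, each access is
// marked nontemporal; otherwise ordinary loads and stores are used.
void streaming_copy(const int *src, int *dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
#ifdef HAVE_NONTEMPORAL_BUILTINS
    int v = __builtin_nontemporal_load(&src[i]);
    __builtin_nontemporal_store(v, &dst[i]);
#else
    dst[i] = src[i];
#endif
  }
}
```

This only changes the cache retention hint. It does not make the accesses atomic or coherent.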

> Would the GPU issue cache-snooping PCIe transactions, or do I need to handle CPU caches manually?

The CC MTYPE does cause the GPU L2 to listen to snoop requests, so writes by the CPU will invalidate GPU L2 cache lines. It does not cause the GPU to issue snoops, so writes by the GPU will not invalidate CPU cache lines. CC is only available on APUs, not dGPUs.

> 
> 
>> The AMDGCN backend implements the LLVM memory model by setting the GLC bit, using the L1 cache invalidate and inserting waitcnt instructions appropriately (see [0] for more information). It relies on the runtime/driver to manage the L2 cache by setting the address_mode and apertures so the appropriate MTYPE/memory_path is used, or by explicit writeback/invalidate at dispatch boundaries.
>> 
>> The runtime/driver may choose to provide memory allocators that return virtual addresses falling in the different apertures that it has configured to use different MTYPEs or memory paths. This can allow some allocations to be coherent and others not. There may be a trade-off between coherence and performance. For example, accesses that use an MTYPE that bypasses the L2 may perform worse than those that use the L2.
>> 
>> There are some other details but hopefully the above is helpful and explains why using SLC will not achieve the goal of making CPU and GPU memory coherent.
>> 
>> [0] https://llvm.org/docs/AMDGPUUsage.html#memory-model
> 
> Thank you, it was helpful.
>  If I understood correctly, since there are no ISA level L2 maintenance ops, the only way to achieve any coherence between CPU and dGPU is to bypass L2 on all memory accesses, which is expected for any PCIe device.

On an APU the CC MTYPE can be used, which allows reads to still be cached in the L2 while writes write through the L2. On a dGPU the UC MTYPE can be used for the memory allocations that need to be system coherent, resulting in bypassing the L2. Not all memory allocations have to be UC, so allocations that do not require system coherence can still use NC and keep the benefits of the L2.
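
The per-allocation choice described above can be summarized as a small decision function. This is a toy model only: `MType`, `DeviceKind`, `choose_mtype` and `bypasses_l2` are hypothetical names made up for illustration, not part of any driver or runtime API.

```cpp
// MTYPE values named in this thread.
enum class MType { UC, NC, NC_NV, CC };

// Hypothetical device classification for the sketch.
enum class DeviceKind { APU, DiscreteGPU };

// Picks an MTYPE for an allocation, following the rule of thumb above:
// coherent allocations use CC on an APU and UC on a dGPU; all other
// allocations stay NC so they keep the benefit of the L2.
inline MType choose_mtype(DeviceKind device, bool needs_system_coherence) {
  if (!needs_system_coherence) {
    return MType::NC;
  }
  return device == DeviceKind::APU ? MType::CC : MType::UC;
}

// In this model only UC bypasses the L2; CC still caches reads in the L2.
inline bool bypasses_l2(MType m) { return m == MType::UC; }
```

For example, `choose_mtype(DeviceKind::DiscreteGPU, true)` yields `MType::UC`, which is the case where the L2 is bypassed.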

On GFX9 the MTYPE is now managed in the page table entries. This is more flexible as it eliminates the need for the fixed aperture configuration.

Repository:
  rL LLVM

https://reviews.llvm.org/D39350