[PATCH] D99128: [AMDGPU] Removed unnecessary cache invalidations.

Wed Mar 24 08:02:18 PDT 2021

s-perron added a comment.

I've made some changes.  I will look into some of the improvements that have been suggested.

================
Comment at: llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp:725
+  case SMEM_ACCESS:
+  case GDS_ACCESS:
+    setPotentiallyDirtyCacheAtLevel(MEM_CACHE_LVL_0);
----------------
t-tye wrote:
> Is GDS needed here? Seems this is only dealing with VMEM. GDS is like LDS and not part of VMEM.
> 
> Should SMEM writes be considered? Some targets support that.
> Is GDS needed here? 

Does a load from GDS go through the L1 cache?  I was under the impression that it does, but I'm not all that familiar with the architecture.  If it does not then it is not needed.

> Should SMEM writes be considered? 

The SMEM_ACCESS event covers both reads and writes to SMEM, so this should already be covered.  See the comment by the definition of the enum.

================
Comment at: llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp:726-727
+  case GDS_ACCESS:
+    setPotentiallyDirtyCacheAtLevel(MEM_CACHE_LVL_0);
+    setPotentiallyDirtyCacheAtLevel(MEM_CACHE_LVL_1);
+    break;
----------------
t-tye wrote:
> Should the cache bypass of the instructions be considered? When the cache is bypassed the data is not left in the cache and so does not cause the need for an invalidate. But the cache bypass can specify which caches it is bypassing. For example, performing a series of relaxed atomics at system scope would require no invalidates.
> 
> Should the impact of writes be considered in deciding on the need for a cache writeback? Again there can be cache bypass, and some caches are readonly or write-through.
> Should the cache bypass of the instructions be considered? 

I did not know about the cache bypass.  I'll see what I can do about that.

> some caches are readonly or write-through?

That could be useful.  Is this something that can be easily queried?  If determining the behaviour of the cache would require a lot of new code, I would prefer to make that change in a subsequent patch.  The analysis would still be correct, even if it is a bit conservative.  The same for the cache bypass.

================
Comment at: llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp:732
+  case EXP_PARAM_ACCESS:
+    setPotentiallyDirtyCacheAtLevel(MEM_CACHE_LVL_0);
+    break;
----------------
t-tye wrote:
> LDS has no cache control so should if be considered?
> 
> Why only level 0 as loading into the L0 can also cause data to be loaded into the other caches.
I guess I misunderstood how the arch is designed.  Would a read from LDS load data into the L1 cache?  I was told "Another subtlety is that on gfx10 in WGP mode, the L0 cache needs to be invalidated even for LDS."  I'm not being precise enough here, I should be checking which mode we are in and possibly which GFX version.

I also want to ask about the export instructions.  I made a conservative guess that they would require the L0 cache invalidation, but I am not sure how the output buffer is implemented.

================
Comment at: llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp:938
+      instructionInvalidatesL1Cache(MI.getOpcode())) {
     Wait.VmCnt = 0;
   }
----------------
t-tye wrote:
> Is this correct? These instructions do not ensure the waicnt is 0. This is also missing GFX90A cache instructions.
I do not know.  I just tried keep this functionally the same.  I could revert my change to this, and then it could be fixed or removed with a different patch.  I don't want to fix completely unrelated fixes with this patch.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99128/new/

https://reviews.llvm.org/D99128