[PATCH] D84354: [AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic

Tue Jul 28 16:21:26 PDT 2020

rampitec requested changes to this revision.
rampitec added a comment.
This revision now requires changes to proceed.

This is fine in principle, but will create too big clusters for sub-dword loads. In case of byte loads you could cluster 32 instructions and that will for sure stall because you would overflow a 4-bit counter. Moreover, a sub-dword load will likely occupy a whole VGPR anyway, so in that scenario you will even consume 32 registers, while you are looking into consuming not more than 8.

In addition to the byte size you also need to limit a number of loads. Our previous experience showed visible benefits up to 5-6 loads. Also note the case of 16 bit loads. Assume you have limited num-loads to 6. Now you would likely need 12 registers to hold it which is still more than 8 you had in mind. How about this:

  LoadSize = NumBytes / NumLoads;
  NumDWORDs = (LoadSize + 3) / 4;
  return NumDWORDs <= 8;

You would still have the same logic for wide loads but limit it to 8 loads for sub-dword case.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D84354/new/

https://reviews.llvm.org/D84354