[PATCH] D84354: [AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic
Stanislav Mekhanoshin via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 28 16:21:26 PDT 2020
rampitec requested changes to this revision.
rampitec added a comment.
This revision now requires changes to proceed.
This is fine in principle, but will create too big clusters for sub-dword loads. In case of byte loads you could cluster 32 instructions and that will for sure stall because you would overflow a 4-bit counter. Moreover, a sub-dword load will likely occupy a whole VGPR anyway, so in that scenario you will even consume 32 registers, while you are looking into consuming not more than 8.
In addition to the byte size you also need to limit a number of loads. Our previous experience showed visible benefits up to 5-6 loads. Also note the case of 16 bit loads. Assume you have limited num-loads to 6. Now you would likely need 12 registers to hold it which is still more than 8 you had in mind. How about this:
LoadSize = NumBytes / NumLoads;
NumDWORDs = (LoadSize + 3) / 4;
return NumDWORDs <= 8;
You would still have the same logic for wide loads but limit it to 8 loads for sub-dword case.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D84354/new/
https://reviews.llvm.org/D84354
More information about the llvm-commits
mailing list