[PATCH] D84354: [AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic

Wed Jul 29 11:23:35 PDT 2020

rampitec added a comment.

In D84354#2181238 <https://reviews.llvm.org/D84354#2181238>, @hsmhsm wrote:

> In D84354#2180481 <https://reviews.llvm.org/D84354#2180481>, @rampitec wrote:
>
>> This is fine in principle, but will create too big clusters for sub-dword loads. In case of byte loads you could cluster 32 instructions and that will for sure stall because you would overflow a 4-bit counter. Moreover, a sub-dword load will likely occupy a whole VGPR anyway, so in that scenario you will even consume 32 registers, while you are looking into consuming not more than 8.
>>
>> In addition to the byte size you also need to limit a number of loads. Our previous experience showed visible benefits up to 5-6 loads. Also note the case of 16 bit loads. Assume you have limited num-loads to 6. Now you would likely need 12 registers to hold it which is still more than 8 you had in mind. How about this:
>>
>>   LoadSize = NumBytes / NumLoads;
>>   NumDWORDs = (LoadSize + 3) / 4;
>>   return NumDWORDs <= 8;
>>
>> You would still have the same logic for wide loads but limit it to 8 loads for sub-dword case.
>
> Thanks for the further insights here, yes, I agree with you. However, I think, the proposed above heuristic still have some problems. Let's take below example. Let's assume there are N load operations, each loads 16 bytes, are put into clustering process. Then below is the scenario:
>
> | NumLoads | NumBytes | LoadSize | NumDWORDs | NumDWORDs <= 8 | Decision |
> | 2        | 32       | 16       | 4         | 4 <= 8         | cluster  |
> | 3        | 48       | 16       | 4         | 4 <= 8         | cluster  |
> | 4        | 64       | 16       | 4         | 4 <= 8         | cluster  |
> | N        | 64 * N   | 16       | 4         | 4 <= 8         | cluster  |
> |
>
> So, we land-up clustering all the N loads. Is it that you meant `NumDWORDs = ((LoadSize + 3) / 4) * NumLoads` instead of   `NumDWORDs = (LoadSize + 3) / 4`?

Yep, you are right! My NumDWORDs is in a single load, it should be multiplied by the NumLoads.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D84354/new/

https://reviews.llvm.org/D84354