[PATCH] D81085: [AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size

Fri Jun 5 02:09:27 PDT 2020

hsmhsm marked an inline comment as done.
hsmhsm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:472-473
+         "Invalid NumLoads/NumBytes values");
+  unsigned AverageBytesPerLoad = NumBytes / NumLoads;
+  unsigned MaxNumLoads = AverageBytesPerLoad <= 4 ? 5 : 4;
+  return NumLoads <= MaxNumLoads;
----------------
foad wrote:
> I would suggest avoiding this division (because division is slow and introduces rounding). And the magic numbers need some kind of explanation. How did you come up with them?
> 
> So how about reformatting this as:
> ```
> if (NumBytes <= 4 * NumLoads) {
>   // Loads are dword or smaller (on average).
>   MaxNumLoads = 5;
> } else {
>   // Loads are bigger than a dword (on average).
>   MaxNumLoads = 4;
> }
> ```
> ... plus some explanation of where the numbers 5 and 4 came from, and why they should be different cases.
Your suggestions make sense to me. And regarding those magic number 4 and 5, there is no analytical thinking here.  It is purely based on experimentation to achieve below three goals.

(1)  LLVM lit regressions should be as minimal as possible
(2)  OpenCL shoc benchmark should not show any performance degradation (a kind of representative benchmark for `compute`)
(3)  The performance issue in question (rocSPARSE benchmark) should show  improvements.

After experimentation with different magic numbers, and different heuristics, this is what I could settle down with.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81085/new/

https://reviews.llvm.org/D81085