[PATCH] D122850: [AMDGPU] Fix regression with vectorization limiting

Thu Mar 31 15:12:24 PDT 2022

rampitec added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp:313
+  // interleaving loops, so we lie to avoid trying to use all registers.
+  return std::min(NumRegs, 4u);
 }
----------------
rampitec wrote:
> arsenm wrote:
> > 4 seems really small
> It is enough to allow vectorization, which is all we really need. Giving more immediately explodes RP (register pressure) because of the interleaving. It may be possible to increase this, but then interleaving would have to be limited much more.
Here is the loop that triggered the investigation:
```
          for (int i = rowStart; i < rowEnd; i++) {
            gq += temp[i];
          }
```
gq/temp are float. Without loop-vectorize the whole kernel uses 9 VGPRs; with the vectorizer as it is now, it uses 78. With this change it goes down to 38, which is still higher than wanted. If I allow 8 registers, the final budget is 78 VGPRs again, and to bring it back down to 38 I have to disable interleaving. Even an interleave factor of 2 plus 8 registers reported here results in 46 VGPRs.
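
To make the interaction concrete, here is a rough hand-written model of what vectorizing plus interleaving does to that reduction. It is only a sketch, not what LoopVectorize actually emits: the names `sumScalar` and `sumInterleaved` and the template parameters `VF` (vector width) and `IC` (interleave count) are made up for illustration. The point is that the transformed loop keeps VF * IC partial sums live across every iteration, plus the same number of freshly loaded values, where the scalar loop keeps one of each.
```
#include <array>
#include <vector>

// Scalar form of the loop above: one live accumulator.
float sumScalar(const std::vector<float> &temp, int rowStart, int rowEnd) {
  float gq = 0.0f;
  for (int i = rowStart; i < rowEnd; i++)
    gq += temp[i];
  return gq;
}

// Hypothetical model of vector width VF with interleave count IC.
// Every step of the main loop touches VF * IC independent partial sums,
// so all of them stay live for the whole loop.
template <int VF, int IC>
float sumInterleaved(const std::vector<float> &temp, int rowStart, int rowEnd) {
  constexpr int Step = VF * IC;
  std::array<float, Step> acc{}; // zero-initialized partial sums

  int i = rowStart;
  for (; i + Step <= rowEnd; i += Step)
    for (int lane = 0; lane < Step; ++lane)
      acc[lane] += temp[i + lane];

  // Horizontal reduction of the partial sums after the loop.
  float gq = 0.0f;
  for (float a : acc)
    gq += a;

  // Scalar remainder for the elements that did not fill a full step.
  for (; i < rowEnd; i++)
    gq += temp[i];
  return gq;
}
```
Note the model reorders the float additions, so in general its result can differ in rounding from the scalar loop. The VGPR numbers quoted above come from the real compiler and are not derived from this model.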

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D122850/new/

https://reviews.llvm.org/D122850