[llvm-dev] Dynamically determine the CostPerUse value in the register allocator.

Fri May 29 07:15:48 PDT 2020

[AMD Official Use Only - Internal Distribution Only]

Hi All,

For the AMDGPU architecture, during RA, we prefer to have a cost associated with the registers (CostPerUse) based on a target entity (for instance, the Calling Convention of the current MachineFunction).
Presently, CostPerUse is a static value (either zero or positive) that TableGen generates once per register.
The current implementation doesn't allow us to control the reg-cost on the fly.

The AMDGPU ABI has recently been revised to introduce more caller-saved VGPRs (the exact details are explained towards the end of this e-mail), and we found that a dynamic register cost is important for achieving an optimal allocation.
Specifically, it is important to keep the number of VGPRs allocated for a kernel/device-function as small as possible, since it has a direct impact on occupancy. Occupancy is the number of wavefronts that can be launched at runtime for a kernel program.

Some initial thoughts on how to fix it:

  1.  Have a target interface (a switch) to enable/discard the CostPerUse value.
  2.  Get the register cost in the same way we define various calling conventions (*CallingConv.td).
  3.  Compute the CostPerUse in the way the AllocationOrder for the registers is determined during RA.

The first one is the easiest method, and it solves the immediate problem we are facing.
However, the other two options are better if we want to associate different reg-cost values with different calling conventions (I presume that need will arise at some point).
There may well be a better way to fix this than these options. Any suggestions in this regard would be helpful.
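
To make options 1 and 2 a little more concrete, here is a minimal standalone sketch of what such a hook might look like. None of these names (RegCostPolicy, FunctionInfo, useCostPerUse, getCostPerUse) exist in LLVM; they are purely illustrative, and the static cost table stands in for the values TableGen emits today.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// All names below are hypothetical; this is not existing LLVM API.
enum class CallConv { Kernel, DeviceFunction };

struct FunctionInfo {
  CallConv CC;   // calling convention of the current function
  bool HasCalls; // true if the function calls a device function
};

class RegCostPolicy {
public:
  // StaticCosts stands in for the per-register table generated by TableGen.
  explicit RegCostPolicy(std::vector<uint8_t> StaticCosts)
      : StaticCosts(std::move(StaticCosts)) {}

  // Option 1: a per-function switch. A kernel with no device-function
  // calls gets no register cost; everything else keeps it.
  bool useCostPerUse(const FunctionInfo &FI) const {
    return FI.CC == CallConv::DeviceFunction || FI.HasCalls;
  }

  // Cost the allocator would query instead of reading the static table
  // directly. Option 2 could select a different table per calling
  // convention here instead of a single shared one.
  unsigned getCostPerUse(unsigned Reg, const FunctionInfo &FI) const {
    if (!useCostPerUse(FI))
      return 0;
    return Reg < StaticCosts.size() ? StaticCosts[Reg] : 0;
  }

private:
  std::vector<uint8_t> StaticCosts;
};
```

The only point of the sketch is that the allocator asks the target for the cost per function, rather than reading a value fixed at build time.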

AMDGPU ABI changes and the motivation for this discussion:

Before the new ABI change:
Apart from the initial reserved 32 argument registers, all VGPRs are callee-saved registers (VGPR32 - VGPR255).
With the new ABI:
We split VGPR32 - VGPR255 into an equal number of callee-saved and caller-saved registers.
For the same occupancy reason, the two sets are interleaved in blocks of 8 registers.
VGPR32-VGPR39 (Caller-saved)
VGPR40-VGPR47 (Callee-saved)
VGPR48-VGPR55 (Caller-saved)
              -
              -
VGPR248-VGPR255 (Callee-saved)
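
The interleaving above can be expressed as a small classification function. This is a standalone sketch with hypothetical names (SaveKind, classifyVGPR), not the actual AMDGPU backend code; the first 32 registers are labelled as argument registers, following the description above.

```cpp
// Hypothetical helper, not part of LLVM: classifies a VGPR index under
// the interleaved scheme described in this e-mail.
enum class SaveKind { Argument, CallerSaved, CalleeSaved };

const unsigned FirstSplitVGPR = 32; // VGPR0 - VGPR31 are argument registers
const unsigned BlockSize = 8;       // split boundary between the two sets

SaveKind classifyVGPR(unsigned Idx) {
  if (Idx < FirstSplitVGPR)
    return SaveKind::Argument;
  // Blocks of 8 alternate, starting with a caller-saved block at VGPR32.
  unsigned Block = (Idx - FirstSplitVGPR) / BlockSize;
  return (Block % 2 == 0) ? SaveKind::CallerSaved : SaveKind::CalleeSaved;
}
```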

With the new ABI, the allocator's preference for callee-saved vs caller-saved depends on the input program.
In certain cases RA may end up allocating more caller-saved registers than callee-saved ones. The opposite is possible too (more callee-saved registers).
In either case, unallocated registers are left behind in the lower blocks, which pushes the final VGPR count considerably higher and hurts occupancy.
To override the default allocation preferences of RA, we tried to set a cost for all VGPRs such that the higher indices will have higher cost.
This eliminated the problem by allocating all the lower registers before picking a higher one, at the expense of some spills in certain cases, which is acceptable.
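
As a toy illustration of why an index-based cost lowers the final VGPR count, the sketch below picks N registers from a preference order, optionally re-ranking them by cost first. It is a deliberately simplified model, not how the LLVM allocator actually uses CostPerUse, and indexCost and vgprCount are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical cost: zero for the argument registers, then strictly
// increasing with the register index.
unsigned indexCost(unsigned Idx) { return Idx < 32 ? 0 : Idx - 31; }

// Takes the first N registers from Order (the allocator's preference
// order) and returns the resulting VGPR count, i.e. highest index + 1.
// With UseIndexCost set, cheaper (lower) registers are preferred first.
unsigned vgprCount(std::vector<unsigned> Order, unsigned N,
                   bool UseIndexCost) {
  if (UseIndexCost)
    std::stable_sort(Order.begin(), Order.end(),
                     [](unsigned A, unsigned B) {
                       return indexCost(A) < indexCost(B);
                     });
  unsigned Count = 0;
  for (unsigned I = 0; I < N && I < Order.size(); ++I)
    Count = std::max(Count, Order[I] + 1);
  return Count;
}
```

For example, if the preference order lists the callee-saved block VGPR40-VGPR47 ahead of VGPR32-VGPR39 and four registers are needed, this model gives a count of 44 without the cost and 36 with it.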

But for kernels with no device-function calls, the register cost is unnecessary, because no ABI applies to such kernel programs.
For those kernels, the register cost only caused a performance penalty.
That is exactly why we need a way to decide dynamically whether or not to apply a reg-cost.

Regards,
Christudasan