[PATCH] D125102: [RegAllocGreedy] New hook regClassPriorityTrumpsGlobalness

Jay Foad via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu May 12 08:07:17 PDT 2022


foad added inline comments.
Herald added a subscriber: kosarev.


================
Comment at: llvm/lib/CodeGen/RegAllocGreedy.cpp:291
   unsigned Prio;
 
   auto Stage = ExtraInfo->getOrInitStage(Reg);
----------------
bjope wrote:
> foad wrote:
> > bjope wrote:
> > > Just for info:
> > > 
> > > Downstream we are doing
> > > ```
> > >  if (<our downstream target>) {
> > >     // Let AllocationPriority affect all ranges.
> > >     const TargetRegisterClass &RC = *MRI->getRegClass(Reg);
> > >     Size = Size * (RC.AllocationPriority + 1);
> > >   }
> > > ```
> > > here (I think Quentin has suggested something like that in the past).
> > > 
> > > Anyway, I tried replacing that old hack with this patch, but got some mixed results. Same thing when using both patches together.
> > > 
> > > Last time I tried to do something in this area I had a hard time finding a heuristic that gave generally better results without some occasional larger regression. Kind of annoying, since your patch also seems to indicate that there is a potential gain here for our target in several benchmarks (but regressions of almost 20% in a couple of our benchmarks can't be ignored completely). That might of course be due to other shortcomings in our backend (such as the AllocationPriority setup etc.). I guess I need to investigate that a bit closer before we consider using this new option for our target.
> > Thanks for trying it. Does your downstream target also have a concept of occupancy, so that using fewer physical registers means that things run significantly faster on the hardware? I'm aware that the register allocator was developed for CPUs, and CPUs generally do not have that property.
> No I don't think we have that (if I understand the question correctly).
> 
> But we have a limited set of registers (typically 16-32 regs). And, similar to what you mentioned about AMDGPU, we do have some wide register classes that consist of tuples and quads. Certain quads might be costly to spill/reload, and we do not want that to happen inside a loop, for example. So generally I think we want globalness to trump the allocation priority, but sometimes it is bad to allocate a long-range quad early, since then we have to spill it (mostly guessing here).
I have tried your `Size = Size * (RC.AllocationPriority + 1);` heuristic, but it doesn't help in the cases I am interested in, because I really need a wide local range to have higher priority than a narrow global range.
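
To illustrate why, here is a toy model (not the real RegAllocGreedy code; the bit positions, names and numbers are made up for illustration): as long as global ranges carry a large fixed priority bonus, scaling the size by the allocation priority doesn't change which range comes first, because the global bonus dominates.
```
// Toy model only; the bit layout and numbers are made up for illustration
// and do not match the real RegAllocGreedy priority calculation.
#include <cstdint>
#include <cstdio>

static uint32_t toyPriority(bool IsGlobal, uint32_t SizeInInstrs,
                            uint32_t AllocPrio, bool ScaleSizeByAllocPrio) {
  uint32_t Size = SizeInInstrs;
  if (ScaleSizeByAllocPrio)
    Size *= (AllocPrio + 1); // the downstream scaling heuristic
  if (!IsGlobal)
    return Size; // local ranges: purely size-based
  // Global ranges get a large fixed bonus plus the class priority.
  return (1u << 29) | (AllocPrio << 24) | Size;
}

int main() {
  // A wide local range of a high-priority class vs. a narrow global range
  // of an ordinary class.
  uint32_t WideLocal = toyPriority(false, 400, 7, true);
  uint32_t NarrowGlobal = toyPriority(true, 20, 0, true);
  std::printf("wide local:    0x%08x\n", WideLocal);
  std::printf("narrow global: 0x%08x\n", NarrowGlobal);
  std::printf("narrow global still wins: %s\n",
              NarrowGlobal > WideLocal ? "yes" : "no");
  return 0;
}
```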

Just an idea: we could split the actual calculation of the Prio metric out into a separate function like this: https://reviews.llvm.org/differential/diff/428948/
... and then make it a TargetRegisterInfo hook so that targets could tweak the priority however they like?
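
Roughly what I mean, as a standalone sketch (all names here are placeholders, not the actual TargetRegisterInfo interface or the code in the linked diff): the allocator would ask the target for the priority, and a target override could place the register class priority above the globalness bit.
```
// Standalone sketch of the hook idea; all names are placeholders and do not
// match actual LLVM interfaces.
#include <cstdint>
#include <cstdio>

struct ToyLiveRange {
  bool IsGlobal;
  uint32_t SizeInInstrs;
  uint32_t AllocPrio; // register class AllocationPriority
};

// Stand-in for TargetRegisterInfo: the base implementation keeps the
// current behaviour, where globalness dominates.
struct ToyRegisterInfo {
  virtual ~ToyRegisterInfo() = default;
  virtual uint32_t getAllocationOrderPriority(const ToyLiveRange &LR) const {
    uint32_t Prio = LR.SizeInInstrs;
    if (LR.IsGlobal)
      Prio |= (1u << 29) | (LR.AllocPrio << 24);
    return Prio;
  }
};

// A target override that places the class priority above the globalness
// bit, so a wide local range of a high-priority class outranks a narrow
// global range of an ordinary class.
struct ToyTargetRegisterInfo : ToyRegisterInfo {
  uint32_t getAllocationOrderPriority(const ToyLiveRange &LR) const override {
    return (LR.AllocPrio << 26) | (uint32_t(LR.IsGlobal) << 25) |
           LR.SizeInInstrs;
  }
};

int main() {
  ToyLiveRange WideLocal{false, 400, 7};
  ToyLiveRange NarrowGlobal{true, 20, 0};
  ToyRegisterInfo Base;
  ToyTargetRegisterInfo Target;
  std::printf("base:     local 0x%08x  global 0x%08x\n",
              Base.getAllocationOrderPriority(WideLocal),
              Base.getAllocationOrderPriority(NarrowGlobal));
  std::printf("override: local 0x%08x  global 0x%08x\n",
              Target.getAllocationOrderPriority(WideLocal),
              Target.getAllocationOrderPriority(NarrowGlobal));
  return 0;
}
```
With the base calculation the narrow global range wins; with the override the wide local range does, which is the ordering I'm after.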

In the meantime I would like to proceed with the current patch.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125102/new/

https://reviews.llvm.org/D125102
