[PATCH] D30759: With PIE on x86_64, keep hot local arrays on the stack

Mon Apr 24 15:00:47 PDT 2017

tmsriram added a comment.

In https://reviews.llvm.org/D30759#696138, @efriedma wrote:

> >> In theory, the C/C++ standards require behavior equivalent to -fno-merge-all-constants.  In practice, code doesn't actually depend on that, so we made the decision many years ago to turn on -fmerge-all-constants by default.
> > 
> > Understood.  Does it seem reasonable/useful to fix this along the
> >  lines of GCC, -fmerge-constants and -fmerge-all-constants where the
> >  latter applies to const arrays and a warning that this is happening
> >  when the latter option is used?
>
> -fmerge-all-constants has exactly the same meaning in clang and gcc.  And it's generally beneficial for both codesize and performance, so turning it off to pursue performance is a bad idea.
>
> I would suggest finding some other approach to solve your issue later in the optimization pipeline, preferably in a manner which is sensitive to register pressure.  Maybe put the code in ConstantHoisting?  You don't lose any useful information by promoting the alloca to a global constant; you can easily recreate it

Digging a little more, here is what I found.

- As Eli pointed out, this optimization is needed only when there is register pressure; otherwise the address computation can be hoisted out of the loop.
- To recap, a global array access in PIE mode needs two instructions on x86_64: an address computation using lea, and the actual element access.  If the array access is inside a hot loop, we have seen performance drop by a few percent compared to non-PIE code, due to the increased dynamic instruction count from the address computation.
- Machine LICM does hoist the address computation of the array out of the loop, but register allocation will sink it back next to the use via rematerialization if the register pressure is high.
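
To make the recap concrete, here is a minimal sketch of the kind of code affected.  The function name and table contents are made up for illustration, and the instruction sequences in the comments are only roughly what I would expect, not output copied from a real build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical hot function, for illustration only.
std::uint32_t sumTable(const std::uint8_t *data, std::size_t n) {
  assert(data != nullptr || n == 0);

  // A local const array with a constant initializer.  With
  // -fmerge-all-constants (the clang default), this can be promoted
  // to a private global constant instead of being built on the stack.
  const std::uint32_t table[8] = {1, 2, 4, 8, 16, 32, 64, 128};

  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    // Non-PIE x86_64, roughly one instruction per access:
    //     movl  table(,%rcx,4), %edx
    // PIE x86_64, roughly two when the lea is not kept hoisted,
    // because a RIP-relative address cannot take an index register:
    //     leaq  table(%rip), %rax
    //     movl  (%rax,%rcx,4), %edx
    // With the array on the stack, the base is %rsp/%rbp and no
    // separate address computation is needed.
    sum += table[data[i] & 7];
  }
  return sum;
}
```
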

I can think of two different ways to solve this problem:

a) Implement this in a late optimization pass on LLVM IR, and use a heuristic to estimate register pressure.

- When compiling for PIE, the optimization would move a global array to the stack if the register pressure estimate of the function where it is used is high.
- Use a heuristic to compute the register usage.  This is already done, for instance, in loop vectorization, in the function "LoopVectorizationCostModel::calculateRegisterUsage".
- The heuristic would use the number of overlapping live ranges as the estimate for register usage.
- If implemented in a late pass, just before code generation, the estimates would tend to be closer to the actual register usage.
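
As a rough sketch of the heuristic in a), the estimate could be the maximum number of live ranges that overlap at any point.  Everything below (LiveRange, estimateMaxPressure, shouldKeepArrayOnStack, the register budget parameter) is a hypothetical stand-in written as standalone C++, not existing LLVM API and not how calculateRegisterUsage is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical live range over instruction indices, half-open: [begin, end).
struct LiveRange {
  unsigned begin;
  unsigned end;
};

// Returns the maximum number of simultaneously live ranges.
unsigned estimateMaxPressure(const std::vector<LiveRange> &ranges) {
  std::vector<std::pair<unsigned, int>> events;
  events.reserve(ranges.size() * 2);
  for (const LiveRange &r : ranges) {
    assert(r.begin <= r.end);
    events.emplace_back(r.begin, +1); // value becomes live
    events.emplace_back(r.end, -1);   // value dies
  }
  // Pairs sort by position, then by delta, so at the same position a
  // death (-1) is processed before a birth (+1).  A range ending at p
  // therefore does not count as overlapping one that starts at p.
  std::sort(events.begin(), events.end());

  int live = 0;
  int maxLive = 0;
  for (const auto &e : events) {
    live += e.second;
    maxLive = std::max(maxLive, live);
  }
  return static_cast<unsigned>(maxLive);
}

// Assumed policy: keep the array on the stack only when the estimate
// exceeds the number of registers we assume are available.
bool shouldKeepArrayOnStack(const std::vector<LiveRange> &ranges,
                            unsigned numAllocatableRegs) {
  return estimateMaxPressure(ranges) > numAllocatableRegs;
}
```

A real pass would derive the ranges from the IR values in the function (or just the hot loop) and take the register budget from the target; both are left out of this sketch.
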

b) Teach rematerialization, during greedy register allocation, to move global arrays to the stack instead of recomputing the address every time it is used.

- This would be done in machine IR, during register allocation, when it is known that this is about to be rematerialized.
- However, this optimization seems heavyweight for machine-level IR; I am not sure about the complexity or feasibility of doing it there.
- The only reason to justify doing this here is the absence of a register usage estimation function in LLVM IR.

It looks like a) would be the way to go here.  The need for a generic register usage estimation function has already been discussed by Wei to drive other optimizations.  What do you think?

https://reviews.llvm.org/D30759