[PATCH] D15393: [X86] Order the local stack symbols to improve code size and locality.

Zia Ansari via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 9 16:10:13 PST 2015


zansari added a comment.

In http://reviews.llvm.org/D15393#306456, @rnk wrote:

> Oops, I hit submit too early.
>
> Our current default stack layout algorithm optimizes for stack frame size without accounting for code size from frame offsets. I'm worried that your algorithm may reorder all of the potentially large allocations outside of the 256 bytes of stack that we can address with a smaller offset, and unnecessarily increase stack frame sizes.
>
> I wonder if we can rephrase this as a weighted bin packing problem, where we only have one bin and it has size ~128 bytes, or the max one byte displacement from FP/SP. The object weights would be the use counts, and the goal is to put as many uses into the bin as possible. There's probably a good approximate dynamic programming algorithm that we could use for that.


Hi Reid. Thanks for the review.

I'm not quite sure I understand what you mean in your first paragraph regarding stack frame size. Perhaps I'm missing something simple, but I don't see what significance large allocations have, and I also don't see the significance of 256 bytes.

With respect to stack frame size, I do see the potential for increasing it, in some cases, based on how objects with larger "alignment" are ordered with respect to each other, but I'm missing how "large allocations outside of 256 bytes" comes into play. I did try to include alignment in the heuristics to help this out a little bit (it's not perfect, of course, but I felt that doing better would require a fair jump in complexity). In general, I didn't see significant increases in stack frame size, at least in the benchmarks I was looking at.
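To illustrate the alignment point, here is a minimal sketch (a hypothetical helper, not part of the actual pass) of how reordering objects with mixed alignments changes the padding, and hence the frame size:

```python
def frame_size(objs):
    """objs: list of (size, align) tuples in allocation order.
    Returns total stack bytes consumed, counting alignment padding."""
    off = 0
    for size, align in objs:
        off = (off + align - 1) // align * align  # pad up to the alignment
        off += size
    return off

# Interleaving a 16-byte-aligned object between two 4-byte ones adds padding:
# [(4, 4), (16, 16), (4, 4)] needs 36 bytes,
# while [(16, 16), (4, 4), (4, 4)] packs into 24 bytes.
```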

Also (and perhaps this applies to your second comment), one of my main goals in this pass was to make it cheap and simple while getting as much benefit out of it as possible over what we already have. I initially toyed around with a bunch of additional "smarter" heuristics that required more complexity and iterations, but the tiny extra savings they gave me in a few cases weren't worth the extra compile time, in my opinion. I found that this simple, single-pass algorithm caught the bulk of the code-saving opportunities (which were fairly significant, in some cases).

Do you feel this is worth additional complexity to squeeze a little more out of?
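For reference, the weighted bin-packing idea from the quoted comment could be sketched as a classic 0/1 knapsack DP: one bin of ~128 bytes (the one-byte-displacement window from FP/SP), object weights are their sizes, and values are their use counts. This is an illustrative sketch with hypothetical names, not code from the patch:

```python
def pack_hot_objects(objects, bin_size=128):
    """objects: list of (size, use_count) stack objects.
    Returns the max total use count that fits in the short-displacement
    window, via a standard 0/1 knapsack dynamic program."""
    best = [0] * (bin_size + 1)  # best[c] = max uses with c bytes of capacity
    for size, uses in objects:
        # Iterate capacity downward so each object is placed at most once.
        for cap in range(bin_size, size - 1, -1):
            best[cap] = max(best[cap], best[cap - size] + uses)
    return best[bin_size]

# Three 64-byte objects with use counts 10, 8, 7: only two fit in 128
# bytes, so the DP keeps the two hottest (10 + 8 = 18 uses).
```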


http://reviews.llvm.org/D15393


