[llvm-commits] [PATCH] Stack Coloring optimization

Thu Aug 30 07:24:56 PDT 2012

On Thu, Aug 30, 2012 at 10:19 AM, Daniel Berlin <dberlin at dberlin.org> wrote:
> On Thu, Aug 30, 2012 at 5:51 AM, Nadav Rotem <nrotem at apple.com> wrote:
>> Hi All,
>>
>> I've been working on a new optimization for reducing the stack size.  Currently, when we declare allocas in LLVM IR, these allocas are directly translated to stack slots. And when we inline small functions into larger functions, these allocas add up and take up lots of space.  In some cases we know that the use of the allocas is bounded by disjoint regions.  In this optimization we merge multiple disjoint slots into a single slot.  LLVM uses the lifetime markers for specifying the regions in which the alloca is used.  This patch propagates the lifetime markers through SelectionDAG and makes them pseudo ops.  Later, a pre-register-allocator pass constructs live intervals which represent the liveness of different stack slots. Next, the pass merges disjoint intervals.  Notice that lifetime markers are not perfect single-entry-single-exit regions. They may be removed by optimizations, they may start with two markers and end with one, or even not end at all!
>>
>> So, why is this done in codegen?  There are a number of reasons. First, joining allocas may hinder alias analysis. Second, in the future we would like to share the alloca space with spill slots.
>>
>> The inliner has a 'hack' for merging allocas when inlining functions. We plan to remove this hack once this pass is tuned and we see that there are no regressions.  Also, we plan to look at joining multiple non-disjoint slots into a bigger disjoint slot.
>>
>> This work is based on code by Owen, and on feedback and ideas from a number of other engineers at Apple.
>>
>> Any comments or review are much appreciated.
>
> +  BitVector LiveInToggle = LocalLiveIn;
> +  LiveInToggle.reset(LIVE_IN[BB]);
> +      if (LiveInToggle.any()) {
> +        changed = true;
> +        LIVE_IN[BB] |= LocalLiveIn;
> +
> ...
>
> +      }
>
>
> It looks like you are copying the entire bitvector just to figure out
> if the reset changes anything (there are a few other places this is
> done too).
> That seems ugly and expensive (space/time wise) compared to just
> figuring out a good name for such a function in BitVector (say
> "Difference" or "EmptyDifference" or something) and implementing it
> there, and returning a bool from it.  Besides the space inefficiency
> of the copy, difference can return true the second it discovers any
> BitWord is different in A - B, whereas yours will process the entire
> "B" bitmap, performing a reset of all of those bits, *then* check
> whether something has changed.
>
> i.e., you should just write
> // Return true if lhs - rhs is nonempty
> bool BitVector::Difference(const BitVector &lhs, const BitVector &rhs)
>
> and use that.
>
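A minimal sketch of that early-exit difference test, for illustration only. It
uses a plain vector of 64-bit words rather than llvm::BitVector, and the names
Words and hasNonEmptyDifference are hypothetical, not existing LLVM API. It
also assumes both sets have the same number of words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bit vector: one bit per stack slot, packed
// into 64-bit words. This is not llvm::BitVector.
using Words = std::vector<std::uint64_t>;

// Return true if (lhs - rhs) is nonempty, i.e. some bit is set in lhs but
// not in rhs. Nothing is copied, and the loop stops at the first word that
// contributes a bit to the difference.
bool hasNonEmptyDifference(const Words &lhs, const Words &rhs) {
  assert(lhs.size() == rhs.size() && "sketch assumes equal-sized sets");
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    if (lhs[i] & ~rhs[i])
      return true;
  }
  return false;
}
```

The point of the sketch is only the shape of the loop: no temporary copy, and
no work past the first differing word.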

BTW, another alternative (which GCC does) would be to just have a
"union of difference" (i.e. A U (B - A)) function that returns the
changed status.

This would reduce the code to

// Perform LIVE_OUT[BB] = LIVE_OUT[BB] U (LocalLiveOut - LIVE_OUT[BB])
bool localchanged = BitVector::UnionOfDifference(LIVE_OUT[BB], LocalLiveOut);
if (localchanged) {
  changed = true;
  for (MachineBasicBlock::succ_iterator SI = BB->succ_begin(),
       SE = BB->succ_end(); SI != SE; ++SI)
    NextBBSet.insert(*SI);
}
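
For concreteness, a rough sketch of what such a "union of difference" helper
could look like. As above, this works on a plain vector of 64-bit words, not
on llvm::BitVector, and Words and unionOfDifference are hypothetical names
chosen for the example. It assumes both sets have the same number of words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bit vector; not llvm::BitVector.
using Words = std::vector<std::uint64_t>;

// Perform dst = dst U (src - dst), which is the same set as dst | src, and
// return true if any bit of dst changed as a result.
bool unionOfDifference(Words &dst, const Words &src) {
  assert(dst.size() == src.size() && "sketch assumes equal-sized sets");
  bool changed = false;
  for (std::size_t i = 0; i < dst.size(); ++i) {
    std::uint64_t added = src[i] & ~dst[i]; // bits in src that dst lacks
    dst[i] |= added;
    changed |= (added != 0);
  }
  return changed;
}
```

Unlike a pure difference test, this one cannot exit early, since it has to
finish the union, but it still avoids the temporary copy and does the update
and the change check in a single pass.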

> You also iterate more than just the dirty blocks on each iteration of
> the dataflow computation, but I guess it's not expensive enough to
> matter.
>
>>
>> Thanks,
>> Nadav