[llvm-dev] linear-scan RA

Tue Sep 11 11:23:19 PDT 2018

Yes, I quite liked the things I've read about the PBQP allocator.

Given what the hardware folks have to go through to get 1% improvements in
scalar code,
spending 20% (or whatever) compile time (under control of a flag) seems
like nothing.
And falling back on "average code" is a little disingenuous.
People looking for performance don't care about average code;
they care about their own personal code, their important loop, and wonder
why there are unnecessary copies right *there* and wasn't this problem
solved ages ago?

Preston

On Tue, Sep 11, 2018 at 11:00 AM, Quentin Colombet <
quentin.colombet at gmail.com> wrote:

> Le mar. 11 sept. 2018 à 10:23, Preston Briggs
> <preston.briggs at gmail.com> a écrit :
> >
> > Hi,
> >
> > Using Chaitin's approach, removing a copy via coalescing could expose
> more opportunities for coalescing.
> > So he would iteratively rebuild the interference graph and check for
> more opportunities.
>
> I am not sure Chaitin's approach would lead to a better coalescing
> than what LLVM does.
> I believe most of the copies that we see in the final code comes from
> live-range splitting (i.e., to avoid spill), not coalescing, but I am
> nitpicking :).
>
> >
> > Chaitin was also careful to make sure that the source and destination of
> a copy didn't interfere unnecessarily (because of the copy alone);
> > that is, his approach to interference was very precise. The idea of
> computing interference from intersecting ranges seems less precise,
> > though I owe Matthias examples demonstrating these supposed weaknesses.
> >
> > Finally, when spills are introduced, more opportunities for coalescing
> can arise.
> > By separating coalescing and allocation, it seems like you miss an
> opportunity.
> >
>
> LLVM still eliminates copies at coloring time. The available registers
> are biases toward the colors used by whatever copies connected to
> current live-range.
>
> Also, to be fair, LLVM provides a graph coloring regalloc with the
> PBQP allocator.
>
> That said, I agree that we trade compile time for code quality, but
> pragmatically, it is hard to justify spending say 20% time in compile
> time to achieve maybe 1% speed up on average (numbers completely mad
> up :)). If you're willing to pay the price, the PBQP may be an option.
> ARM made a talk showing how it performed
> (https://llvm.org/devmtg/2014-10/Slides/PBQP-update-and-in-the-wild.pdf).
>
> What I am saying is, I believe you're right but there isn't a big
> enough incentive for a company to invest in such project.
>
> > I admit that the extra time spent rebuilding the interference graph
> > and attempting to coalesce can add up, but I hate that we (apparently)
> give up on quality
> > results to maybe save some compile time. Our machines are really fast
> these days;
> > lets spend their speed on something useful!
> >
> > Thanks for the pointer and explanation,
> > Preston
> >
> >
> > On Tue, Sep 11, 2018 at 9:40 AM, Quentin Colombet <
> quentin.colombet at gmail.com> wrote:
> >>
> >> Hi Preston,
> >>
> >> To summarize what Matthias said, LLVM approach to regalloc is split in
> >> two main phases:
> >> 1. Aggressive coalescing based on the "interference graph" of the
> >> variable in SSA (we don't actually build the interference graph). At
> >> that stage, we don't care about the whether or not the resulting
> >> representation is colorable with K register.
> >> 2. Priority based register allocation with live-range splitting.
> >> Basically when we are about to spill, instead of doing that right
> >> away, we have heuristics to split the live-range and re-enqueue the
> >> results.
> >>
> >> The most high-level description of LLVM's regalloc that I can think of
> >> is available here:
> >> https://llvm.org/devmtg/2011-11/Olesen_RegisterAllocation.pdf
> >>
> >> Cheers,
> >> -Quentin
> >> Le lun. 10 sept. 2018 à 17:28, Matthias Braun via llvm-dev
> >> <llvm-dev at lists.llvm.org> a écrit :
> >> >
> >> >
> >> >
> >> > On Sep 10, 2018, at 5:25 PM, Matthias Braun <mbraun at apple.com> wrote:
> >> >
> >> >
> >> >
> >> > On Sep 10, 2018, at 5:11 PM, Preston Briggs <preston.briggs at gmail.com>
> wrote:
> >> >
> >> > The phi instruction is irrelevant; just the way I think about things.
> >> > The question is if the allocator believes that t0 and t2 interfere.
> >> >
> >> > Perhaps the coalescing example was too simple.
> >> > In the general case, we can't coalesce without a notion of
> interference.
> >> >
> >> > There is also logic in `LiveRange::overlaps()` that considers live
> ranges as not overlapping in some cases where they are related via copy
> instructions. This will also effect the interference during the main
> allocation algorithm.
> >> >
> >> > My worry is that looking at interference by ranges of instruction
> numbers
> >> > leads to inaccuracies when a range is introduced by a copy.
> >> >
> >> >
> >> > I can't think of a case where we wouldn't have the same accuracy as a
> graph coloring allocator.
> >> > If you can construct an example where we don't I'd love to hear about
> it.
> >> >
> >> > If you wonder about the liveness information, you can perform
> experiments like this (I'm using a NOOP instruction with an implicit use
> operand to produce some artificial uses).
> >> >
> >> > $ cat test.mir
> >> > name: somefunc
> >> > body: |
> >> >   bb.0:
> >> >     %0:gr32 = MOV32ri 42
> >> >     JB_1 %bb.2, undef implicit $eflags
> >> >     JMP_1 %bb.2
> >> >
> >> >   bb.1:
> >> >     %1:gr32 = MOV32ri 17
> >> >     JMP_1 %bb.3
> >> >
> >> >   bb.2:
> >> >     NOOP implicit %0
> >> >     %1 = COPY %0
> >> >     JMP_1 %bb.3
> >> >
> >> >   bb.3:
> >> >     NOOP implicit %1
> >> >
> >> >
> >> >
> >> > $ llc -run-pass=liveintervals -debug-only=regalloc test.mir
> >> > ********** INTERVALS **********
> >> > %0 [16r,64B:0)[112B,144r:0)  0 at 16r weight:0.000000e+00
> >> > %1 [80r,112B:1)[144r,176B:0)[176B,192r:2)  0 at 144r 1 at 80r 2 at 176B-phi
> weight:0.000000e+00
> >> > RegMasks:
> >> > ********** MACHINEINSTRS **********
> >> > # Machine code for function somefunc: NoPHIs
> >> >
> >> > 0B bb.0:
> >> >  successors: %bb.2(0x80000000); %bb.2(100.00%)
> >> >
> >> > 16B  %0:gr32 = MOV32ri 42
> >> > 32B  JB_1 %bb.2, implicit undef $eflags
> >> > 48B  JMP_1 %bb.2
> >> >
> >> > 64B bb.1:
> >> >  successors: %bb.3(0x80000000); %bb.3(100.00%)
> >> >
> >> > 80B  %1:gr32 = MOV32ri 17
> >> > 96B  JMP_1 %bb.3
> >> >
> >> > 112B bb.2:
> >> > ; predecessors: %bb.0
> >> >  successors: %bb.3(0x80000000); %bb.3(100.00%)
> >> >
> >> > 128B  NOOP implicit %0:gr32
> >> > 144B  %1:gr32 = COPY %0:gr32
> >> > 160B  JMP_1 %bb.3
> >> >
> >> > 176B bb.3:
> >> > ; predecessors: %bb.1, %bb.2
> >> >
> >> > 192B  NOOP implicit %1:gr32
> >> >
> >> > # End machine code for function somefunc.
> >> >
> >> >
> >> > If you look at the "intervals" (the class is a misnomer since
> nowadays it contains a list of ranges...) in the beginning you see that %0
> and %1 do not overlap anywhere.
> >> >
> >> > - Matthias
> >> >
> >> >
> >> > But perhaps I should focus on the links and, as you suggested,
> >> > the debugging info.
> >> >
> >> > Thanks,
> >> > Preston
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Sep 10, 2018 at 5:02 PM, Matthias Braun <mbraun at apple.com>
> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Sep 10, 2018, at 4:53 PM, Preston Briggs <
> preston.briggs at gmail.com> wrote:
> >> >>
> >> >>
> >> >> > The underlying liveness datastructure is a list of ranges where
> each vreg is alive
> >> >> > (ranges in terms of instructions numbered). I remember a couple of
> later linear scan
> >> >> > papers describing the same thing (Traub et.al. being the first if
> I remember correctly).
> >> >> > That should be as accurate as you can get in terms of liveness
> information.
> >> >>
> >> >> It depends on the details.
> >> >> For example, given
> >> >>
> >> >> t0 = mumble
> >> >>
> >> >> if (something) {
> >> >>   t2 = 3
> >> >> }
> >> >> else {
> >> >>   t3 = t0 + 3
> >> >>   print t0
> >> >> }
> >> >> t4 = phi(t2, t3)
> >> >>
> >> >>
> >> >> it's clear that t2 and t0 shouldn't interfere,
> >> >> but some folks might say the ranges overlap.
> >> >>
> >> >>
> >> >> Similarly,
> >> >>
> >> >> t6 = mumble
> >> >> t7 = t6
> >> >> t8 = t6 + 5
> >> >> t9 = t7 + 10
> >> >> print t8, t9
> >> >>
> >> >>
> >> >> Chaitin points out that t6 and t7 shouldn't interfere,
> >> >> even though the live ranges overlap.
> >> >>
> >> >> - We go out of SSA form before allocating registers so you won't see
> phi instruction.
> >> >> - In the second case the copy coalesceing pass should have coalesced
> t6 and t7.
> >> >>
> >> >> I would expect both cases to work as you expect.
> >> >>
> >> >> - Matthias
> >> >>
> >> >>
> >> >>
> >> >> Anyway, I'll look at the links.
> >> >>
> >> >> Thanks,
> >> >> Preston
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>>
> >> >>>
> >> >>> We have separate aggressive coalescing pass before allocation. The
> register allocator will later perform live range splitting inside the main
> allocation loop as it seems fit.
> >> >>>
> >> >>>
> >> >>> I ask these questions because we (guys I work with) see loops
> >> >>> where there's a little register juggling that seems unnecessary.
> >> >>>
> >> >>> Well your best bet is to learn reading the output of `-mllvm
> -print-machineinstrs -mllvm -debug-only=regalloc` (which can take a
> while)...
> >> >>>
> >> >>>
> >> >>> Is there a paper that describes what y'all do?
> >> >>>
> >> >>>
> >> >>> I'm only aware of a blog post:
> >> >>> http://blog.llvm.org/2011/09/greedy-register-allocation-in-
> llvm-30.html
> >> >>> and a dev conference talk in 2011:
> >> >>> https://llvm.org/devmtg/2011-11/
> >> >>>
> >> >>> - Matthias
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>> Preston
> >> >>>
> >> >>>
> >> >>> On Mon, Sep 10, 2018 at 9:57 AM, Matthias Braun <mbraun at apple.com>
> wrote:
> >> >>>>
> >> >>>> I would not describe LLVMs register allocator as linear scan, it's
> closer to graph coloring than linear scan IMO (though doesn't really
> matcher either approach).
> >> >>>>
> >> >>>> RegAllocGreedy assigns the registers in an order based on the
> priority value computed in enqueu() we are not just scanning from top to
> bottom of the program. We also perform actual interference checks we just
> don't happen to build up an explicit interference graph up front.
> >> >>>>
> >> >>>> - Matthias
> >> >>>>
> >> >>>> On Sep 10, 2018, at 9:49 AM, Preston Briggs via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >> >>>>
> >> >>>> Why have we ended up using linear-scan register allocation
> >> >>>> by default (instead of, e.g., coloring)?
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Preston
> >> >>>>
> >> >>>>
> >> >>>> _______________________________________________
> >> >>>> LLVM Developers mailing list
> >> >>>> llvm-dev at lists.llvm.org
> >> >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >>>>
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > _______________________________________________
> >> > LLVM Developers mailing list
> >> > llvm-dev at lists.llvm.org
> >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/9ce5ce66/attachment.html>