[early patch] Speed up decl chaining

Sean Silva silvas at purdue.edu
Sat Oct 19 14:41:25 PDT 2013


On Sat, Oct 19, 2013 at 8:22 AM, Rafael Espíndola <
rafael.espindola at gmail.com> wrote:

> This code path showed up as hot in a build of postgresql.
>
> Synthetic benchmarks like these do have their value. They expose bad
> asymptotic behaviour that does show up in user code, but is harder to
> measure.
>

How is it hard to measure? Just look at the distribution of redeclaration
chain lengths. What I'm trying to get across is that focusing on asymptotic
complexity when the overwhelming majority of cases are "constant-sized"
seems a bit misguided. It's always possible to add a fallback mechanism to
guarantee good asymptotic complexity. It's the same principle as
SmallVector: you make a specific common case very fast, and fall back to a
slower version when the assumptions that enable the optimization fail.
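To make the principle concrete, here is a minimal sketch (a hypothetical
SmallBuffer, not the real llvm::SmallVector): up to N elements live in an
inline buffer with no allocation, and anything larger spills once to a
heap-backed vector, so the uncommon case keeps sane asymptotic behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the fast-common-case-with-fallback principle. Hypothetical
// class for illustration only; not llvm::SmallVector's actual layout.
template <typename T, std::size_t N>
class SmallBuffer {
  T Inline[N];          // storage for the fast common case
  std::size_t Size = 0; // number of elements stored inline
  std::vector<T> Heap;  // fallback for the uncommon large case

public:
  void push_back(const T &V) {
    if (Heap.empty() && Size < N) {
      Inline[Size++] = V; // fast path: no allocation
      return;
    }
    if (Heap.empty())
      Heap.assign(Inline, Inline + Size); // spill the inline elements once
    Heap.push_back(V);
  }

  std::size_t size() const { return Heap.empty() ? Size : Heap.size(); }
  const T &operator[](std::size_t I) const {
    return Heap.empty() ? Inline[I] : Heap[I];
  }
};
```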

The method I suggested, packing bits of the address of the node that is
O(n) links away into the low bits of each link, is kind of a hack, but it
*does* guarantee constant-time access to that node.
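A hedged sketch of what I mean (hypothetical types, not Clang code): each
link donates a few alignment-spare low bits to hold one slice of the head
node's address, so any node can reassemble the head pointer by walking a
bounded number of links. For clarity the slice index is a separate field
here; a real implementation would have to derive it or steal bits for it
too.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned BitsPerLink = 3; // spare bits from 8-byte alignment
constexpr unsigned AddrBits = sizeof(std::uintptr_t) * 8;
constexpr unsigned NumChunks = (AddrBits + BitsPerLink - 1) / BitsPerLink;

struct Node {
  Node *Next = nullptr;
  unsigned ChunkIndex = 0;  // which slice of the head address this node holds
  std::uintptr_t Chunk = 0; // the slice (would live in Next's low bits)
};

// Distribute the head's address across the chain, one slice per node.
void packHeadAddress(std::vector<Node> &Chain) {
  std::uintptr_t Head = reinterpret_cast<std::uintptr_t>(&Chain.front());
  for (std::size_t I = 0; I < Chain.size(); ++I) {
    Chain[I].ChunkIndex = I % NumChunks;
    Chain[I].Chunk = (Head >> (Chain[I].ChunkIndex * BitsPerLink)) &
                     ((std::uintptr_t(1) << BitsPerLink) - 1);
  }
}

// Recover the head pointer from any node of a circular chain by walking
// until every slice has been seen (a bounded number of links, independent
// of the chain length).
Node *recoverHead(Node *Start) {
  std::uintptr_t Head = 0, SeenMask = 0;
  unsigned Seen = 0;
  for (Node *N = Start; N && Seen < NumChunks; N = N->Next) {
    std::uintptr_t Bit = std::uintptr_t(1) << N->ChunkIndex;
    if (!(SeenMask & Bit)) {
      SeenMask |= Bit;
      ++Seen;
      Head |= N->Chunk << (N->ChunkIndex * BitsPerLink);
    }
  }
  return Seen == NumChunks ? reinterpret_cast<Node *>(Head) : nullptr;
}
```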


>
> For example, when this benchmark first came to being, the linkage
> computation was non linear and dominated. Fixing it helped existing
> code and moved the hot spot to decl linking. It looks like the hot
> path is back to linkage computation, and we are still a lot slower
> than gcc on this one, so fixing decl chaining will make this a good
> linkage benchmark again.
>
> Unbounded superlinear algorithms in general are a minefield that is not
> very user friendly.
>

Your patch doesn't seem to affect the asymptotic complexity of anything,
though: a plain singly linked list (with a marked "head") will always have
one operation that is O(n) (and we seem to use both directions, so there's
no escaping this O(n) without altering the data structure). It seems like
all this patch does is switch which case is O(n), because a single
microbenchmark happens to hit one particular traversal direction very
heavily; do you have a reason to believe that there isn't some other
microbenchmark that now becomes superlinear?
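For concreteness, here is the trade-off in miniature (hypothetical types,
not Clang's actual redeclaration-chain data structure): in the current
prev-linked orientation, getPreviousDecl is O(1) and chaining a new decl
is O(n); the patch's next-linked orientation simply swaps those two costs.

```cpp
#include <cassert>
#include <cstddef>

// Prev-linked orientation of the circular chain: each decl points to its
// previous redeclaration, and the first decl closes the circle by
// pointing at the most recent one.
struct PrevDecl {
  PrevDecl *Link; // previous decl, or the most recent decl if IsFirst
  bool IsFirst;
};

// O(n): we must walk back to the first decl to update its circular link.
void addRedecl(PrevDecl *MostRecent, PrevDecl *New) {
  PrevDecl *First = MostRecent;
  while (!First->IsFirst)
    First = First->Link;
  New->Link = MostRecent;
  New->IsFirst = false;
  First->Link = New; // first decl now points at the new most recent decl
}

// O(1): the previous decl is one hop away (none for the first decl).
PrevDecl *getPreviousDecl(PrevDecl *D) {
  return D->IsFirst ? nullptr : D->Link;
}
```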

-- Sean Silva


>
>
> On 19 October 2013 02:25, Sean Silva <silvas at purdue.edu> wrote:
> >
> >
> >
> > On Tue, Oct 8, 2013 at 11:09 PM, Rafael Espíndola
> > <rafael.espindola at gmail.com> wrote:
> >>
> >> I found this old incomplete patch while cleaning my git repo. I just
> >> want to see if it is crazy or not before trying to finish it.
> >
> >
> > What originally motivated this? Did you measure something that made you
> > think that this had the potential to be faster?
> >
> >>
> >>
> >> Currently decl chaining is O(n). We use a circular singly linked list
> >> in which each element points to the previous one and carries a bool
> >> saying whether it is the first element (in which case it actually
> >> points to the last).
> >>
> >> Adding a new decl is O(n) because we have to find the first element by
> >> walking the prev links. One way to make this O(1) that is sure to work
> >> is a doubly linked list, but that would be very wasteful in memory.
> >>
> >> What this patch does is reverse the list so that a decl points to the
> >> next decl (or to the first if it is the last). With this chaining
> >> becomes O(1). The flip side is that getPreviousDecl is now O(n).
> >>
> >> In this patch I just got check-clang to work and replaced enough uses
> >> of getPreviousDecl to get a speedup in
> >>
> >>     #define M extern int a;
> >>     #define M2 M M
> >>     #define M4 M2 M2
> >>     #define M8 M4 M4
> >>     #define M16 M8 M8
> >>     #define M32 M16 M16
> >>     #define M64 M32 M32
> >>     #define M128 M64 M64
> >>     #define M256 M128 M128
> >>     #define M512 M256 M256
> >>     #define M1024 M512 M512
> >>     #define M2048 M1024 M1024
> >>     #define M4096 M2048 M2048
> >>     #define M8192 M4096 M4096
> >>     #define M16384 M8192 M8192
> >>     M16384
> >>
> >> On my machine this patch takes clang -cc1 on the preprocessed version
> >> of that from 0m4.748s to 0m1.525s.
> >
> >
> > What is this microbenchmark even measuring? Is there any reason to
> believe
> > that this is representative enough of anything to guide a decision?
> >
> > I feel like what's missing here are measurements of the actual behavior
> of
> > this code path. For example, how long are we spending walking these
> > redeclaration chains in real code? On average how long are the
> redeclaration
> > chains when compiling real code? Almost always 1? Usually 2? Generally
> > between 3 and 5? >100? Each of the cases I just listed puts the
> situation in
> > a completely different light. Are any particular sites that call these
> API's
> > (or particular AST classes) inducing far more link traversals than other
> > sites when compiling typical code? (i.e., instrument the "get next link"
> > routine to tally up by call site). Maybe the usage patterns of some AST
> > nodes benefit more from forward traversal, and others from backward?
> >
> >
> > Side note (completely impractical): if you have spare bits in the bottom
> of
> > the pointer, then you could store bits of the address of the first decl
> (or
> > whichever one is O(n) links away) in each link, so that in the worst case
> > you only have to walk a constant number of links before you collect all
> the
> > bits of the first pointer :)
> >
> > -- Sean Silva
> >
> >>
> >>
> >> There are still a lot of uses of getPreviousDecl to go, but can anyone
> >> see a test case where this strategy would not work?
> >>
> >> Cheers,
> >> Rafael
> >>
> >> _______________________________________________
> >> cfe-commits mailing list
> >> cfe-commits at cs.uiuc.edu
> >> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
> >>
> >
>