<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Sat, Oct 19, 2013 at 8:22 AM, Rafael Espíndola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The code hot path showed up in a build of postgresql.<br>

<br>

Synthetic benchmarks like these do have their value. They expose bad<br>

asymptotic behaviour that does show up in user code, but is harder to<br>

measure.<br></blockquote><div><br></div><div>How is it hard to measure? Just look at the distribution of redeclaration chain lengths. What I'm trying to get across is that focusing on asymptotic complexity when the overwhelming majority of cases are "constant-sized" seems a bit misguided. It's always possible to add a fallback mechanism to guarantee good asymptotic complexity. It's the same principle as SmallVector: you ensure that a specific common case is very fast, and fall back to a slower version when the assumptions that enable the optimization fail.</div>
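<div><br></div><div>To make "look at the distribution" concrete, here is a rough sketch of the kind of tally I have in mind. Everything in it is hypothetical: Node is a stand-in for a declaration with a link to its previous redeclaration, not Clang's actual Decl or Redeclarable, and the caller is assumed to have collected the most recent declaration of each chain somehow.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a declaration node; not Clang's Decl class.
struct Node {
  Node *Prev = nullptr; // previous redeclaration, or nullptr for the first
};

// Length of the chain that ends at Last, counted by walking the Prev links.
inline std::size_t chainLength(const Node *Last) {
  std::size_t N = 0;
  for (const Node *D = Last; D; D = D->Prev)
    ++N;
  return N;
}

// Tally how many chains have each length: chain length -> number of chains.
inline std::map<std::size_t, std::size_t>
chainLengthHistogram(const std::vector<const Node *> &LastDecls) {
  std::map<std::size_t, std::size_t> Histogram;
  for (const Node *Last : LastDecls) {
    assert(Last && "each entry must point at a declaration");
    ++Histogram[chainLength(Last)];
  }
  return Histogram;
}
```

<div>Dumping a map like this at the end of a real compile would show whether chains are almost always of length 1 or 2, or whether long chains actually occur outside of synthetic inputs.</div>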

<div><br></div><div>The method I suggested for packing bits of the address of the node that is O(n) links away into the low bits of each link is kind of a hack, but it *does* guarantee constant-time access to that node.</div><div> </div>
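<div>A minimal sketch of that bit-packing idea, under loudly stated assumptions: nodes are 16-byte aligned (so each link has 4 spare low bits), casting a pointer to std::uintptr_t keeps the alignment zeros in the low bits (true on mainstream platforms, but implementation-defined), and the caller knows each node's index in the chain when linking it. One spare bit marks where chunk 0 sits and the other three carry a slice of the first node's address. The names Node, link and findFirst are made up for illustration and are not Clang APIs.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr unsigned SpareBits = 4;               // assumed: 16-byte alignment
constexpr unsigned PayloadBits = SpareBits - 1; // one spare bit is the marker
constexpr unsigned AddrBits = sizeof(std::uintptr_t) * 8 - SpareBits;
constexpr unsigned NumChunks = (AddrBits + PayloadBits - 1) / PayloadBits;
constexpr std::uintptr_t LowMask = (std::uintptr_t(1) << SpareBits) - 1;
constexpr std::uintptr_t PayloadMask = (std::uintptr_t(1) << PayloadBits) - 1;
constexpr std::uintptr_t MarkerBit = std::uintptr_t(1) << PayloadBits;

// Hypothetical node: the previous-link pointer with extra bits in its low end.
struct alignas(16) Node {
  std::uintptr_t PrevAndBits = 0;
  Node *prev() const {
    return reinterpret_cast<Node *>(PrevAndBits & ~LowMask);
  }
};

// Link New after Prev (nullptr for the first node). Index is New's position
// in the chain; node i stores chunk (i % NumChunks) of the first's address.
inline void link(Node &New, Node *Prev, const Node &First, std::size_t Index) {
  std::uintptr_t FirstAddr =
      reinterpret_cast<std::uintptr_t>(&First) >> SpareBits;
  unsigned Chunk = Index % NumChunks;
  std::uintptr_t Payload = (FirstAddr >> (Chunk * PayloadBits)) & PayloadMask;
  std::uintptr_t Marker = Chunk == 0 ? MarkerBit : 0;
  New.PrevAndBits = reinterpret_cast<std::uintptr_t>(Prev) | Marker | Payload;
}

// Find the first node while following at most NumChunks links.
inline const Node *findFirst(const Node *From) {
  std::uintptr_t Payloads[NumChunks] = {};
  unsigned MarkerStep = NumChunks; // sentinel: no marker seen yet
  const Node *D = From;
  for (unsigned Step = 0; Step < NumChunks; ++Step) {
    if (!D->prev())
      return D; // short chain: a plain walk already reached the first node
    Payloads[Step] = D->PrevAndBits & PayloadMask;
    if (D->PrevAndBits & MarkerBit)
      MarkerStep = Step;
    D = D->prev();
  }
  // NumChunks consecutive nodes always include exactly one marker node.
  assert(MarkerStep < NumChunks);
  std::uintptr_t Addr = 0;
  for (unsigned Step = 0; Step < NumChunks; ++Step) {
    // Walking backward decreases the chunk number by one per step.
    unsigned Chunk = (MarkerStep + NumChunks - Step) % NumChunks;
    Addr |= Payloads[Step] << (Chunk * PayloadBits);
  }
  return reinterpret_cast<const Node *>(Addr << SpareBits);
}
```

<div>With 64-bit pointers this walks at most 20 links no matter how long the chain is, which is the "constant" I meant. It is still a hack: the constant is not small, and the short-chain case just falls back to the ordinary walk.</div>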

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

For example, when this benchmark first came into being, the linkage<br>

computation was nonlinear and dominated. Fixing it helped existing<br>

code and moved the hot spot to decl linking. It looks like the hot<br>

path is back to linkage computation, and we are still a lot slower<br>

than gcc on this one, so fixing decl chaining will make this a good<br>

linkage benchmark again.<br>

<br>

Unbounded superlinear algorithms in general provide a minefield that<br>

is not very user friendly.<br></blockquote><div><br></div><div>Your patch doesn't seem to affect the asymptotic complexity of anything though: a plain singly linked list (with a marked "head") will always have an operation that is O(n) (and we seem to use both traversal directions, so there's no escaping this O(n) without altering the data structure). It seems like all this patch is doing is switching which case is O(n) because a single microbenchmark happens to hit one particular traversal direction very heavily; do you have a reason to believe that there isn't some other microbenchmark that now becomes superlinear?</div>
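<div><br></div><div>To spell out the trade-off, here is a rough sketch of the two circular layouts as I understand them from your description. The types and function names are hypothetical and this is not the Redeclarable code; the Steps counters exist only to show where the O(n) walk ends up.</div>

```cpp
#include <cassert>
#include <cstddef>

// Layout A (current, as described): links point backward, and the first
// node points at the most recent one.
struct BackNode {
  BackNode *Link = this; // previous node, or the latest node if IsFirst
  bool IsFirst = true;
};

// O(n): walk the backward links until the first node is reached.
inline BackNode *firstOf(BackNode *D, std::size_t &Steps) {
  while (!D->IsFirst) {
    D = D->Link;
    ++Steps;
  }
  return D;
}

// Chaining must find the first node so it can point it at the new latest.
inline void appendBack(BackNode &New, BackNode &Prev, std::size_t &Steps) {
  BackNode *First = firstOf(&Prev, Steps);
  New.IsFirst = false;
  New.Link = &Prev;
  First->Link = &New;
}

// O(1): the previous declaration is just the stored link.
inline BackNode *previousBack(BackNode *D) {
  return D->IsFirst ? nullptr : D->Link;
}

// Layout B (the patch, as described): links point forward, and the last
// node points back at the first one.
struct FwdNode {
  FwdNode *Link = this; // next node, or the first node if IsLast
  bool IsLast = true;
};

// O(1): splice the new node in after the current last node.
inline void appendFwd(FwdNode &New, FwdNode &Last) {
  assert(Last.IsLast && "can only append after the last node");
  New.Link = Last.Link; // the first node
  New.IsLast = true;
  Last.Link = &New;
  Last.IsLast = false;
}

// O(n): walk around the circle until we find the node that links to D.
inline FwdNode *previousFwd(FwdNode *D, std::size_t &Steps) {
  FwdNode *P = D;
  while (P->Link != D) {
    P = P->Link;
    ++Steps;
  }
  return P->IsLast ? nullptr : P; // the last node links to the first
}
```

<div>Building a chain of n nodes costs O(n^2) link traversals in layout A and O(n) in layout B, while asking for the previous declaration of each node costs O(n) total in A and O(n^2) in B. The cost moves; it does not go away.</div>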

<div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5"><br>

<br>

On 19 October 2013 02:25, Sean Silva <<a href="mailto:silvas@purdue.edu">silvas@purdue.edu</a>> wrote:<br>

><br>

><br>

><br>

> On Tue, Oct 8, 2013 at 11:09 PM, Rafael Espíndola<br>

> <<a href="mailto:rafael.espindola@gmail.com">rafael.espindola@gmail.com</a>> wrote:<br>

>><br>

>> I found this old incomplete patch while cleaning my git repo. I just<br>

>> want to see if it is crazy or not before trying to finish it.<br>

><br>

><br>

> What originally motivated this? Did you measure something that made you<br>

> think that this had the potential to be faster?<br>

><br>

>><br>

>><br>

>> Currently decl chaining is O(n). We use a circular singly linked list<br>

>> that points to the previous element and has a bool to say if we are<br>

>> the first element (and actually point to the last).<br>

>><br>

>> Adding a new decl is O(n) because we have to find the first element by<br>

>> walking the prev links. One way to make this O(1) that is sure to work<br>

>> is a doubly linked list, but that would be very wasteful in memory.<br>

>><br>

>> What this patch does is reverse the list so that a decl points to the<br>

>> next decl (or to the first if it is the last). With this chaining<br>

>> becomes O(1). The flip side is that getPreviousDecl is now O(n).<br>

>><br>

>> In this patch I just got check-clang to work and replaced enough uses<br>

>> of getPreviousDecl to get a speedup in<br>

>><br>

>>     #define M extern int a;<br>

>>     #define M2 M M<br>

>>     #define M4 M2 M2<br>

>>     #define M8 M4 M4<br>

>>     #define M16 M8 M8<br>

>>     #define M32 M16 M16<br>

>>     #define M64 M32 M32<br>

>>     #define M128 M64 M64<br>

>>     #define M256 M128 M128<br>

>>     #define M512 M256 M256<br>

>>     #define M1024 M512 M512<br>

>>     #define M2048 M1024 M1024<br>

>>     #define M4096 M2048 M2048<br>

>>     #define M8192 M4096 M4096<br>

>>     #define M16384 M8192 M8192<br>

>>     M16384<br>

>><br>

>> On my machine this patch takes clang -cc1 on the preprocessed version<br>

>> of that from 0m4.748s to 0m1.525s.<br>

><br>

><br>

> What is this microbenchmark even measuring? Is there any reason to believe<br>

> that this is representative enough of anything to guide a decision?<br>

><br>

> I feel like what's missing here are measurements of the actual behavior of<br>

> this code path. For example, how long are we spending walking these<br>

> redeclaration chains in real code? On average how long are the redeclaration<br>

> chains when compiling real code? Almost always 1? Usually 2? Generally<br>

> between 3 and 5? >100? Each of the cases I just listed puts the situation in<br>

> a completely different light. Are any particular sites that call these APIs<br>

> (or particular AST classes) inducing far more link traversals than other<br>

> sites when compiling typical code? (i.e., instrument the "get next link"<br>

> routine to tally up by call site). Maybe the usage patterns of some AST<br>

> nodes benefit more from forward traversal, and others from backward?<br>

><br>

><br>

> Side note (completely impractical): if you have spare bits in the bottom of<br>

> the pointer, then you could store bits of the address of the first decl (or<br>

> whichever one is O(n) links away) in each link, so that in the worst case<br>

> you only have to walk a constant number of links before you collect all the<br>

> bits of the first pointer :)<br>

><br>

> -- Sean Silva<br>

><br>

>><br>

>><br>

>> There are still a lot of uses of getPreviousDecl to go, but can anyone<br>

>> see a test case where this strategy would not work?<br>

>><br>

>> Cheers,<br>

>> Rafael<br>

>><br>

>> _______________________________________________<br>

>> cfe-commits mailing list<br>

>> <a href="mailto:cfe-commits@cs.uiuc.edu">cfe-commits@cs.uiuc.edu</a><br>

>> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits</a><br>

>><br>

><br>

</div></div></blockquote></div><br></div></div>