[cfe-dev] libc++ red-black tree: halving the branches
hhinnant at apple.com
Wed Apr 4 20:04:17 PDT 2012
On Apr 4, 2012, at 8:58 PM, Sean Silva wrote:
> I came up with a neat little trick (that may have already been considered) that halves the number of conditionals in the venerable red-black tree rebalancing operations. The key observation is that there is a big `if` whose two branches are identical except that "left" and "right" are switched.
> Basically, you use pointer-to-members for each of `left` and `right`, and then a single swap reflects the case. You can see this in this experimental modification of the code for __tree_balance_after_insert <http://pastie.org/3730128> when compared to the current version (it uses a unified __tree_rotate that takes a pointer to member for what its notion of "left" should be).
Very clever. I believe this is good code to familiarize one's self with in order to add another good tool to the toolbox.
> In my quick experiments I found that this halved the number of branches in the generated code (iirc 26 to 13 or something like that).
In counting branches for performance comparisons (as opposed to code size comparisons) it is important to count the number of branches executed by each path through the routine, as opposed to the total number of branches. Your rewrite, at first glance, seems to reduce the total, but not the number of branches executed by any given path.
> The code size reduction wasn't as drastic, but it was still ~30% (I'm guessing the optimizer usually is doing some heroics to fold some commonalities, but still failing to recognize the fundamental sense of the commonality). Correct me if I'm wrong, but the way that __tree_rebalance_after_insert and friends are written means that only one copy of that routine should appear in the final binary, so the code size savings is a fixed constant; (or is the intent to have it get inlined into __tree<...>::__insert_node_at ?).
I believe you are correct, unless the client is using allocator smart pointers, which is exceedingly rare. Typically only the following will be instantiated:
So 30% savings on 1 routine, which I'm guessing is on the order of about 100 instructions. So save about 30 non-lined-and-instantiated-exactly-once instructions in code size.
> I don't have libcxx set up as my stdlib currently so I haven't been able to compile a large program (e.g. LLVM/clang) with it and see how things go down in that regard.
> Possible cons off the top of my head:
> - Increased register pressure. In the experimental version that I posted, the left and right pointer-to-members each get put in their own register, adding two registers of pressure. This can be improved by keeping just one of them and using xor to switch sides as necessary.
> - On architectures with only very simple addressing modes, this might require extra instructions for each memory access.
> I have yet to do any performance or holistic code-size measurements, but I want to solicit feedback on the idea to see if it might be a path worth investigating.
I don't believe code size is going to be a driver on this one as the potential savings tops out in the neighborhood of 30 instructions (which is small compared to the entire R/B tree container code). The bottom line is going to be performance. Your rewrite is interesting, but for me, significantly harder to read. So it will have to motivate with significant performance wins. And if it does, I think __tree_remove would benefit from the same treatment. However if we're looking at zero or marginal performance wins, then I favor the current code: it is easier to reason about and maintain.
More information about the cfe-dev