[PATCH] D26872: Outliner: Add MIR-level outlining pass

Mon Feb 27 13:38:00 PST 2017

paquette marked 15 inline comments as done.
paquette added a comment.

In https://reviews.llvm.org/D26872#686155, @silvas wrote:

> ... 403.gcc and 483.xalancbmk (at least) seem to have a huge compile time slowdown (superlinear behavior?). Some rough numbers comparing LLC runtime:
>  403.gcc 11s -> 66s
>  483.xalancbmk 16s -> 144s
>  (so about 5-10x slowdown of LLC due to the suffix tree)
>
> Most of the time seems to be spent inside buildCandidateList. Sampling a couple stacks it seems like it is stuck in findBest...
>  ...
>  Modulo the pruning that is going on, we seem to do O(N) work in bestRepeatedSubstring once per outlining candidate. Is the pruning effective enough that the sum of all calls to bestRepeatedSubstring doesn't grow out of control? My suspicion is that it isn't, and I think a contrived case like AAABBBCCCDDD... (Assume "A" represents constant-size string large enough to be profitable to outline) will trigger O(N^2) behavior in the number of instructions in the module.
>  Is it possible to do algorithmically better? (exploiting suffix tree invariants maybe?)

Yeah that'd be a nasty case, and it's worth looking into, for sure.

Some quick ideas off the top of my head:

- Pre-prune nodes which can never lead to outlining candidates while setting suffix indices. This would still be vulnerable to programs that look like AABBCC, but may improve query time on average.
- Keep track of a collection of prospective "Outlining points". During the first traversal, if we find **anything** beneficial remember where it was. On the next traversal, if we have a next best point, start at that point instead of the root.
- Keep track of every beneficial substring, during one O(n) traversal, and prune overlaps choosing the most beneficial ones greedily.

In https://reviews.llvm.org/D26872#686155, @silvas wrote:

> ... it seems that the cost function does not take into account that the outlined functions will have some minimum alignment applied to them (or can you mark them as not requiring this alignment? still, it would end up depending on linker placement (alignment of adjacent sections) and such as to how much padding actually is inserted).

I'll have to look into that and see what happens.

In https://reviews.llvm.org/D26872#686155, @silvas wrote:

> On 483.xalancbmk, the suffix tree based outliner find 2311 functoins to outline, and almost all of them are 2 instructions, which is typically less than 16 bytes, which is the minimum alignment that will be imposed (just from looking at the output binary).
>  ...
>  Overall, it seems like the vast majority of the benefit on 483.xalancbmk is due to extremely short instruction sequences. But if we are going to avoid very short instruction sequences because they actually aren't profitable, then most of the outlinable instructions disappear on this test case (and at a glance, the other SPEC benchmarks are pretty similar). I'd also like to note that this testing is with FullLTO, so it is a best-case scenario for the outliner (whole program visibility to the suffix tree).

Did you modify the benefit function to verify that removing length-2 instruction sequences actually removes most candidates? We could have found, say BC as the most beneficial, which would prune out all instances of ABC. There could very well be repeated instances of ABC that would be beneficial to outline. It might be possible to impose a minimum length restriction on x86 without losing all of the candidates.

In https://reviews.llvm.org/D26872#686155, @silvas wrote:

> What kinds of programs does this outliner perform well on?

In the test suite, the x86 outliner tended to do well on programs with heavy macro usage or automatically-generated code.

As you found, x86 is a particularly hostile environment for this sort of pass. :) It was just used for a proof of concept and for ease of testing. Most work for this pass should be done for other targets, like, say ARM64.

In https://reviews.llvm.org/D26872#686770, @silvas wrote:

> btw, down the road you may want to have this pass really know in detail the encoded length of each instruction on x86. There are quite a few *single instructions* that would be beneficial from a code size perspective to outline (if the outlined function is set to have alignment of 1). A quick analysis of an LLD binary (which contains all of LLVM linked in for LTO) shows there is over 5% code size savings just from outlining single instructions (since many x86 instructions encode to be larger than a CALL instruction which is 5 bytes). About half of the benefit (so about 2-3% of the total on this test case) comes from instructions that reference the stack via %rsp (mostly zeroing out stack slots), which could still be outlined if the offset was rewritten.

I would really love to do this, but I'm not sure if it's possible in LLVM at the moment. If it is, then I'll gladly add it in since I think it's probably one of the main reasons that x86 tests can get larger rather than smaller. The only thing that's (architecturally) tricky is that the target would need to know about the instruction-integer mappings. This could be done by moving the InstructionMapper over to the target, but I'm not sure if that's the best approach. If it's okay to do that, I doubt it'd be too difficult.

https://reviews.llvm.org/D26872