[PATCH] D26872: Outliner: Add MIR-level outlining pass

Fri Feb 24 13:22:06 PST 2017

silvas added a comment.

I just tested on SPECCPU2006 (FullLTO) and no assertion failures!

However, 403.gcc and 483.xalancbmk (at least) seem to have a huge compile time slowdown (superlinear behavior?). Some rough numbers comparing LLC runtime:
403.gcc 11s -> 66s
483.xalancbmk 16s -> 144s
(so about 5-10x slowdown of LLC due to the suffix tree)

Most of the time seems to be spent inside buildCandidateList. Sampling a couple of stacks, it seems to be stuck in findBest, usually just 1 or 2 stack frames deep, so at least the problem isn't that it is recursing too deeply.
I added some printfs to print out the depth and vertex degree of each node in the suffix tree for 483.xalancbmk and I got this: https://reviews.llvm.org/F3114496
F3114496: Screenshot from 2017-02-24 01:57:49.png <https://reviews.llvm.org/F3114496>
So it makes sense that typically one would be only 1 or 2 stack frames deep.

Modulo the pruning that is going on, we seem to do O(N) work in bestRepeatedSubstring once per outlining candidate. Is the pruning effective enough that the sum of all calls to bestRepeatedSubstring doesn't grow out of control? My suspicion is that it isn't, and I think a contrived case like AAABBBCCCDDD... (assume "A" represents a constant-size string large enough to be profitable to outline) will trigger O(N^2) behavior in the number of instructions in the module.
Is it possible to do algorithmically better? (exploiting suffix tree invariants maybe?)

Also, it looks like this pass actually increases (1-5%) text size on all of the SPEC binaries except for 401.bzip2: https://reviews.llvm.org/F3114409
F3114409: Screenshot from 2017-02-24 00:47:27.png <https://reviews.llvm.org/F3114409>

(I double- and triple-checked and I don't have it switched around; the raw data, with the labels double-checked as well, is: https://reviews.llvm.org/P7971)

Can you please find out why this isn't helping (and in fact is hurting)? Are better heuristics needed? At the very least, the cost function seems like it needs to be amended to take into account the true overheads.

In particular, it seems that the cost function does not take into account that the outlined functions will have some minimum alignment applied to them (or can you mark them as not requiring this alignment? still, it would end up depending on linker placement (alignment of adjacent sections) and such as to how much padding actually is inserted).
On 483.xalancbmk, the suffix tree based outliner finds 2311 functions to outline, and almost all of them are 2 instructions, which is typically less than 16 bytes, which is the minimum alignment that will be imposed (just from looking at the output binary).
A naive approach which just looks for identical runs of outlinable instructions (ignoring substrings) outlines 2391 functions (slightly more). The total benefit is somewhat greater for the suffix tree though at 29379 vs 27994 for the naive approach.
This appears to be due to the outliner finding many more length-2 sequences to outline: https://reviews.llvm.org/F3114690
F3114690: Screenshot from 2017-02-24 04:32:13.png <https://reviews.llvm.org/F3114690>
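
To make the alignment concern concrete, here is a rough sketch of a byte-level cost model. Everything in it is an assumption for illustration, not something taken from the patch: the helper names are hypothetical, the call is assumed to be a 5-byte `callq`, the return a 1-byte `retq`, and the outlined function is assumed to be padded all the way up to a 16-byte boundary (actual padding depends on linker placement, as noted above).

```cpp
#include <cstdint>

// Assumed x86-64 sizes; none of these constants come from the patch.
const std::int64_t CallBytes = 5;       // callq rel32
const std::int64_t RetBytes = 1;        // retq
const std::int64_t FunctionAlign = 16;  // assumed worst-case padding

// Round Bytes up to the next multiple of Align.
std::int64_t alignTo(std::int64_t Bytes, std::int64_t Align) {
  return (Bytes + Align - 1) / Align * Align;
}

// Hypothetical helper: net text bytes saved by outlining a sequence of
// SeqBytes that occurs Occurrences times. Negative means text grows.
std::int64_t netBytesSaved(std::int64_t SeqBytes, std::int64_t Occurrences) {
  // Before outlining, every occurrence carries the full sequence.
  std::int64_t Before = SeqBytes * Occurrences;
  // After outlining, every occurrence is a call, plus one padded
  // copy of the sequence followed by a return.
  std::int64_t After =
      CallBytes * Occurrences + alignTo(SeqBytes + RetBytes, FunctionAlign);
  return Before - After;
}
```

Under these assumptions, a 7-byte two-instruction sequence saves only 2 bytes per call site, so `netBytesSaved(7, 8)` is 0 and it takes a ninth occurrence before outlining it wins anything at all.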

Overall, it seems like the vast majority of the benefit on 483.xalancbmk is due to extremely short instruction sequences. But if we are going to avoid very short instruction sequences because they actually aren't profitable, then most of the outlinable instructions disappear on this test case (and at a glance, the other SPEC benchmarks are pretty similar). I'd also like to note that this testing is with FullLTO, so it is a best-case scenario for the outliner (whole program visibility to the suffix tree). What kinds of programs does this outliner perform well on?

For reference, here are all the outlined functions from 483.xalancbmk: https://reviews.llvm.org/F3114805

One interesting thing is that they are almost all short sequences of `mov` instructions. Staring at the code that calls them, it's clear why this is: almost all of the outlined functions in 483.xalancbmk are in sequences like this:

  ...
    362a1e:       e8 dd 20 0e 00          callq  444b00 <OUTLINED_FUNCTION2637142655534006531_61>
    362a23:       e8 4c e8 09 00          callq  401274 <_ZN11xercesc_2_512XMLBufferMgr13releaseBufferERNS_9XMLBufferE>
  ...

(FWIW, I tried and IPRA doesn't actually decrease text size much on SPEC with FullLTO)

I.e. what has been outlined is function setup overhead. There are also quite a few outlined functions right before jumps, which are factoring out code sequences like this:

  00000000004a2eb0 <OUTLINED_FUNCTION2637142655534006531_2458>:
    4a2eb0:       48 8b 41 18             mov    0x18(%rcx),%rax
    4a2eb4:       48 85 c0                test   %rax,%rax
    4a2eb7:       c3                      retq
    4a2eb8:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
    4a2ebf:       00

================
Comment at: lib/CodeGen/MachineOutliner.cpp:1031
+
+  // Check for overlaps in the range. This is O(n^2) worst case, but we can
+  // alleviate that somewhat by bounding our search space using the start
----------------
If I understand what this is doing correctly, it can be easily made less than O(N^2) by sorting ascending by Start and descending by End (SROA does something similar to do efficient overlap calculations).
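
A minimal sketch of the sort-then-sweep idea, assuming each candidate is a closed range of instruction indices. The `Range` struct and `dropOverlaps` are hypothetical names, not code from the patch, and the sketch simply keeps the earliest-starting range of each overlapping group; the real pass would presumably want to prefer the more beneficial candidate instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an outlining candidate: the closed range
// [Start, End] of instruction indices that it covers.
struct Range {
  std::size_t Start;
  std::size_t End;
};

// Returns a non-overlapping subset of Ranges in O(N log N).
std::vector<Range> dropOverlaps(std::vector<Range> Ranges) {
  // Ascending by Start, descending by End, so that for equal starts
  // the longest range is visited first.
  std::sort(Ranges.begin(), Ranges.end(),
            [](const Range &A, const Range &B) {
              if (A.Start != B.Start)
                return A.Start < B.Start;
              return A.End > B.End;
            });

  std::vector<Range> Kept;
  for (const Range &R : Ranges) {
    // Kept is sorted and non-overlapping, so its last element has the
    // largest End seen so far; one comparison detects any overlap.
    if (!Kept.empty() && R.Start <= Kept.back().End)
      continue;
    Kept.push_back(R);
  }
  return Kept;
}
```

For example, the ranges {5,7}, {0,3}, {2,4}, {0,1}, {8,9} reduce to {0,3}, {5,7}, {8,9} after one sort and one linear pass.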

================
Comment at: lib/Target/X86/X86InstrInfo.cpp:10387
+
+unsigned X86InstrInfo::outliningBenefit(size_t SequenceSize,
+                                        size_t Occurrences) const {
----------------
This name does not follow the coding standard. It should be `getOutliningBenefit` or something similar.

================
Comment at: lib/Target/X86/X86InstrInfo.cpp:10400
+
+bool X86InstrInfo::functionIsSafeToOutlineFrom(MachineFunction &MF) const {
+  return MF.getFunction()->hasFnAttribute(Attribute::NoRedZone);
----------------
Likewise, this should be `isFunctionSafeToOutlineFrom`.

https://reviews.llvm.org/D26872