[llvm-dev] [RFC] BOLT: A Framework for Binary Analysis, Transformation, and Optimization

Sat Oct 31 17:52:49 PDT 2020

Hi everyone,

Sorry for being a bit late with a response I just wanted to point out that
at Yandex we used BOLT for the main search engine, it helped gain us 3-4%
of additional performance. My colleagues will be very happy to see this
technology merged into the LLVM toolchain.

Danila

>
> On Tue, Oct 20, 2020 at 11:39 AM Maksim Panchenko via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi All,
>>
>>
>>
>> Please find below an RFC for adding a binary optimization framework to
>> LLVM.
>>
>>
>>
>> Thanks for the feedback,
>>
>> Maksim & BOLT Team
>>
>>
>> [RFC] BOLT: A Framework for Binary Analysis, Transformation, and
>> Optimization
>>
>>
>> 1. Background and Motivation
>>
>>
>>
>> BOLT is a static post-link binary optimizer successfully used inside and
>> outside of Facebook for code optimizations that complement traditional
>> compiler whole-program and link-time optimizations [1]. Last year Google
>> reported that BOLT accelerates their key workloads by 2% to 6% [2].
>> Additionally, BOLT is used in academia as a framework for program
>> instrumentation, transformation, and binary analysis [3].
>>
>>
>>
>> We believe that including BOLT in the LLVM project will benefit the
>> community in several ways [4]. First, BOLT is useful as a binary optimizer.
>> It has been used at Facebook to accelerate the top services, and we would
>> love to see more people benefit from the performance boost that BOLT
>> brings. We would also love to see our partners adopt BOLT's new use-cases,
>> such as optimizing mobile and embedded applications. Beyond the binary
>> optimizer, BOLT is an impressive infrastructure that enables research in
>> the following areas:
>>
>>    - Advanced disassembly
>>       - Low-level program instrumentation
>>       - Static analysis
>>
>>
>> 2. Overview
>>
>>
>>
>> While BOLT uses several LLVM libraries [5], it is currently a separate
>> project based on an out-of-tree version of LLVM [6]. Most of the code lives
>> under separate tools/llvm-bolt directory, and required changes to LLVM are
>> included in the supplied patch [7]. The patch mainly consists of backported
>> fixes and minor extensions of the existing interfaces to update debug info,
>> frame information, and ORC.
>>
>>
>>
>> BOLT has no external build dependencies outside of LLVM. For profile
>> collection, we recommend using sampling with a Linux perf tool [8]. LBR
>> (last branch records) feature [9] is recommended as it improves profile
>> quality dramatically. BOLT can be supplied perf.data profile directly, but
>> in general, we advise converting the profile first using the supplied
>> perf2bolt utility. In case the sampling profiling is unavailable, e.g.,
>> while running under a hypervisor locally or in the cloud, BOLT provides a
>> builtin instrumentation.
>>
>>
>>
>> BOLT supports x86-64 ELF as the primary platform. We have also
>> implemented the initial support for Aarch64, and the support for MachO is
>> in the works.
>>
>>
>> 3. USE CASES
>>
>>
>> 3.1 Link-time and binary transformations and optimizations
>>
>>
>>
>> Static profile-driven post-link optimization of an application was our
>> primary goal when creating BOLT. The convenience and expandability that the
>> model offers perhaps could only be exceeded by a dynamic binary
>> optimization approach. E.g., to optimize a binary using a perf-generated
>> profile, the user has to execute a single command:
>>
>>
>>
>> $ llvm-bolt a.out -data perf.data -o a.out.bolt <optimization opts>
>>
>>
>>
>> No recompilation of a.out is needed (*), but we ask to link with
>> "--emit-relocs" for maximum performance gains. However, the latter is not a
>> strict requirement, and the command works even on stripped binaries.
>>
>>
>>
>> We have worked on reducing BOLT processing time and memory consumption.
>> Overall, BOLT efficiency has been improved by over 3X during that process.
>> It takes less than a minute to optimize HHVM [10] production binary with
>> over 100 MB of code and less than 3 minutes to optimize another
>> multi-gigabyte production binary with 500 MB of code. BOLT memory
>> consumption is under 5 GB for HHVM and under 13 GB for the large binary
>> (**).
>>
>>
>>
>> Fast turn-around times for optimizing an application with BOLT without
>> the need for source code allow us to design a system that automatically
>> manages binary profiling and optimization. This is one of the most exciting
>> directions in application optimization we are currently pursuing.
>>
>>
>>
>> We thought it would be interesting to perform a fresh comparison of BOLT
>> overhead to Propeller [11]. Sadly, Propeller relies on a custom version of
>> an external create-llvm-prof tool that we could not build in our setup, and
>> using a GitHub-hosted binary version of that tool in the virtual
>> environment produced a profile that caused a performance regression of the
>> optimized application. Using "-fbasic-block-sections=all" Propeller option
>> without a profile resulted in fast linking times but also caused a
>> performance regression.
>>
>>
>>
>> Although code reordering is the primary optimization in BOLT and the
>> source of most performance gains, it is not the only optimization in BOLT.
>> The full list of optimizations includes late inlining, indirect call
>> promotion, frame optimizations, and others. The convenience of adding new
>> optimizations in whole-program post-link mode is one of the main advantages
>> that the BOLT framework offers, be it for research or practical purposes.
>>
>>
>>
>> Additionally, BOLT can add new code to a linked ELF binary. We have
>> recently used that capability to allocate frequently-executed code on huge
>> pages automatically. Even legacy applications can now use 2MB pages for
>> code on x86-64 Linux systems to reduce the number of iTLB misses.
>>
>>
>>
>> BOLT's ability to process and optimize binary code without source code,
>> such as legacy/distributed binaries, or code from third-party and
>> assembly-written code, provides another advantage over a pure
>> compiler-based approach. This advantage could or could not be important for
>> optimizations depending on the user scenario. However, the visibility into
>> the code mentioned above could be critical for other code re-writing cases
>> such as vulnerability mitigations, instrumentation, and static analysis.
>>
>>
>>
>> *) Support for input with split functions is in the works. Before it is
>> completed, we ask not to use "-freorder-blocks-and-partition" compiler
>> option during the build. A similar or better result will be achieved by
>> running the BOLT function splitting pass.
>>
>>
>>
>> **) while running BOLT with "-reorder-blocks=cache+
>> -reorder-functions=hfsort -split-functions=1 -split-eh" optimization
>> options. DWARF update takes extra time and memory.
>>
>>
>> 3.2 Advanced Disassembly
>>
>>
>>
>> Internally, BOLT identifies code in the binary, breaks it into functions,
>> disassembles, and uses static analysis to build a control flow graph. This
>> functionality could complement a traditional disassembler, as it adds the
>> ability to display possible targets for indirect jumps/calls, providing a
>> better understanding of the control flow in a function.
>>
>>
>>
>> Additionally, for performance analysis, the disassembly is annotated with
>> execution counts, including those for indirect branch targets within a
>> function (jump tables) and across functions (indirect and virtual function
>> calls). E.g.:
>>
>>
>>
>>   <Basic Block> .LFT35985
>>
>>   Exec Count : 42
>>
>>   Predecessors: .Ltmp935657
>>
>>       00003c8b: leaq "JUMP_TABLE/foo/1.14"(%rip), %rcx
>>
>>       00003c92: movslq (%rcx,%rax,4), %rax
>>
>>       00003c96: addq %rcx, %rax
>>
>>       00003c99: jmpq *%rax # JUMPTABLE @0x6e0f94
>>
>>   Successors: .Ltmp935702 (count: 0, mispreds: 0),
>>
>>               .Ltmp935705 (count: 41, mispreds: 26),
>>
>>               .Ltmp935703 (count: 0, mispreds: 0),
>>
>>               .Ltmp935704 (count: 1, mispreds: 0)
>>
>> ....
>>
>>   <BasicBlock>.LFT43 (9 instructions, align : 1)
>>
>>   Exec Count : 8
>>
>>   Predecessors: .LBB01191
>>
>>       00000028: movq %rdx, %rbx
>>
>>       0000002b: leaq 0x20(%rsp), %rdi
>>
>>       00000030: movl $0x2, %edx
>>
>>       00000035: movq %rbx, %rsi
>>
>>       00000038: callq *%rax # CallProfile: 8 (8 mispreds) :
>>
>>               { f1: 3 (3 mispreds) },
>>
>>               { f2: 1 (1 mispreds) },
>>
>>               { f3: 4 (4 mispreds) }
>>
>>       0000003a: movdqu 0x10(%rbx), %xmm0
>>
>>       0000003f: movdqu %xmm0, 0x30(%rsp)
>>
>>       00000045: movq %xmm0, %rcx
>>
>>       0000004a: jmp .Ltmp26968
>>
>>   Successors: .Ltmp26968 (count: 8, mispreds: 0)
>>
>>
>>
>> With LLVM integration, the advanced disassembler can be added as a new
>> standalone tool or as an extra feature to existing tools such as
>> *llvm-objdump*.
>>
>>
>> 3.3 Low-Level Program Instrumentation
>>
>>
>>
>> Tools, such as memory sanitizers, rely on compiler-generated memory
>> instrumentation to detect application errors at runtime. Suppose part of
>> the application is written in assembly or comes from a library with no
>> sources. In that case, such code could become a source of false positives
>> and false negatives depending on the memory access types missed by the
>> instrumentation. Since BOLT can instrument arbitrary code, independent of
>> the source, it can complement compiler-based instrumentation and fill in
>> the missing parts leading to a higher quality signal from the tool.
>>
>>
>> 3.4 Static Analysis
>>
>>
>>
>> BOLT offers an intuitive API to perform compiler-level analyses directly
>> on machine code. Because BOLT reconstructs the control-flow graph of the
>> program, it allows pass writers to extract arbitrary information beyond the
>> scope of a single basic block with data-flow analyses. As an example, we
>> can perform shrink wrapping by checking how stack-accessing instructions
>> are using the frame in a given function and validating opportunities to
>> move memory accesses in hot basic blocks to colder areas.
>>
>>
>>
>> While this framework provides the optimization writer with greater reach
>> over previously opaque external third-party binary code linked in the
>> binary, perhaps the greatest value of this is in this code being visible at
>> all. With static analysis, users can write passes to draw conclusions on
>> safety concerns as well, such as checking if system or library calls are
>> being used in a suspicious way without the need to execute the binary.
>>
>>
>> 4. Plans
>>
>>
>>
>> BOLT is a mature project that has been used in production for years, but
>> we continue to develop BOLT and invest in new use-cases and optimizations.
>> Below is a shortlist of areas that are currently under development:
>>
>>    1. Automatic profile collection and optimization
>>    2. MachO support
>>    3. LLD integration
>>    4. Optimizing Linux kernel
>>
>>
>> --
>>
>> BOLT Team
>>
>>
>> References
>>
>> [1] Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni.
>> 2019. BOLT: a practical binary optimizer for data centers and beyond. In
>> "Proceedings of the 2019 IEEE/ACM International Symposium on Code
>> Generation and Optimization" (CGO 2019). IEEE Press, 2–14.
>> https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/
>>
>> [2] https://lists.llvm.org/pipermail/llvm-dev/2019-September/135393.html
>>
>> [3] Joe Savage and Timothy M. Jones. 2020. HALO: post-link heap-layout
>> optimisation. In "Proceedings of the 18th ACM/IEEE International Symposium
>> on Code Generation and Optimization" (CGO 2020). Association for Computing
>> Machinery, New York, NY, USA, 94–106. DOI:
>> https://doi.org/10.1145/3368826.3377914
>>
>> [4] https://github.com/facebookincubator/BOLT/issues/46
>>
>> [5] Maksim Panchenko. 2016. Building Binary Optimizer with LLVM. 2016
>> European LLVM Developers' Meeting.
>> https://llvm.org/devmtg/2016-03/#presentation7
>>
>> [6] https://github.com/facebookincubator/BOLT
>>
>> [7] https://github.com/facebookincubator/BOLT/blob/master/llvm.patch
>>
>> [8] perf: Linux profiling with performance counters.
>> https://perf.wiki.kernel.org/index.php/Main_Page.
>>
>> [9] Runtime Performance Optimization Blueprint: Intel® Architecture
>> Optimizations with LBR.
>> https://software.intel.com/content/www/us/en/develop/articles/runtime-optimization-blueprint-IA-optimization-with-last-branch-record.html
>>
>> [10] The HipHop Virtual Machine. https://hhvm.com/
>>
>> [11] Propeller RFC.
>> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf.
>> Updated performance results:
>> https://lists.llvm.org/pipermail/llvm-dev/2019-October/136030.html
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201101/aebbfaa6/attachment.html>