[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
Mircea Trofin via llvm-dev
llvm-dev at lists.llvm.org
Wed Apr 8 14:34:06 PDT 2020
Unfortunately I needed to update the links. Here they are, hopefully these
work correctly.
SPEC2006
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=987260531&single=true>
Internal benchmarks, clang, opt
<https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=0&single=true>
Thanks!
On Wed, Apr 8, 2020 at 2:23 PM Mircea Trofin <mtrofin at google.com> wrote:
> It turns out it's me, sorry. Let me see how I can sort this out. In the
> meantime, here is the CSV:
>
> SPEC2006 data:
>
> binary,base -Oz size,ML -Oz size,ML size reduction vs. base,,perf: base -Oz score,perf: ML -Oz score,ML perf improvement vs. base
> 400.perlbench,2054200,2086776,-1.59%,,2.9,2.9,0.00%
> 401.bzip2,1129976,1095544,3.05%,,6.4,6.2,-3.13%
> 403.gcc,4078488,4130840,-1.28%,,11.6,11.7,0.86%
> 429.mcf,1089368,1054552,3.20%,,2.3,2.4,4.35%
> 433.milc,1161336,1129592,2.73%,,0.5,0.5,0.00%
> 444.namd,1324824,1290968,2.56%,,4.4,4.3,-2.27%
> 445.gobmk,5003096,4992472,0.21%,,3.9,4.1,5.13%
> 447.dealII,1003376,975024,2.83%,,162.1,197.8,22.02%
> 450.soplex,1359416,1326008,2.46%,,0.1,0.1,0.00%
> 453.povray,1921432,1952280,-1.61%,,32.9,32.7,-0.61%
> 456.hmmer,1210744,1184632,2.16%,,1.5,1.5,0.00%
> 458.sjeng,1185976,1155320,2.58%,,812.9,840.2,3.36%
> 462.libquantum,1101144,1066712,3.13%,,12.3,12.2,-0.81%
> 464.h264ref,1557176,1593272,-2.32%,,0.3,0.3,0.00%
> 470.lbm,1091352,1056664,3.18%,,2.3,2.5,8.70%
> 471.omnetpp,1045496,1046664,-0.11%,,27.9,28.4,1.79%
> 473.astar,1106680,1071992,3.13%,,2.4,2.6,8.33%
> 482.sphinx3,1216600,1182680,2.79%,,16.5,16.3,-1.21%
> 483.xalancbmk,4666936,4669112,-0.05%,,9.7,9.6,-1.03%
> ,,,,,,,
> SIZE SUM,34307616,34061104,0.72%,TOTAL SCORES,5.7,5.8,1.75%
>
>
> various benchmarks:
>
> binary,base -Oz size,ML -Oz size,ML size reduction vs. base
> llvm:opt,66318624,62124256,6.32%
> math_1,6181552,5884144,4.81%
> event_management_1,4998728,4802696,3.92%
> llvm:lcalsBRaw,1270168,1222168,3.78%
> llvm:lcalsBLambda,1270232,1222232,3.78%
> llvm:lcalsARaw,1276248,1228248,3.76%
> llvm:lcalsALambda,1276440,1228504,3.76%
> llvm:lcalsCRaw,1278936,1231000,3.75%
> llvm:lcalsCLambda,1279064,1231256,3.74%
> llvm:Blur,1236888,1191192,3.69%
> image_processing_diffusion,1236120,1190488,3.69%
> llvm:Interpolation,1237208,1191576,3.69%
> llvm:harris,1237208,1191768,3.67%
> llvm:Dither,1238232,1192792,3.67%
> event_management_2,4741096,4568584,3.64%
> hashing_1,5189360,5004176,3.57%
> compression_2,5329392,5140496,3.54%
> math_2,6072336,5858992,3.51%
> crypto_1,4721576,4556008,3.51%
> math_4,27720208,26755568,3.48%
> math_3,27732496,26767920,3.48%
> infra_1,30673232,29630464,3.40%
> hashing_2,5834544,5637168,3.38%
> image_processing_entropy,5854960,5657008,3.38%
> hashing_3,5831088,5634288,3.38%
> memory_management_1,4957960,4790632,3.37%
> hasing_4,5816048,5619888,3.37%
> data_storage_2,5852976,5655792,3.37%
> compression_1,5971984,5771120,3.36%
> arena_unittest,6672944,6453584,3.29%
> data_storage_3,44113232,42710960,3.18%
> data_storage_1,22858992,22132960,3.18%
> datastructures_1,6362032,6161264,3.16%
> infra_2,56828720,55053496,3.12%
> datastructures_2,164558704,160790320,2.29%
> llvm:clang,77027416,75672088,1.76%
> benchmark_infra,20221424,19944432,1.37%
> search_1,169475360,167645632,1.08%
> protobuf_3,24071504,23895440,0.73%
> protobuf_1,24071504,23895504,0.73%
> protobuf_2,24071504,23895504,0.73%
> protobuf_4,24071504,23895504,0.73%
> protobuf_5,24071504,23895504,0.73%
> protobuf_6,23982800,23809424,0.72%
>
> On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>>
>> Cool!
>>
>> I skipped to the end and tried to access the gdoc and the spreadsheet,
>> but it told me I need permissions. Can you make them accessible, or am
>> I the problem?
>>
>> Thanks,
>> Johannes
>>
>> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
>> > TL;DR: We can improve compiler optimizations driven by heuristics by
>> > replacing those heuristics with machine-learned policies (ML models).
>> > Policies are trained offline and ship as part of the compiler.
>> > Determinism is maintained because models are fixed when the compiler is
>> > operating in production. Fine-tuning, or addressing regressions, may be
>> > handled by incorporating the interesting cases into the ML training set,
>> > retraining the compiler, and redeploying it.
>> >
>> > For a first milestone, we chose inlining for size (-Oz) on X86-64. We
>> > were able to train an ML model to produce binaries 1.5-6% smaller than
>> > tip-of-tree -Oz. The trained model appears to generalize well over a
>> > diverse set of binaries. Compile time is increased marginally (under
>> > 5%). The model also happens to produce slightly better-performing code
>> > on SPEC2006 (total score improvement of 1.75%). As we only wanted to
>> > verify there is no significant regression on SPEC, and given the
>> > milestone goals, we haven’t dug any deeper into the speed results.
>> >
>> > We see these results as promising, and as a reasonable point at which
>> > to contribute our current work to LLVM as a build-time opt-in, to
>> > benefit the community. We hope to foster collaboration and learn from
>> > the community’s feedback as we try to better understand the trade-offs
>> > such an approach entails, and as we work on expanding the depth and
>> > breadth of applying these techniques to compiler optimization problems.
>> >
>> > https://reviews.llvm.org/D77752
>> > Motivation
>> >
>> > Optimization problems, such as inlining or register allocation, are
>> > hard: we don’t know of algorithms that are guaranteed to produce the
>> > optimum solution in all cases in a timely fashion. Instead, we use
>> > heuristics: algorithms that we expect to produce some approximation of
>> > the optimum, with some expected degree of generality, in a practical
>> > amount of time.
>> >
>> > Heuristics have some common characteristics. Taking inlining as a case
>> > study: the inliner traverses the problem space in some way (bottom-up
>> > traversal of the SCC graph), extracts some properties (let’s call them
>> > “features”) of the program being optimized, and combines them with some
>> > weights (the “tuning”) to produce a cost (InlineCost), which allows for
>> > trade-off analysis. We validate the effectiveness of a heuristic over
>> > some accepted set of benchmarks. Over time, we react to regressions or
>> > pathological cases observed in the field by manually analyzing such
>> > cases, figuring out an enhancement to the heuristic, and then
>> > re-validating over that set of benchmarks (maybe augmented by adding the
>> > newly found cases).
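>> >
>> > To make this concrete, here is a minimal, purely illustrative sketch of
>> > the shape such a hand-maintained heuristic takes - the features, weights,
>> > and threshold below are invented for illustration, not LLVM’s actual
>> > ones:
>> >
>> > // Illustrative only: hand-picked features combined with hand-tuned
>> > // weights and a hand-tuned threshold.
>> > struct CallSiteFeatures {
>> >   int CalleeInstructionCount;
>> >   int CalleeCallerCount; // how many other callers the callee has
>> >   bool CalleeHasLoops;
>> > };
>> >
>> > bool shouldInlineHeuristic(const CallSiteFeatures &F) {
>> >   int Cost = 2 * F.CalleeInstructionCount;   // manually chosen weight
>> >   if (F.CalleeCallerCount == 1)
>> >     Cost -= 25;                              // bonus: single caller
>> >   if (F.CalleeHasLoops)
>> >     Cost += 50;                              // manually chosen penalty
>> >   return Cost < 100;                         // manually tuned threshold
>> > }
>> >
>> > Every new feature added to such a function means revisiting how it
>> > interacts with the existing weights and threshold.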
>> >
>> > Because heuristics are code that needs to be maintained, there is
>> > pressure to reduce complexity: adding more features means we need to
>> > reason about the interactions between the old and new features, which
>> > scales combinatorially. Re-tuning because of the new features adds a
>> > similar kind of complexity. Potentially, we miss out on optimization
>> > improvements as a result.
>> >
>> > Because tuning is manual, there is pressure to keep the number of
>> > benchmarks that can be studied in depth to a humanly-manageable size,
>> > which potentially affects the generality of a heuristic or heuristic
>> > tuning.
>> >
>> > The main advantage of manual heuristics is arguably their relatively
>> > lower overhead: they add no additional dependencies, and they are more
>> > transparent to human analysis and improvement.
>> >
>> > Machine learning, in particular reinforcement learning, can address the
>> > tensions found in manual heuristics: once features are extracted from
>> > the program, the way they are combined and tuned can easily be scaled up
>> > through automation, improving effectiveness and generality. A major
>> > drawback of machine learning, at least at this point in time, is that we
>> > don’t yet have a fully developed, systematic approach for improving
>> > policy effectiveness.
>> >
>> > High level design
>> >
>> > We identify two scenarios for a compiler using ML policies: development
>> > and release.
>> >
>> > The release scenario is equivalent to the regular compilation we have
>> > today - the only difference is that it uses a pre-trained model (trained
>> > in the development scenario beforehand) to make decisions instead of the
>> > heuristics. Determinism is guaranteed since the model in the release
>> > scenario is fixed. We imagine teams wishing to fine-tune the
>> > effectiveness of the optimization for their scenarios would train a
>> > different model.
>> >
>> > The decision previously evaluated using a human-crafted heuristic is
>> > optionally replaced by:
>> >
>> > - a compiler-specific component, extracting features from IR (i.e. a
>> >   vector of values)
>> > - an evaluation of an ML model using those features, to obtain a result.
>> >   In ML nomenclature, this is referred to as using the model for
>> >   inference (as opposed to training it)
>> >
>> > For example, where we replaced the decision of whether to inline a call
>> > site, the ML model produces a boolean (inline/don’t inline) based on a
>> > feature vector characterizing the call site and some broader module-wide
>> > context.
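>> >
>> > As a sketch of the shape of that replacement (the names, features, and
>> > stub bodies below are invented for illustration; the real interface and
>> > feature set are in the patch):
>> >
>> > // Sketch only - not the actual API.
>> > #include <array>
>> > #include <cstdint>
>> >
>> > // e.g. callee size, call site properties, module-wide context
>> > using FeatureVector = std::array<int64_t, 3>;
>> >
>> > // (1) Compiler-specific component: extract features from the IR around
>> > // a call site. Stubbed out here.
>> > FeatureVector extractFeatures() { return {42, 3, 1}; }
>> >
>> > // (2) Evaluate the pre-trained model on those features (“inference”).
>> > // A real implementation copies the features into the model’s input
>> > // buffer, runs it, and reads back one output; this is a stand-in.
>> > bool modelEvaluate(const FeatureVector &F) { return F[0] < 100; }
>> >
>> > // The hand-written cost computation is replaced by: extract features,
>> > // then ask the model.
>> > bool shouldInlineCallSite() { return modelEvaluate(extractFeatures()); }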
>> >
>> > Training/development is more complicated, and happens offline - akin to
>> > how, today, attempts to improve an optimizing pass also happen offline.
>> > A description of the high-level design, and the specifics we used for
>> > the current scope, is given in the Appendix.
>> >
>> > Current Scope
>> >
>> > The goal of our first milestone was to evaluate, end to end, an
>> > integration of ML with LLVM, and to get a first promising result. To
>> > that end, we chose inlining for size (-Oz) as a stepping stone, as we
>> > perceived it to be more likely to require a simpler evaluation setup
>> > than performance-oriented optimizations might. At this point, we only
>> > train the decision of whether a call site may be inlined or not, leaving
>> > the SCC traversal order as-is.
>> >
>> > We are proposing an initial change demonstrating the inference
>> > mechanism using a pre-trained model, as a build-time opt-in to LLVM. The
>> > compiler components needed to perform training are also included in this
>> > first change. Subsequent changes would include more training-related
>> > components.
>> >
>> > At a high level, the changes we are proposing consist of:
>> >
>> > 1. a new module analysis pass, InliningAdvisor. By default, its
>> >    implementation does nothing.
>> > 2. minimal hooks into Inliner.cpp (the general pattern is sketched after
>> >    this list).
>> > 3. the implementation of InliningAdvisor, activated when we opt in to
>> >    ML. This is available in Analysis/ML, together with:
>> >    1. Rel/Dev-specific ML model handling, also under Analysis/ML
>> >    2. a pre-trained model for inlining for size
>> >       (Analysis/ML/models/inlining)
>> >    3. a pre-trained model for predicting native size from IR
>> >       (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
>> > 4. some refactorings in PassBuilder, to allow opting into running
>> >    mandatory inlining first - some compilation speedup for the ML case,
>> >    with a minimal, noise-like size effect. This also simplifies testing
>> >    (these would be introduced as a preliminary patch).
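>> >
>> > As a rough illustration of the hook pattern only (plain C++; not the
>> > actual pass names or interfaces from the patch): an analysis provides an
>> > advisor, the inliner consults it per call site, and it falls back to the
>> > existing heuristic when there is no advice.
>> >
>> > // Pattern sketch only - names are invented for illustration.
>> > #include <optional>
>> >
>> > struct CallSite {}; // stand-in for the real IR call site type
>> >
>> > class InliningAdvisorSketch {
>> > public:
>> >   virtual ~InliningAdvisorSketch() = default;
>> >   // Default implementation: no advice, i.e. it “does nothing”.
>> >   virtual std::optional<bool> adviseInlining(const CallSite &) {
>> >     return std::nullopt;
>> >   }
>> > };
>> >
>> > class MLInliningAdvisorSketch : public InliningAdvisorSketch {
>> > public:
>> >   // Opt-in implementation: extract features and run the bundled model,
>> >   // as sketched earlier.
>> >   std::optional<bool> adviseInlining(const CallSite &) override {
>> >     return true; // placeholder for feature extraction + inference
>> >   }
>> > };
>> >
>> > // The “minimal hook” in the inliner: prefer the advisor’s answer, if any.
>> > bool shouldInline(InliningAdvisorSketch &Advisor, const CallSite &CS,
>> >                   bool HeuristicSaysInline) {
>> >   return Advisor.adviseInlining(CS).value_or(HeuristicSaysInline);
>> > }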
>> >
>> > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as the
>> > latter is more involved.
>> >
>> > Inference Opt-In Mechanism
>> >
>> > The feature is primarily controlled by the cmake flag
>> > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The
>> > “Rel”ease case requires specifying the location of the pip tensorflow
>> > package (currently, that’s tf_nightly, and it should soon be available
>> > in tensorflow).
>> >
>> > To opt in to the ‘Rel’ case:
>> >
>> > 1. install the tensorflow pip package:
>> >
>> >    pip3 install tf_nightly --user
>> >
>> > 2. configure the llvm build:
>> >
>> >    cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
>> >      -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
>> >      {-DLLVM_TF_AOT_COMPILER=<path to the saved_model_cli tool, if
>> >       needed - it’s usually in the path>}
>> >
>> > 3. build llvm as usual.
>> >
>> > 4. pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to
>> >    clang.
>> >
>> > Details
>> >
>> > The ML model is captured as a TensorFlow ‘saved model’. When building
>> > llvm, we use TensorFlow’s XLA native compiler (saved_model_cli) to
>> > compile the saved model into a native static library and a header file.
>> > Insofar as LLVM is concerned, there are minimal additional runtime
>> > requirements, packaged as source within the pip package: C++ wrappers
>> > around the compiled model. These will also be statically linked into the
>> > LLVM target. The compiled code is otherwise just a complex arithmetical
>> > computation, with no special requirements - it is single-threaded and
>> > runs natively on the targeted architecture. Together with the
>> > aforementioned runtime dependencies, it adds ~115KB to the clang binary
>> > (a 0.08% increase).
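>> >
>> > For illustration, using such an AOT-compiled model from C++ looks
>> > roughly like the following - the header name, class name, and accessor
>> > names below are made up (the generated ones depend on the model and on
>> > the flags passed to the AOT compiler):
>> >
>> > // Rough shape only; not copied from the patch.
>> > #include "inliner_model.h" // hypothetical generated header
>> > #include <algorithm>
>> > #include <cstddef>
>> > #include <cstdint>
>> >
>> > bool runCompiledModel(const int64_t *Features, size_t NumFeatures) {
>> >   InlinerModel Model; // generated wrapper class
>> >   // Fixed-size input buffer: copy the features in.
>> >   std::copy(Features, Features + NumFeatures,
>> >             static_cast<int64_t *>(Model.arg0_data()));
>> >   Model.Run(); // plain, single-threaded arithmetic
>> >   return *static_cast<const int64_t *>(Model.result0_data()) != 0;
>> > }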
>> >
>> > Runtime-wise, we observed a ~10% increase in the time spent in the
>> > inliner, for a large (33MB) binary IR module; inlining typically
>> > consumes ~10-15% of total compilation time, so the overall compile time
>> > overhead of the approach is arguably negligible. This cost is almost
>> > entirely attributable to feature extraction.
>> >
>> > Memory-wise, the precompiled model has a fixed-size buffer for its
>> > inputs, and performs a fixed amount of computation, so the memory
>> > overhead inherent to our approach is independent of the program being
>> > optimized. Using a small example, to avoid effects such as memory use
>> > differences due to different inlinings, we observed a 300KB increase in
>> > the maximum resident size.
>> >
>> > A table showing the effect on the size of -Oz-compiled binaries is
>> > given in the Appendix.
>> >
>> > Next directions
>> >
>> > Our next milestone has two main high-level goals: detailing a
>> > systematic approach to driving policy effectiveness; and exploring in
>> > depth the types of features and training algorithms most appropriate for
>> > compiler problems, or at least for problems like inlining. For the
>> > latter, we expect embedding more of the call graph structure to play an
>> > important role, as well as, potentially, delegating the problem space
>> > traversal to the ML model.
>> >
>> > We plan to include inlining for speed as part of our work on these
>> > goals.
>> >
>> > Appendix
>> > Training - High Level
>> >
>> > We use Reinforcement Learning (RL) to train the Inline-for-size model.
>> > At a high level, the process is composed of three parts: training data
>> > collection, model training, and iterative data collection/model
>> > training. We use TensorFlow as our ML framework.
>> >
>> > Relatedly, we also needed to learn a separate model to evaluate the
>> > native size of a function, given its IR, in order to calculate a more
>> > precise reward for the reinforcement learning algorithm (“IR2Native”).
>> > We evaluated ‘just counting IR’ and TargetTransformInfo, but they
>> > appeared to provide too noisy a signal for the reward, insofar as the RL
>> > training algorithm for the inlining model was concerned. This model is
>> > only used during training.
>> >
>> > RL - Training data collection: the training data we need to feed into a
>> > reinforcement learning algorithm are sequences consisting of: the state
>> > of the problem (i.e. the features), the action (inline/not inline), and
>> > the reward (native size shrinkage after inline/not inline, using
>> > IR2Native). To collect the sequences, we hook logging infrastructure
>> > into the LLVM inliner, which is able to produce logs after the inline
>> > optimization pass runs.
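>> >
>> > Conceptually (this is just an illustration of the record layout, not the
>> > actual log format), each logged event looks like:
>> >
>> > // Illustrative record layout only.
>> > #include <cstdint>
>> > #include <vector>
>> >
>> > struct InlineTrainingEvent {
>> >   std::vector<int64_t> Features; // state: call site / module features
>> >   bool Inlined;                  // action taken by the policy
>> >   int64_t Reward;                // estimated native size shrinkage
>> > };
>> >
>> > // A module’s training log is the sequence of such events, in the order
>> > // the inliner visited the call sites.
>> > using TrainingLog = std::vector<InlineTrainingEvent>;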
>> >
>> > RL - Model training: We use DQN (Deep Q-Network) to train our
>> > inlining-for-size ML policy. At a high level, the DQN algorithm trains a
>> > neural network to predict the value of different actions --- the DQN
>> > policy then chooses to take the action with the highest predicted value.
>> > In our scenario, we have two actions: 1) inline; 2) don’t inline. The
>> > neural network predicts the size reduction of each of these two actions
>> > based on the features, and we inline if the network predicts that
>> > inlining leads to the higher size reduction. After training finishes,
>> > this produces a TensorFlow SavedModel that takes features as input and
>> > generates inline decisions (whether to inline or not) as output.
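>> >
>> > For reference, this is the standard DQN formulation (our exact training
>> > configuration may differ in its details): the network Q(s, a) estimates,
>> > for a state s (the feature vector) and an action a (inline / don’t
>> > inline), the expected total reward, and is trained to regress towards
>> > the target
>> >
>> >   y = r + gamma * max_a' Q(s', a')
>> >
>> > where r is the observed (estimated) size shrinkage for the action taken,
>> > s' is the next state, and gamma is a discount factor; at inference time
>> > the policy simply picks the action with the larger predicted Q value.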
>> >
>> > The choice of the features and of the reward is essential in model
>> > training. The features are chosen for how helpful they are in making the
>> > decision --- the input to the current cost model is a good starting
>> > point. Ideally, the reward in the inline-for-size problem is the native
>> > size shrinkage after inline/not inline. It is difficult to obtain
>> > precisely, so we use an estimate instead. This means that, for training,
>> > we also need a model (not necessarily machine-learned) for estimating
>> > rewards.
>> >
>> > RL - Iterative data collection/model training: Reinforcement learning
>> > is ideally an iterative model/policy improvement process in which: 1)
>> > the trained model is deployed to the field to collect new data; 2) the
>> > newly collected data are used to update the model. Thus, we need to do
>> > iterative data collection/model training. To facilitate data collection
>> > (avoiding complex build dependencies and time spent before/after the
>> > pass being trained), we isolate the IR modules captured right before the
>> > optimization we are interested in, and iterate on them with opt. We
>> > bootstrap the training from the current heuristic, and stop the process
>> > once we are happy with the outcome.
>> >
>> > IR2Native: We collect IR features (different from the ones used for
>> > inlining) for each function at the end of inlining, right before we
>> > perform the function simplification passes, and again right after. This
>> > means we have two IR ‘shapes’ of the same function, and we know no
>> > further inlinings will be performed, so whatever changes happen are
>> > based on that IR. We then extract the native size at the end of
>> > compilation. Together, this data forms two records per function that can
>> > be used in supervised learning - the features are those extracted from
>> > the IR, and the label is the native size. Training IR2Native happens
>> > independently of the training of the inliner model.
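>> >
>> > As a purely illustrative sketch of that data layout (not the actual
>> > schema):
>> >
>> > // Two such records per function, sharing the same label.
>> > #include <cstdint>
>> > #include <vector>
>> >
>> > struct IR2NativeRecord {
>> >   std::vector<int64_t> IRFeatures; // extracted either right before or
>> >                                    // right after function simplification
>> >   uint64_t NativeSizeBytes;        // label: native size measured at the
>> >                                    // end of compilation
>> > };
>> >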
>> > Training support for the current scope
>> >
>> > The initial change includes the logging mechanism, an ir2native model
>> > trained for x86-64, and the means to rapidly iterate over development ML
>> > models. The rest of this section describes the mechanisms we employed
>> > for the components that will be included in subsequent changes.
>> >
>> > To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow
>> > C API library <https://www.tensorflow.org/install/lang_c>. We then
>> > configure the build:
>> >
>> > cmake .. -DLLVM_USE_ML_POLICY=Dev \
>> >   -DLLVM_TF_C_LIB=<path to unarchived package> \
>> >   {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared
>> >    library over, if it’s not on LD_LIBRARY_PATH}
>> >
>> >
>> > To extract IR right before inlining, we hacked on top of the ThinLTO
>> > infrastructure, interrupting its pre-link pipeline right before
>> > inlining. This means clang produces binary IR files captured at that
>> > stage. We then built a large target, obtaining a corpus of ~25K modules.
>> > We intend to provide a clean mechanism in a subsequent change.
>> >
>> > To obtain features/labels for training this “IR to Native Size” model,
>> > we had to make some changes to the AsmPrinter (to get real native sizes)
>> > and llvm-readobj, as well as add an analysis pass for extracting the IR
>> > features for this model. We plan on upstreaming these changes
>> > subsequently.
>> >
>> > Finally, the infrastructure driving the policy training is currently
>> > built on proprietary APIs, as it benefits from a distributed computing
>> > infrastructure. We are currently evaluating options for open sourcing
>> > it. In the meantime, we are presenting the high-level implementation
>> > details.
>> >
>> > To train a new model, the infrastructure performs two steps: extracting
>> > the logs, and using them in a training algorithm.
>> >
>> > Log extraction is highly parallelizable: for each IR module in the
>> > training corpus, we need to run opt once (or a few times, when we
>> > explore improvements). Specifically, each run is this invocation:
>> >
>> > opt -passes=scc-oz-module-inliner \
>> >   -ml-inliner-ir2native-model=<path to ir2native> \
>> >   -training-log=<path to training log output> \
>> >   -enable-ml-inliner -mandatory-inlinings-first \
>> >   -o <output> <module.o>
>> >
>> > We then collect the logs and pass them to the next step.
>> >
>> >
>> > Experimental results
>> >
>> > Experimental results are available as follows:
>> >
>> > - SPEC2006
>> >   <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
>> >   binary sizes (-Oz) and ‘test run’ scores.
>> > - Size report
>> >   <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
>> >   from some internal benchmarks and binaries, including opt and clang
>> >
>> >
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>