[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

Johannes Doerfert via llvm-dev llvm-dev at lists.llvm.org
Wed Apr 8 14:10:12 PDT 2020


Cool!

I skipped to the end and tried to access the gdoc and the spreadsheet
but it told me I need permissions. Can you make them accessible or
am I the problem?

Thanks,
   Johannes

On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
 > TL;DR; We can improve compiler optimizations driven by heuristics by
 > replacing those heuristics with machine-learned policies (ML models).
 > Policies are trained offline and ship as part of the compiler.
 > Determinism is maintained because models are fixed when the compiler is
 > operating in production. Fine-tuning or regressions may be handled by
 > incorporating the interesting cases in the ML training set, retraining
 > the compiler, and redeploying it.
 >
 > For a first milestone, we chose inlining for size (-Oz) on X86-64. We
 > were able to train an ML model to produce binaries 1.5-6% smaller than
 > -Oz of tip-of-tree. The trained model appears to generalize well over a
 > diverse set of binaries. Compile time is increased marginally (under
 > 5%). The model also happens to produce slightly better-performing code
 > under SPEC2006 - a total score improvement of 1.75%. As we only wanted
 > to verify there is no significant regression in SPEC, and given the
 > milestone goals, we haven’t dug any deeper into the speed results.
 >
 > We see these results as promising, and as a reasonable point for
 > contributing our current work as a build-time opt-in to LLVM, to benefit
 > the community, foster collaboration, and learn from the community’s
 > feedback as we try to better understand the trade-offs such an approach
 > entails, and as we work on expanding the depth and breadth of applying
 > these techniques to compiler optimization problems.
 >
 > https://reviews.llvm.org/D77752
 >
 > Motivation
 >
 > Optimization problems, such as inlining or register allocation, are hard:
 > we don’t know of algorithms that are guaranteed to produce the optimum
 > solution in all cases in a timely fashion. Instead, we use heuristics:
 > algorithms that we expect to produce some approximation of the optimum,
 > with some expected degree of generality, in a practical amount of time.
 >
 > Heuristics have some common characteristics. Taking inlining as a case
 > study: it traverses the problem space in some way (bottom-up traversal
 > of the SCC graph), extracts some properties (let’s call them “features”)
 > of the program being optimized, and combines them with some weights
 > (“tune”) to produce a cost (InlineCost), which allows for trade-off
 > analysis. We validate the effectiveness of a heuristic over some
 > accepted set of benchmarks. Over time, we react to regressions or
 > pathological cases observed in the field by manually analyzing such
 > cases, figuring out an enhancement to the heuristic, and then
 > re-validating over that set of benchmarks (maybe augmented by adding the
 > newly found cases).
 >
 > Because heuristics are code that needs to be maintained, there is
 > pressure to reduce complexity: adding more features means we need to
 > reason about the interactions between the old and new features, which
 > scales combinatorially. Re-tuning because of the new features adds a
 > similar kind of complexity. Potentially, we miss out on optimization
 > improvements as a result.
 >
 > Because tuning is manual, there is pressure to keep the number of
 > benchmarks that can be studied in depth to a humanly-manageable size,
 > which potentially affects the generality of a heuristic or heuristic
 > tuning.
 >
 > The main advantage of manual heuristics is arguably their relatively
 > lower overhead: they add no additional dependencies, and they are more
 > transparent to human analysis and improvement.
 >
 > Machine learning, in particular reinforcement learning, can address the
 > tensions found in manual heuristics: once features are extracted from the
 > program, the way they are combined and tuned can easily be scaled up
 > through automation, improving effectiveness and generality. A major
 > drawback of machine learning, at least at this point in time, is that we
 > don’t yet have a fully developed systematic approach for improving
 > policy effectiveness.
 >
 > High level design
 >
 > We identify two scenarios for a compiler using ML policies: development
 > and release.
 >
 > The release scenario is equivalent to the regular compilation we have
 > today - the only difference is that it uses a pre-trained model (trained
 > in the development scenario beforehand) to make decisions instead of the
 > heuristics. Determinism is guaranteed since the model in the release
 > scenario is fixed. We imagine teams wishing to fine-tune the
 > effectiveness of the optimization to their scenarios would train a
 > different model.
 >
 > The decision previously evaluated using a human-crafted heuristic is
 > optionally replaced by:
 >
 >    - a compiler-specific component, extracting features from IR (i.e. a
 >      vector of values)
 >    - an evaluation of an ML model using those features, to obtain a
 >      result. In ML nomenclature, this is referred to as using the model
 >      for inference (as opposed to training it)
 >
 > For example, when we replace the decision of whether to inline a call
 > site, the ML model produces a boolean (inline/don’t inline) based on a
 > feature vector characterizing the call site and some broader module-wide
 > context.
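 >
 > To make the inference step concrete, the following is a minimal sketch
 > (in Python, and purely illustrative - the feature names, paths, and
 > output handling are hypothetical) of evaluating a policy packaged as a
 > TensorFlow SavedModel on a single feature vector. In LLVM itself this
 > evaluation happens in C++, via the AOT-compiled model in ‘Rel’ mode or
 > the TF C API in ‘Dev’ mode, as described later.
 >
 >   # Illustrative only: evaluate an inlining policy saved as a TensorFlow
 >   # SavedModel on one feature vector. Feature names/paths are hypothetical.
 >   import tensorflow as tf
 >
 >   policy = tf.saved_model.load("/path/to/inlining/model")
 >   infer = policy.signatures["serving_default"]
 >
 >   features = {
 >       "callee_basic_block_count": tf.constant([12], dtype=tf.int64),
 >       "callsite_height": tf.constant([3], dtype=tf.int64),
 >       "node_count": tf.constant([40000], dtype=tf.int64),
 >   }
 >   outputs = infer(**features)            # dict: output name -> tensor
 >   should_inline = bool(int(list(outputs.values())[0].numpy()[0]))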
 >
 > Training/development is more complicated, and happens offline - akin to
 > how, today, attempts to improve an optimizing pass also happen offline.
 > A description of the high level design and the specifics we used for the
 > current scope are given in the Appendix.
 >
 > Current Scope
 >
 > The goal of our first milestone was to evaluate end to end an
 > integration of ML with LLVM, and get a first promising result. To that
 > end, we chose inlining for size (-Oz) as a stepping stone, as we
 > perceived it to be more likely to require a simpler evaluation setup
 > than performance-oriented optimizations might. At this point, we only
 > train the decision of whether a call site is inlined or not, leaving the
 > SCC traversal order as-is.
 >
 > We are proposing an initial change demonstrating the inference mechanism
 > using a pre-trained model, as a build-time opt-in to llvm. The compiler
 > components needed to perform training are also included in this first
 > change. Subsequent changes would include more training-related
 > components.
 >
 > At a high level, the changes we are proposing consist of:
 >
 >    1. a new module analysis pass, InliningAdvisor. By default, its
 >       implementation does nothing.
 >    2. minimal hooks into Inliner.cpp.
 >    3. the implementation of InliningAdvisor, activated when we opt in to
 >       ML. This is available in Analysis/ML, together with:
 >       a. Rel/Dev specific ML model handling, also under Analysis/ML
 >       b. a pre-trained model for inlining for size
 >          (Analysis/ML/models/inlining)
 >       c. a pre-trained model for predicting native size from IR
 >          (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
 >    4. some refactorings in PassBuilder, to allow opting into running
 >       mandatory inlining first - some compilation speedup for the ML
 >       case, with a minimal, noise-like size effect. This also simplifies
 >       testing (these would be introduced as a preliminary patch).
 >
 > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is more
 > involved.
 >
 > Inference Opt-In Mechanism
 >
 > The feature is primarily controlled by the cmake flag
 > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each mode has different dependencies.
 > The “Rel”ease case requires specifying the location of the pip
 > tensorflow package (currently, that’s tf_nightly; it should soon be
 > available in tensorflow).
 >
 > To opt in to the ‘Rel’ case:
 >
 >    1. install the tensorflow pip package:
 >
 >       pip3 install tf_nightly --user
 >
 >    2. configure the llvm build:
 >
 >       cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
 >         -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
 >         {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed -
 >          it’s usually in the path>}
 >
 >    3. build llvm as usual.
 >
 >    4. pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to
 >       clang.
 >
 > Details
 >
 > The ML model is captured as a TensorFlow ‘saved model’. When building
 > llvm, we use TensorFlow’s XLA native compiler (saved_model_cli) to
 > compile the saved model into a native static library and a header file.
 > Insofar as LLVM is concerned, there are minimal additional runtime
 > requirements, packaged as source within the pip package: C++ wrappers
 > around the compiled model. These will also be statically linked in the
 > LLVM target. The compiled code is otherwise just a complex arithmetical
 > computation, with no special requirements - it is single threaded and
 > runs natively on the targeted architecture. Together with the
 > aforementioned runtime dependencies, it adds ~115KB to the clang binary
 > (a 0.08% increase).
 >
 > Runtime-wise, we observed a ~10% increase in the time spent in the
 > inliner, for a large (33MB) binary IR module; inlining typically
 > consumes ~10-15% of total compilation time, so the overall compile time
 > overhead of the approach is arguably negligible. This cost is almost
 > entirely attributable to feature extraction.
 >
 > Memory-wise, the precompiled model has a fixed size buffer for its
 > inputs, and performs a fixed amount of computation, so the memory
 > overhead inherent to our approach is independent of the program being
 > optimized. Using a small example to avoid effects such as memory use
 > differences due to different inlinings, we observed a 300KB increase in
 > the maximum resident size.
 >
 > A table showing the effect on -Oz compiled binaries’ size is given in
 > the Appendix.
 >
 > Next directions
 >
 > Our next milestone has two main high level goals: detailing a systematic
 > approach to driving policy effectiveness; and exploring in depth the
 > type of features and training algorithms most appropriate for compiler
 > problems, or at least problems like inlining. For the latter, we expect
 > embedding more of the call graph structure to play an important role, as
 > well as, potentially, delegating the problem space traversal to the ML
 > model.
 >
 > We plan to include inlining for speed as part of our work on these goals.
 >
 > Appendix
 >
 > Training - High Level
 >
 > We use Reinforcement Learning (RL) to train the inline-for-size model.
 > At a high level, it is composed of 3 parts: training data collection,
 > model training, and iterative data collection/model training. We use
 > TensorFlow as our ML framework.
 >
 > Relatedly, we also needed to train a separate model to evaluate the
 > native size of a function, given its IR, in order to calculate a more
 > precise reward for the reinforcement learning algorithm (“IR2Native”).
 > We evaluated ‘just counting IR’ and TargetTransformInfo, but they
 > appeared to provide too noisy a signal for the reward, insofar as the RL
 > training algorithm for the inlining model was concerned. This model is
 > only used during training.
 >
 > RL - Training data collection: the training data we need to feed into a
 > reinforcement learning algorithm are sequences consisting of: the state
 > of the problem (i.e. features), the action (inline/not inline), and the
 > reward (native size shrinkage after inline/not inline, using IR2Native).
 > To collect these sequences, we hook logging infrastructure into the LLVM
 > inliner, so that it can produce logs after the inline optimization pass.
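 >
 > As a purely illustrative sketch (in Python; the field names and values
 > are hypothetical, not the actual log format), each logged decision can
 > be thought of as one such record:
 >
 >   # Hypothetical shape of one logged inlining decision: the state
 >   # (feature vector), the action taken, and the observed reward.
 >   from dataclasses import dataclass
 >   from typing import Dict
 >
 >   @dataclass
 >   class InlineDecisionRecord:
 >       features: Dict[str, int]  # e.g. {"callee_basic_block_count": 12, ...}
 >       action: int               # 1 = inline, 0 = don't inline
 >       reward: float             # estimated native size shrinkage (IR2Native)
 >
 >   record = InlineDecisionRecord(
 >       features={"callee_basic_block_count": 12, "callsite_height": 3},
 >       action=1,
 >       reward=24.0,
 >   )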
 >
 > RL - Model training: We use DQN (Deep Q-Network) to train our
 > inlining-for-size ML policy. At a high level, the DQN algorithm trains a
 > neural network to predict the value of different actions --- the DQN
 > policy then chooses to take the action with the highest predicted value.
 > In our scenario, we have two actions: 1) inline; 2) don't inline. The
 > neural network predicts the size reduction of these two actions based on
 > the features, and the policy then inlines if the network believes doing
 > so leads to the higher size reduction. After training finishes, the
 > process produces a TensorFlow SavedModel that takes features as input
 > and generates inline decisions (whether to inline or not) as output.
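 >
 > For readers less familiar with DQN, the sketch below (Python/TensorFlow)
 > shows the core idea: a Q-network with two outputs, trained against a
 > Bellman target from logged transitions, with the deployed policy simply
 > taking the argmax. The feature count, network shape, and hyperparameters
 > are made up for illustration; this is not our actual training code.
 >
 >   # Generic DQN training-step sketch; all sizes/hyperparameters are
 >   # illustrative.
 >   import numpy as np
 >   import tensorflow as tf
 >
 >   NUM_FEATURES = 11   # hypothetical feature vector length
 >   GAMMA = 0.99        # discount factor
 >
 >   def make_q_net():
 >       return tf.keras.Sequential([
 >           tf.keras.layers.Dense(64, activation="relu",
 >                                 input_shape=(NUM_FEATURES,)),
 >           tf.keras.layers.Dense(64, activation="relu"),
 >           tf.keras.layers.Dense(2),   # Q-values for [don't inline, inline]
 >       ])
 >
 >   q_net = make_q_net()
 >   target_net = make_q_net()
 >   target_net.set_weights(q_net.get_weights())
 >   optimizer = tf.keras.optimizers.Adam(1e-3)
 >
 >   @tf.function
 >   def train_step(states, actions, rewards, next_states, dones):
 >       # Bellman target: r + gamma * max_a' Q_target(s', a') for
 >       # non-terminal transitions.
 >       next_q = tf.reduce_max(target_net(next_states), axis=1)
 >       targets = rewards + GAMMA * next_q * (1.0 - dones)
 >       with tf.GradientTape() as tape:
 >           q_values = q_net(states)
 >           chosen_q = tf.reduce_sum(q_values * tf.one_hot(actions, 2), axis=1)
 >           loss = tf.reduce_mean(tf.square(targets - chosen_q))
 >       grads = tape.gradient(loss, q_net.trainable_variables)
 >       optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
 >       return loss
 >
 >   # The deployed policy is greedy: take the action with the highest
 >   # predicted value.
 >   def decide(features):
 >       q = q_net(np.asarray(features, dtype=np.float32)[None, :])
 >       return int(tf.argmax(q[0]).numpy())   # 1 = inline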
 >
 > The choice of features and reward is essential in model training. The
 > features are chosen with the consideration of being helpful in making
 > the decision --- the input to the cost model is a good starting point.
 > Ideally, the reward in the inline-for-size problem is the native size
 > shrinkage after inlining/not inlining. It is difficult to obtain
 > precisely, so we use an estimate as an alternative. This means that, for
 > training, we also need a model (not necessarily machine learned) for
 > estimating rewards.
 >
 > RL - Iterative data collection/model training: Reinforcement learning is
 > ideally an iterative model/policy improvement process in which: 1) the
 > trained model is deployed to the field to collect new data; 2) the newly
 > collected data are used to update the model. Thus, we need to do
 > iterative data collection/model training. To facilitate data collection
 > (avoiding complex build dependencies and time spent before/after the
 > pass being trained), we isolate the IR modules captured right before the
 > optimization we are interested in, and iterate on them with opt. We
 > bootstrap the training from the current heuristic, and stop the process
 > once we are happy with the outcome.
 >
 > IR2Native: We collect IR features (different from the ones used for
 > inlining) for each function at the end of inlining, right before we
 > perform function simplification passes, and right after. This means we
 > have two IR ‘shapes’ of the same function, and we know no further
 > inlinings will be performed, so whatever changes happen are based on
 > that IR. We then extract the native size at the end of compilation.
 > Together, this data forms two records per function that can be used in
 > supervised learning - the features are those extracted from IR, and the
 > label is the native size. Training IR2Native happens independently from
 > the training of the inliner model.
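 >
 > In terms of the kind of supervised learning this implies, a minimal
 > sketch might look as follows (Python/TensorFlow; the feature count,
 > network shape, and placeholder data are made up for illustration and are
 > not the actual IR2Native model):
 >
 >   # Illustrative regression setup: per-function IR features as input,
 >   # measured native size as the label.
 >   import numpy as np
 >   import tensorflow as tf
 >
 >   NUM_IR_FEATURES = 16   # hypothetical
 >
 >   model = tf.keras.Sequential([
 >       tf.keras.layers.Dense(64, activation="relu",
 >                             input_shape=(NUM_IR_FEATURES,)),
 >       tf.keras.layers.Dense(64, activation="relu"),
 >       tf.keras.layers.Dense(1),   # predicted native size
 >   ])
 >   model.compile(optimizer="adam", loss="mse")
 >
 >   # X: IR feature vectors; y: native sizes measured at the end of
 >   # compilation. Placeholder data stands in for the real corpus.
 >   X = np.random.rand(1000, NUM_IR_FEATURES).astype(np.float32)
 >   y = np.random.rand(1000).astype(np.float32)
 >   model.fit(X, y, epochs=10, batch_size=32, verbose=0)
 >
 >   # Export as a SavedModel so it can be consumed when training the
 >   # inlining policy.
 >   tf.saved_model.save(model, "/tmp/ir2native")
 >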
 > Training support for the current scope
 >
 > The initial change includes the logging mechanism, an ir2native model
 > trained for x86-64, and the means to rapidly iterate over development ML
 > models. For the components that will be included in subsequent changes,
 > the rest of this section describes the mechanisms we employed.
 >
 > To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C
 > API library <https://www.tensorflow.org/install/lang_c>. We then
 > configure the build:
 >
 >   cmake .. -DLLVM_USE_ML_POLICY=Dev \
 >     -DLLVM_TF_C_LIB=<path to unarchived package> \
 >     {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy the tensorflow
 >      shared library over, if it’s not on LD_LIBRARY_PATH}
 >
 >
 > To extract IR right before inlining, we hacked on top of the ThinLTO
 > infrastructure, interrupting its pre-link pipeline right before inlining.
 > This means clang produces binary IR files captured at that stage. We then
 > built a large target, obtaining a corpus of ~25K modules. We intend to
 > provide a clean mechanism in a subsequent change.
 >
 > To obtain features/labels for training this “IR to Native Size” model,
 > we had to make some changes to the AsmPrinter (to get real native sizes)
 > and llvm-readobj, as well as add an analysis pass for extracting the IR
 > features for this model. We plan on upstreaming these changes
 > subsequently.
 >
 > Finally, the infrastructure driving the policy training is currently
 > built on proprietary APIs, as it benefits from a distributed computing
 > infrastructure. We are currently evaluating options for open sourcing
 > it. In the meantime, we are presenting the high level implementation
 > details.
 >
 > To train a new model, the infrastructure performs 2 steps: extracting the
 > logs, and using them in a training algorithm.
 >
 > Log extraction is highly parallelizable: for each IR module in the
 > training corpus, we need to run opt once (or a few times, when we
 > explore improvements). Specifically, each run is this invocation:
 >
 >   opt -passes=scc-oz-module-inliner \
 >     -ml-inliner-ir2native-model=<path to ir2native> \
 >     -training-log=<path to training log output> \
 >     -enable-ml-inliner -mandatory-inlinings-first \
 >     -o <output> <module.o>
 >
 > Then collect the logs, and pass them to the next step.
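 >
 > As a minimal sketch of what such a driver might look like (Python; the
 > directory layout and paths are placeholders, and this is not the actual
 > infrastructure), the invocation above can be fanned out over the corpus:
 >
 >   # Run the opt invocation above over a corpus of pre-extracted IR
 >   # modules in parallel, producing one training log per module.
 >   import subprocess
 >   from multiprocessing import Pool
 >   from pathlib import Path
 >
 >   CORPUS_DIR = Path("corpus")   # modules captured right before inlining
 >   LOG_DIR = Path("logs")
 >   IR2NATIVE = "ir2native/saved_model"   # placeholder path
 >
 >   def extract_log(module: Path) -> Path:
 >       log = LOG_DIR / (module.stem + ".log")
 >       subprocess.run([
 >           "opt", "-passes=scc-oz-module-inliner",
 >           f"-ml-inliner-ir2native-model={IR2NATIVE}",
 >           f"-training-log={log}",
 >           "-enable-ml-inliner", "-mandatory-inlinings-first",
 >           "-o", "/dev/null", str(module),
 >       ], check=True)
 >       return log
 >
 >   if __name__ == "__main__":
 >       LOG_DIR.mkdir(exist_ok=True)
 >       modules = sorted(CORPUS_DIR.glob("*.o"))
 >       with Pool() as pool:
 >           logs = pool.map(extract_log, modules)
 >       # The collected logs feed the model training step.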
 >
 >
 > Experimental results
 >
 > Experimental results are available as follows:
 >
 >
 >    - SPEC2006 binary sizes (-Oz) and ‘test run’ scores:
 >      <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
 >    - Size report from some internal benchmarks and binaries, including
 >      opt and clang:
 >      <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
 >
 >
 > _______________________________________________
 > LLVM Developers mailing list
 > llvm-dev at lists.llvm.org
 > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


