[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM
Johannes Doerfert via llvm-dev
llvm-dev at lists.llvm.org
Wed Apr 8 14:10:12 PDT 2020
Cool!
I skipped to the end and tried to access the gdoc and the spreadsheet,
but it told me I need permissions. Can you make them accessible, or
am I the problem?
Thanks,
Johannes
On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
> TL;DR: We can improve compiler optimizations driven by heuristics by
> replacing those heuristics with machine-learned policies (ML models).
> Policies are trained offline and ship as part of the compiler. Determinism
> is maintained because models are fixed when the compiler is operating in
> production. Fine-tuning or regressions may be handled by incorporating the
> interesting cases in the ML training set, retraining the compiler, and
> redeploying it.
>
> For a first milestone, we chose inlining for size (-Oz) on X86-64. We were
> able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
> tip-of-tree. The trained model appears to generalize well over a diverse
> set of binaries. Compile time is increased marginally (under 5%). The model
> also happens to produce slightly better-performing code under SPEC2006:
> total score improvement of 1.75%. As we only wanted to verify there is no
> significant regression in SPEC, and given the milestone goals, we haven’t
> dug any deeper into the speed results.
>
> We see these results as promising, and as a reasonable point for
> contributing our current work as a build-time opt-in to LLVM to benefit the
> community, in the hope of fostering collaboration and learning from the
> community’s feedback, as we try to better understand the trade-offs such an
> approach entails, and as we work on expanding the depth and breadth of
> applying these techniques to compiler optimization problems.
>
> https://reviews.llvm.org/D77752
> Motivation
>
> Optimization problems, such as inlining or register allocation, are hard:
> we don’t know of algorithms that are guaranteed to produce the optimum
> solution in all cases in a timely fashion. Instead, we use heuristics:
> algorithms that we expect to produce some approximation of the optimum,
> with some expected degree of generality, in a practical amount of time.
>
> Heuristics have some common characteristics. Taking inlining as a case
> study: it traverses the problem space in some way (bottom-up traversal of
> the SCC graph), extracts some properties (let’s call them “features”) of
> the program being optimized, and combines them with some weights (“tune”),
> to produce a cost (InlineCost), which allows for trade-off analysis. We
> validate the effectiveness of a heuristic over some accepted set of
> benchmarks. Over time, we react to regressions or pathological cases
> observed in the field by manually analyzing such cases, figuring out an
> enhancement to the heuristic, and then re-validating over that set of
> benchmarks (maybe augmented by adding the newly found cases).
>
> Because heuristics are code that needs to be maintained, there is pressure
> to reduce complexity: adding more features means we need to reason about
> the interactions between the old and new features, which scales
> combinatorially. Re-tuning because of the new features adds a similar kind
> of complexity. Potentially, we miss out on optimization improvements as a
> result.
>
> Because tuning is manual, there is pressure to keep the number of
> benchmarks that can be studied in depth to a humanly-manageable size, which
> potentially affects the generality of a heuristic or heuristic tuning.
>
> The main advantage of manual heuristics is arguably their relatively lower
> overhead: no additional dependencies, and greater transparency to human
> analysis and improvement.
>
> Machine learning, in particular reinforcement learning, can address the
> tensions found in manual heuristics: once features are extracted from the
> program, the way they are combined and tuned can easily be scaled up
> through automation, improving effectiveness and generality. A major
> drawback of machine learning, at least at this point in time, is that we
> don’t yet have a fully developed systematic approach for improving policy
> effectiveness.
> High level design
>
> We identify two scenarios for a compiler using ML policies: development
> and release.
>
> The release scenario is equivalent to the regular compilation we have today
> - the only difference is that it uses a pre-trained model (trained in the
> development scenario beforehand) to make decisions instead of the
> heuristics. Determinism is guaranteed since the model in the release
> scenario is fixed. We imagine teams wishing to fine-tune the effectiveness
> of the optimization to their scenarios would train a different model.
>
> The decision previously evaluated using a human-crafted heuristic is
> optionally replaced by:
>
> - a compiler-specific component, extracting features from IR (i.e. a
>   vector of values)
> - an evaluation of an ML model using those features, to obtain a result.
>   In ML nomenclature, this is referred to as using the model for inference
>   (as opposed to training it)
>
> For example, when we replaced the decision of whether to inline a callsite,
> the ML model produces a boolean (inline/don’t inline) based on a feature
> vector characterizing the call site and some broader module-wide context.
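>
> To make the two steps above concrete, here is a minimal, self-contained
> C++ sketch of what they amount to. The names and features are hypothetical
> (the actual interfaces are in the patch), and the model is replaced by a
> placeholder computation so the sketch stands on its own:
>
>   #include <array>
>   #include <cstdint>
>
>   // Hypothetical feature vector for one call site, plus some module-wide
>   // context; the real feature set is defined by the patch.
>   struct CallSiteFeatures {
>     int64_t CalleeBasicBlockCount = 0;
>     int64_t CallSiteHeight = 0;
>     int64_t ModuleNodeCount = 0;
>   };
>
>   // Stand-in for the model: a fixed-size input buffer and a fixed amount
>   // of computation. In 'Rel' mode the real evaluation is the AOT-compiled
>   // saved model; the threshold below is just a placeholder.
>   class InlineModelRunner {
>   public:
>     void setFeature(unsigned Index, int64_t Value) { Buffer[Index] = Value; }
>     bool evaluate() const {
>       return Buffer[0] < 10 && Buffer[2] < 100000; // placeholder policy
>     }
>
>   private:
>     std::array<int64_t, 3> Buffer{};
>   };
>
>   // Feature extraction followed by inference, yielding the boolean decision.
>   bool shouldInline(InlineModelRunner &Runner, const CallSiteFeatures &F) {
>     Runner.setFeature(0, F.CalleeBasicBlockCount);
>     Runner.setFeature(1, F.CallSiteHeight);
>     Runner.setFeature(2, F.ModuleNodeCount);
>     return Runner.evaluate();
>   }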
>
> Training/development is more complicated, and happens offline - akin to
> how, today, attempts to improve an optimizing pass also happen offline. A
> description of the high level design and the specifics we used for the
> current scope is given in the Appendix.
> Current Scope
>
> The goal of our first milestone was to evaluate end to end an integration
> of ML with LLVM, and get a first promising result. To that end, we chose
> inlining for size (-Oz) as a stepping stone, as we perceived it to be more
> likely to require a simpler evaluation setup than performance-oriented
> optimizations might. At this point, we only train the decision of whether a
> call site should be inlined or not, leaving the SCC traversal order as-is.
>
> We are proposing an initial change demonstrating the inference mechanism
> using a pre-trained model, as a build-time opt-in to LLVM. The compiler
> components needed to perform training are also included in this first
> change. Subsequent changes would include more training-related components.
>
> At a high level, the changes we are proposing consist of:
>
> 1. a new module analysis pass, InliningAdvisor. By default, its
>    implementation does nothing.
> 2. minimal hooks into Inliner.cpp (a sketch of the shape of the hook
>    follows this list).
> 3. the implementation of InliningAdvisor, activated when we opt in to ML.
>    This is available in Analysis/ML, together with:
>    1. Rel/Dev specific ML model handling, also under Analysis/ML
>    2. a pre-trained model for inlining for size
>       (Analysis/ML/models/inlining)
>    3. a pre-trained model for predicting native size from IR
>       (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode only.
> 4. some refactorings in PassBuilder, to allow opting into running mandatory
>    inlining first - some compilation speedup for the ML case, with a
>    minimal, noise-like size effect. This also simplifies testing (these
>    would be introduced as a preliminary patch).
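>
> As referenced in item 2 above, here is a minimal, self-contained sketch of
> the shape of the hook. The names (InliningAdvice, getAdvice) are
> hypothetical, not the actual interfaces in the patch; the point is that
> when the advisor is not enabled (the default build), the decision falls
> back to the existing heuristic unchanged:
>
>   #include <optional>
>
>   // Hypothetical advice produced per call site by the advisor analysis.
>   struct InliningAdvice {
>     bool Recommend; // the model's inline / don't-inline answer
>   };
>
>   // Hypothetical advisor: disabled by default, so it hands out no advice
>   // and the inliner behaves exactly as it does today.
>   struct InliningAdvisor {
>     bool Enabled = false;
>     std::optional<InliningAdvice> getAdvice() const {
>       if (!Enabled)
>         return std::nullopt;
>       return InliningAdvice{/*Recommend=*/true}; // placeholder decision
>     }
>   };
>
>   // The hook inside the inliner reduces to a small branch.
>   bool shouldAttemptInlining(const InliningAdvisor &Advisor,
>                              bool HeuristicDecision) {
>     if (auto Advice = Advisor.getAdvice())
>       return Advice->Recommend; // ML policy path
>     return HeuristicDecision;   // default: today's InlineCost heuristic
>   }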
>
> We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it is more
> involved.
> Inference Opt-In Mechanism
>
> The feature is primarily controlled by the cmake flag
> LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different dependencies. The
> “Rel”ease case requires specifying the location of the pip tensorflow
> package (currently, that’s tf_nightly, and it should soon be available in
> tensorflow)
>
> To opt in to the ‘Rel’ case:
>
> 1. install the tensorflow pip package:
>
>    pip3 install tf_nightly --user
>
> 2. configure the llvm build:
>
>    cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
>      -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
>      {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if needed - it’s
>       usually in the path>}
>
> 3. build llvm as usual.
>
> 4. pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first to
>    clang.
>
> Details
>
> The ML model is captured as a TensorFlow ‘saved model’. When building llvm,
> we use TensorFlow’s XLA native compiler (saved_model_cli) to compile the
> saved model into a native static library and a header file. Insofar as LLVM
> is concerned, there are minimal additional runtime requirements, packaged
> as source within the pip package: C++ wrappers around the compiled model.
> These will also be statically linked in the LLVM target. The compiled code
> is otherwise just a complex arithmetical computation, with no special
> requirements - it is single threaded and runs natively on the targeted
> architecture. Together with the aforementioned runtime dependencies, it
> adds ~115KB to the clang binary (a 0.08% increase).
>
> Runtime-wise, we observed a ~10% increase in the time spent in the inliner,
> for a large (33MB) binary IR module; inlining typically consumes ~10-15% of
> total compilation time, so the overall compile time overhead of the
> approach is arguably negligible. This cost is almost entirely attributable
> to feature extraction.
>
> Memory-wise, the precompiled model has a fixed size buffer for its inputs,
> and performs a fixed amount of computations, so the memory overhead
> inherent to our approach is independent from the program being optimized.
> Using a small example to avoid effects such as memory use differences due
> to different inlinings, we observed a 300KB increase in the maximum
> resident size.
>
> A table showing the effect on -Oz compiled binaries’ size is given in the
> Appendix.
> Next directions
>
> Our next milestone has two main high level goals: detailing a systematic
> approach to driving policy effectiveness; and exploring in depth the type
> of features and training algorithms most appropriate for compiler problems,
> or at least problems like inlining. For the latter, we expect embedding
> more of the call graph structure to play an important role, as well as,
> potentially, delegating the problem space traversal to the ML model.
>
> We plan to include inlining for speed as part of our work on these goals.
> Appendix
> Training - High Level
>
> We use Reinforcement Learning (RL) to train the inline-for-size model. At a
> high level, it is composed of 3 parts: training data collection, model
> training, and iterative data collection/model training. We use TensorFlow
> as our ML framework.
>
> Related, we also needed to learn a separate model to evaluate the native
> size of a function, given its IR, in order to calculate a more precise
> reward for the reinforcement learning algorithm (“IR2Native”). We evaluated
> ‘just counting IR’ and TargetTransformInfo, but they appeared to provide
> too noisy of a signal for the reward, insofar as the RL training algorithm
> for the inlining model was concerned. This model is only used during
> training.
>
> RL - Training data collection: the training data we need to feed into a
> reinforcement learning algorithm are sequences consisting of: the state of
> the problem (i.e. features), the action (inline/not inline), and the reward
> (native size shrinkage after inline/not inline, using IR2Native). To
> collect the sequences, we hook logging infrastructure into the LLVM
> inliner, which then produces logs after the inline optimization pass.
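>
> As an illustration of what one element of such a sequence carries, a small
> self-contained sketch (field names are made up; the actual log schema is
> defined by the logging code in the patch):
>
>   #include <cstdint>
>   #include <vector>
>
>   // One step of the logged trajectory fed to the RL algorithm.
>   struct InlineTrainingRecord {
>     std::vector<int64_t> Features; // state: feature vector for the call site
>     bool ActionInline;             // action: inline / not inline
>     double Reward;                 // native size shrinkage, via IR2Native
>   };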
>
> RL - Model training: We use DQN (Deep Q-Network) to train our
> inlining-for-size ML policy. At a high level, the DQN algorithm trains a
> neural network to predict the value of different actions --- the DQN policy
> then chooses to take the action with the highest predicted value. In our
> scenario, we have two actions: 1) inline; 2) not inline, so the neural
> network predicts the size reduction of these two actions based on features,
> and then decides to conduct inlining if the neural network believes doing
> inlining leads to a higher size reduction. After the training finishes, it
> produces a TensorFlow SavedModel that takes features as input and generates
> inline decisions (whether to inline or not) as output.
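>
> In other words, at decision time the policy reduces to comparing the two
> predicted values - a tiny self-contained illustration (the values would
> come from the trained network; here they are just inputs):
>
>   // Predicted value (expected size reduction) of each action.
>   struct QValues {
>     double Inline;
>     double DontInline;
>   };
>
>   // DQN-style greedy action selection: take the action with the highest
>   // predicted value; true means "inline".
>   bool choosePolicyAction(const QValues &Q) {
>     return Q.Inline > Q.DontInline;
>   }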
>
> The choice of the features and reward is essential in model training. The
> features are chosen with the consideration of being helpful in making the
> decision --- the input to the cost model is a good starting point. Ideally,
> the reward in the inline-for-size problem is the native size shrinkage
> after inline/not inline. It is difficult to obtain precisely, so we use an
> estimate as an alternative. This means that, for training, we also need a
> model (not necessarily machine learned) for estimating rewards.
>
> RL - Iterative data collection/model training: Reinforcement learning is
> ideally an iterative model/policy improvement process in which: 1) the
> trained model is deployed to the field to collect new data; 2) newly
> collected data are used to update the model. Thus, we need to do iterative
> data collection/model training. To facilitate data collection (avoiding
> complex build dependencies and time spent before/after the pass being
> trained), we isolate out IR modules captured right before the optimization
> we are interested in, and iterate on them with opt. We bootstrap the
> training from the current heuristic, and stop the process once we are happy
> with the outcome.
>
> IR2Native: We collect IR features (different from the ones used for
> inlining) for each function at the end of inlining, right before we perform
> function simplification passes, and right after. This means we have two IR
> ‘shapes’ of the same function, and we know no further inlinings will be
> performed, so whatever changes happen are based on that IR. We then extract
> the native size at the end of compilation. Together, this data forms two
> records per function that can be used in supervised learning - the features
> are those extracted from IR, and the label is the native size. Training
> IR2Native happens independently from the training of the inliner model.
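>
> For illustration, the shape of one such supervised-learning record (field
> names are made up):
>
>   #include <cstdint>
>   #include <vector>
>
>   // Each function contributes two of these: one from the IR right before
>   // function simplification and one from right after, both labeled with
>   // the function's final native size.
>   struct IR2NativeExample {
>     std::vector<int64_t> IRFeatures; // features extracted from the IR
>     uint64_t NativeSizeBytes;        // label: measured native size
>   };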
> Training support for the current scope
>
> The initial change includes the logging mechanism, an ir2native model
> trained for x86-64, and the means to rapidly iterate over development ML
> models. The rest of this section describes the mechanisms we employed for
> the components that will be included in subsequent changes.
>
> To build LLVM with the ML policy in ‘Dev’ mode, we need the tensorflow C
> API library <https://www.tensorflow.org/install/lang_c>. We then configure
> the build:
>
>   cmake .. -DLLVM_USE_ML_POLICY=Dev \
>     -DLLVM_TF_C_LIB=<path to unarchived package> \
>     {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow shared
>      library over, if it’s not on LD_LIBRARY_PATH}
>
> To extract IR right before inlining, we hacked on top of the ThinLTO
> infrastructure, interrupting its pre-link pipeline right before inlining.
> This means clang produces binary IR files captured at that stage. We then
> built a large target, obtaining a corpus of ~25K modules. We intend to
> provide a clean mechanism in a subsequent change.
>
> To obtain features/labels for training this “IR to Native Size” model, we
> had to make some changes to the AsmPrinter (to get real native sizes) and
> llvm-readobj, as well as add an analysis pass for extracting the IR
> features for this model. We plan on upstreaming these changes subsequently.
>
> Finally, the infrastructure driving the policy training is currently built
> on proprietary APIs, as it benefits from a distributed computing
> infrastructure. We are currently evaluating options for open sourcing it.
> In the meantime, we are presenting the high level implementation details.
>
> To train a new model, the infrastructure performs 2 steps: extracting the
> logs, and using them in a training algorithm.
>
> Log extraction is highly parallelizable: for each IR module in the training
> corpus, we need to run opt once (or a few times, when we explore
> improvements). Specifically, each run is this invocation:
>
>   opt -passes=scc-oz-module-inliner \
>     -ml-inliner-ir2native-model=<path to ir2native> \
>     -training-log=<path to training log output> \
>     -enable-ml-inliner -mandatory-inlinings-first -o <output> <module.o>
>
> Then collect the logs, and pass them to the next step.
>
>
> Experimental results
>
> Experimental results are available as follows:
>
> - SPEC2006 binary sizes (-Oz) and ‘test run’ scores:
>   <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
> - Size report from some internal benchmarks and binaries, including opt
>   and clang:
>   <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev