[llvm-dev] RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM

Mircea Trofin via llvm-dev llvm-dev at lists.llvm.org
Sun Apr 12 09:30:01 PDT 2020


As suggested, I froze the shared doc, and pasted the comments made there
below.

On Thu, Apr 9, 2020 at 10:25 AM Xinliang David Li <xinliangli at gmail.com>
wrote:

>
>
> On Thu, Apr 9, 2020 at 10:01 AM Mircea Trofin <mtrofin at google.com> wrote:
>
>> Sorry, I wasn't aware of that.
>>
>> I can make the google doc view-only, keeping the current comments.
>>
>
>
> Copy the discussion in the doc here ? :)
>
>
>> I'll wait a bit (few hrs) to see if there's any pushback to that.
>>
>> On Thu, Apr 9, 2020 at 9:57 AM Xinliang David Li <xinliangli at gmail.com>
>> wrote:
>>
>>> One suggestion : should we consolidate the discussion into the main
>>> thread? I know some folks are not willing to comment in Google docs.
>>>
>>> David
>>>
>>> On Wed, Apr 8, 2020 at 7:06 PM Mircea Trofin via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> +Yundi Qian <yundi at google.com> +Eugene Brevdo <ebrevdo at google.com> ,
>>>> our team members from the ML side.
>>>>
>>>> To avoid formatting issues, here is a link to the RFC
>>>> <https://docs.google.com/document/d/1BoSGQlmgAh-yUZMn4sCDoWuY6KWed2tV58P4_472mDE/edit?usp=sharing>,
>>>> open to comments.
>>>>
>>>> Thanks!
>>>>
>>>> On Wed, Apr 8, 2020 at 2:34 PM Mircea Trofin <mtrofin at google.com>
>>>> wrote:
>>>>
>>>>> Unfortunately I needed to update the links. Here they are, hopefully
>>>>> these work correctly.
>>>>>
>>>>> SPEC2006
>>>>> <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=987260531&single=true>
>>>>>
>>>>> Internal benchmarks, clang, opt
>>>>> <https://docs.google.com/spreadsheets/d/e/2PACX-1vQNAcTDfyQvh6Jq7IRdCvK_fuluUFrzrsGL_75Ile29hX3caBSfT6_jHulxeCJ5MXIHp5SB--A_goEi/pubhtml?gid=0&single=true>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Wed, Apr 8, 2020 at 2:23 PM Mircea Trofin <mtrofin at google.com>
>>>>> wrote:
>>>>>
>>>>>> It turns out it's me, sorry. Let me see how I can sort this out. In
>>>>>> the meantime, here is the csv:
>>>>>>
>>>>>> SPEC2006 data:
>>>>>>
>>>>>> binary,base -Oz size,ML -Oz size,ML size shrink by,,perf: base -Oz scores,perf: ML -Oz scores,ML improvement by
>>>>>> 400.perlbench,2054200,2086776,-1.59%,,2.9,2.9,0.00%
>>>>>> 401.bzip2,1129976,1095544,3.05%,,6.4,6.2,-3.13%
>>>>>> 403.gcc,4078488,4130840,-1.28%,,11.6,11.7,0.86%
>>>>>> 429.mcf,1089368,1054552,3.20%,,2.3,2.4,4.35%
>>>>>> 433.milc,1161336,1129592,2.73%,,0.5,0.5,0.00%
>>>>>> 444.namd,1324824,1290968,2.56%,,4.4,4.3,-2.27%
>>>>>> 445.gobmk,5003096,4992472,0.21%,,3.9,4.1,5.13%
>>>>>> 447.dealII,1003376,975024,2.83%,,162.1,197.8,22.02%
>>>>>> 450.soplex,1359416,1326008,2.46%,,0.1,0.1,0.00%
>>>>>> 453.povray,1921432,1952280,-1.61%,,32.9,32.7,-0.61%
>>>>>> 456.hmmer,1210744,1184632,2.16%,,1.5,1.5,0.00%
>>>>>> 458.sjeng,1185976,1155320,2.58%,,812.9,840.2,3.36%
>>>>>> 462.libquantum,1101144,1066712,3.13%,,12.3,12.2,-0.81%
>>>>>> 464.h264ref,1557176,1593272,-2.32%,,0.3,0.3,0.00%
>>>>>> 470.lbm,1091352,1056664,3.18%,,2.3,2.5,8.70%
>>>>>> 471.omnetpp,1045496,1046664,-0.11%,,27.9,28.4,1.79%
>>>>>> 473.astar,1106680,1071992,3.13%,,2.4,2.6,8.33%
>>>>>> 482.sphinx3,1216600,1182680,2.79%,,16.5,16.3,-1.21%
>>>>>> 483.xalancbmk,4666936,4669112,-0.05%,,9.7,9.6,-1.03%
>>>>>> ,,,,,,,
>>>>>> SIZE SUM,34307616,34061104,0.72%,TOTAL SCORES,5.7,5.8,1.75%
>>>>>>
>>>>>>
>>>>>> various benchmarks:
>>>>>>
>>>>>> binary,base -Oz size,ML -Oz size,ML shrink over base
>>>>>> llvm:opt,66318624,62124256,6.32%
>>>>>> math_1,6181552,5884144,4.81%
>>>>>> event_management_1,4998728,4802696,3.92%
>>>>>> llvm:lcalsBRaw,1270168,1222168,3.78%
>>>>>> llvm:lcalsBLambda,1270232,1222232,3.78%
>>>>>> llvm:lcalsARaw,1276248,1228248,3.76%
>>>>>> llvm:lcalsALambda,1276440,1228504,3.76%
>>>>>> llvm:lcalsCRaw,1278936,1231000,3.75%
>>>>>> llvm:lcalsCLambda,1279064,1231256,3.74%
>>>>>> llvm:Blur,1236888,1191192,3.69%
>>>>>> image_processing_diffusion,1236120,1190488,3.69%
>>>>>> llvm:Interpolation,1237208,1191576,3.69%
>>>>>> llvm:harris,1237208,1191768,3.67%
>>>>>> llvm:Dither,1238232,1192792,3.67%
>>>>>> event_management_2,4741096,4568584,3.64%
>>>>>> hashing_1,5189360,5004176,3.57%
>>>>>> compression_2,5329392,5140496,3.54%
>>>>>> math_2,6072336,5858992,3.51%
>>>>>> crypto_1,4721576,4556008,3.51%
>>>>>> math_4,27720208,26755568,3.48%
>>>>>> math_3,27732496,26767920,3.48%
>>>>>> infra_1,30673232,29630464,3.40%
>>>>>> hashing_2,5834544,5637168,3.38%
>>>>>> image_processing_entropy,5854960,5657008,3.38%
>>>>>> hashing_3,5831088,5634288,3.38%
>>>>>> memory_management_1,4957960,4790632,3.37%
>>>>>> hasing_4,5816048,5619888,3.37%
>>>>>> data_storage_2,5852976,5655792,3.37%
>>>>>> compression_1,5971984,5771120,3.36%
>>>>>> arena_unittest,6672944,6453584,3.29%
>>>>>> data_storage_3,44113232,42710960,3.18%
>>>>>> data_storage_1,22858992,22132960,3.18%
>>>>>> datastructures_1,6362032,6161264,3.16%
>>>>>> infra_2,56828720,55053496,3.12%
>>>>>> datastructures_2,164558704,160790320,2.29%
>>>>>> llvm:clang,77027416,75672088,1.76%
>>>>>> benchmark_infra,20221424,19944432,1.37%
>>>>>> search_1,169475360,167645632,1.08%
>>>>>> protobuf_3,24071504,23895440,0.73%
>>>>>> protobuf_1,24071504,23895504,0.73%
>>>>>> protobuf_2,24071504,23895504,0.73%
>>>>>> protobuf_4,24071504,23895504,0.73%
>>>>>> protobuf_5,24071504,23895504,0.73%
>>>>>> protobuf_6,23982800,23809424,0.72%
>>>>>>
>>>>>> On Wed, Apr 8, 2020 at 2:11 PM Johannes Doerfert <
>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Cool!
>>>>>>>
>>>>>>> I skipped to the end and tried to access the gdoc and the spreadsheet
>>>>>>> but it told me I need permissions. Can you make them accessible, or
>>>>>>> am I the problem?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>    Johannes
>>>>>>>
>>>>>>> On 4/8/20 4:04 PM, Mircea Trofin via llvm-dev wrote:
>>>>>>>  > TL;DR; We can improve compiler optimizations driven by heuristics
>>>>>>> by
>>>>>>>  > replacing those heuristics with machine-learned policies (ML
>>>>>>> models).
>>>>>>>  > Policies are trained offline and ship as part of the compiler.
>>>>>>> Determinism
>>>>>>>  > is maintained because models are fixed when the compiler is
>>>>>>> operating in
>>>>>>>  > production. Fine-tuning or regressions may be handled by
>>>>>>> incorporating the
>>>>>>>  > interesting cases in the ML training set, retraining the
>>>>>>> compiler, and
>>>>>>>  > redeploying it.
>>>>>>>  >
>>>>>>>  > For a first milestone, we chose inlining for size (-Oz) on
>>>>>>> X86-64. We
>>>>>>> were
>>>>>>>
>>>>>>
Aditya Kumar
Can we get more specific details of the device, like memory and #processors?
This will help figure out whether I can use this in -flto.

Mircea Trofin
How would those impact inlining for size - or do you mean wrt training
setup?

Aditya Kumar
I meant during the actual inlining.

Mircea Trofin
Oh, it's off-the-shelf x86_64 - see the "Inference opt-in mechanism"
section; the runtime requirements are not particularly significant (I
assume that is the concern?)

Aditya Kumar
Off the shelf is too generic, I think. I have different x86-64 machines, all
with different RAM and #processors. To compile larger binaries I only use
machines with >64G RAM and 64 cores, to get compilation time down to a few
hours. I am trying to estimate whether this optimization would work in -flto
for my use case.

Mircea Trofin
The machine I've been using for compiler runs is a Skylake machine with 36
cores and 192GB of RAM.

But just to make sure we are talking about the same thing: if you are
interested in using a trained model (e.g. the one in the patch) as-is, then
the overhead of the method is low: the code is single-threaded, the extra
memory taken by the model is a ~300KB 'fixed cost', there are no new runtime
dependencies for clang, and the execution overhead is a few percent. To
stress the point - there is no adaptive learning at runtime.

So the capacities you mentioned should be more than sufficient - one caveat,
though: we haven't tested this pre-trained model in a -flto scenario, nor did
the IR used for training come from an LTO scenario. We would be
interested in the results, of course!

If you are interested in training a model 'from scratch', the exploration
part currently runs in the cloud, as I mentioned. I think it could run on my
machine in about 1 hour - we'll have precise numbers once we port it off the
internal APIs we currently use.

The learning part after that ran on my machine in ~30 minutes.

Aditya Kumar
Thanks a lot for the explanation. Really nice to know that the overhead is
fixed!


>  > able to train an ML model to produce binaries 1.5-6% smaller than -Oz of
>>>>>>>  > tip-of-tree. The trained model appears to generalize well over a
>>>>>>> diverse
>>>>>>>  > set of binaries. Compile time is increased marginally (under 5%).
>>>>>>> The
>>>>>>> model
>>>>>>>  > also happens to produce slightly better-performing code under
>>>>>>> SPEC2006 -
>>>>>>>  > total score improvement by 1.75%. As we only wanted to verify
>>>>>>> there is no
>>>>>>>  > significant regression in SPEC, and given the milestone goals, we
>>>>>>> haven’t
>>>>>>>  > dug any deeper into the speed results.
>>>>>>>  >
>>>>>>>  > We see these results as promising, and as a reasonable point for
>>>>>>>  > contributing our current work as a build-time opt-in to LLVM to
>>>>>>> benefit the
>>>>>>>  > community, in the hope of fostering collaboration and learning
>>>>>>> from the
>>>>>>>  > community’s feedback, as we try to better understand the
>>>>>>> trade-offs
>>>>>>> such an
>>>>>>>  > approach entails, and as we work on expanding the depth and
>>>>>>> breadth of
>>>>>>>  > applying these techniques to compiler optimization problems.
>>>>>>>
>>>>>>
Johannes Doerfert
I think this is an excellent point.

I would also mention that we need to investigate how we can integrate ML
efforts into LLVM in a structured way. One very important aspect for us is
that we want to share, not duplicate, infrastructure as we adopt more use
cases. To adopt more use cases, it has to be easy to try things, which
usually happens when the infrastructure is in place and generic enough, ...

Put slightly differently:
Based on what you have in your patch, I think we need to distill generic
data structures (like in llvm/ADT) and tooling (like in llvm/Support and
llvm/XXXXX/Utils) that enable structured and easy reuse of the work you
have put in this.

I hope that one or two of the GSoC students who will look into
"ML-related" topics are interested in building infrastructure as part of
their project. Also, reach out if you want to co-mentor these students.

Aditya Kumar
+1 on integrating ML infra in a way that makes future extensions easier.

Mircea Trofin
Agreed. I'm confident that commonalities would naturally emerge once a few
scenarios are analyzed in depth.


>  >
>>>>>>>  > https://reviews.llvm.org/D77752
>>>>>>>  > Motivation
>>>>>>>  >
>>>>>>>  > Optimization problems, such as inlining or register allocation,
>>>>>>> are hard:
>>>>>>>  > we don’t know of algorithms that are guaranteed to produce the
>>>>>>> optimum
>>>>>>>  > solution in all cases in a timely fashion. Instead, we use
>>>>>>> heuristics:
>>>>>>>  > algorithms that we expect to produce some approximation of the
>>>>>>> optimum,
>>>>>>>  > with some expected degree of generality, in a practical amount of
>>>>>>> time.
>>>>>>>  >
>>>>>>>  > Heuristics have some common characteristics. Taking inlining as a
>>>>>>> case
>>>>>>>  > study, it traverses the problem space in some way (bottom-up
>>>>>>> traversal of
>>>>>>>  > the SCC graph), extracts some properties (let’s call them
>>>>>>> “features”) of
>>>>>>>  > the program being optimized, and combine them with some weights
>>>>>>> (“tune”),
>>>>>>>  > to produce a cost (InlineCost), which allows for trade-off
>>>>>>> analysis. We
>>>>>>>  > validate the effectiveness of a heuristic over some accepted set
>>>>>>> of
>>>>>>>  > benchmarks. Over time, we react to regressions or pathological
>>>>>>> cases
>>>>>>>  > observed in the field, by manually analyzing such cases, figuring
>>>>>>> out an
>>>>>>>  > enhancement to the heuristic, and then re-validating over that
>>>>>>> set of
>>>>>>>  > benchmarks (maybe augmented by adding the newly found cases).
>>>>>>>  >
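To make the shape of such a hand-written heuristic concrete, here is a
minimal C++ sketch; the features, weights, and threshold are purely
illustrative and are not LLVM's actual InlineCost logic:

  // Illustrative only: hand-picked features, hand-tuned weights, and a
  // threshold validated against a benchmark set.
  struct CallSiteFeatures {
    unsigned CalleeInstructionCount; // extracted from IR ("features")
    unsigned CalleeNumUses;          // how many call sites reference the callee
    bool CalleeHasLocalLinkage;      // callee may be deletable after inlining
  };

  bool shouldInlineForSize(const CallSiteFeatures &F) {
    int Cost = static_cast<int>(F.CalleeInstructionCount); // "tune": weight = 1
    if (F.CalleeHasLocalLinkage && F.CalleeNumUses == 1)
      Cost -= 100; // the callee body can likely be dropped after inlining
    const int Threshold = 25; // hand-tuned against the accepted benchmarks
    return Cost < Threshold;  // trade-off analysis: inline iff cheap enough
  }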
>>>>>>>  > Because heuristics are code that needs to be maintained, there is
>>>>>>> pressure
>>>>>>>  > to reduce complexity: adding more features means we need to
>>>>>>> reason about
>>>>>>>  > the interactions between the old and new features, which scales
>>>>>>>  > combinatorially. Re-tuning because of the new features adds a
>>>>>>> similar
>>>>>>> kind
>>>>>>>  > of complexity. Potentially, we miss out on optimization
>>>>>>> improvements as a
>>>>>>>  > result.
>>>>>>>  >
>>>>>>>  > Because tuning is manual, there is pressure to keep the number of
>>>>>>>  > benchmarks that can be studied in depth to a humanly-manageable
>>>>>>> size,
>>>>>>> which
>>>>>>>  > potentially affects the generality of a heuristic or heuristic
>>>>>>> tuning.
>>>>>>>  >
>>>>>>>  > The main advantage of manual heuristics is arguably their
>>>>>>> relatively
>>>>>>> lower
>>>>>>>  > overhead: no additional dependencies and more transparent to
>>>>>>> human
>>>>>>> analysis
>>>>>>>  > and improvement.
>>>>>>>  >
>>>>>>>  > Machine learning, in particular reinforcement learning, can
>>>>>>> address the
>>>>>>>  > tensions found in manual heuristics: once features are extracted
>>>>>>> from the
>>>>>>>  > program, the way they are combined and tuned can easily be scaled
>>>>>>> up
>>>>>>>  > through automation, improving effectiveness and generality. A
>>>>>>> major
>>>>>>>  > drawback, at least at this point in time, of machine learning, is
>>>>>>> that we
>>>>>>>  > don’t yet have a fully developed systematic approach for
>>>>>>> improving policy
>>>>>>>  > effectiveness.
>>>>>>>  > High level design
>>>>>>>  >
>>>>>>>  > We identify two scenarios for a compiler using ML policies:
>>>>>>> development and
>>>>>>>  > release.
>>>>>>>  >
>>>>>>>  > The release scenario is equivalent to the regular compilation we
>>>>>>> have
>>>>>>> today
>>>>>>>  > - the only difference is that it uses a pre-trained model
>>>>>>> (trained in the
>>>>>>>  > development scenario beforehand) to make decisions instead of the
>>>>>>>  > heuristics. Determinism is guaranteed since the model in the
>>>>>>> release
>>>>>>>  > scenario is fixed. We imagine teams wishing to fine tune the
>>>>>>> effectiveness
>>>>>>>  > of the optimization to their scenarios would train a different
>>>>>>> model.
>>>>>>>  >
>>>>>>>  > The decision previously evaluated using a human-crafted heuristic
>>>>>>> is
>>>>>>>  > optionally replaced by:
>>>>>>>  >
>>>>>>>  >    - a compiler-specific component, extracting features from IR
>>>>>>>  >      (i.e. a vector of values)
>>>>>>>  >    - an evaluation of an ML model using those features, to obtain a
>>>>>>>  >      result. In ML nomenclature, this is referred to as using the
>>>>>>>  >      model for inference (as opposed to training it)
>>>>>>>  >
>>>>>>>  > For example, when we replaced the decision of whether to inline a
>>>>>>> callsite,
>>>>>>>  > the ML model produces a boolean (inline/don’t inline) based on a
>>>>>>> features
>>>>>>>  > vector characterizing the call site and some broader module-wide
>>>>>>> context.
>>>>>>>  >
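As a concrete illustration of this flow (feature extraction followed by model
inference), here is a minimal C++ sketch. The MLInlineModel wrapper and the
particular features below are hypothetical; the real advisor interface and
feature set are the ones proposed in D77752:

  // A minimal sketch, assuming a hypothetical MLInlineModel wrapper around
  // the compiled model; feature extraction uses standard LLVM IR APIs.
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"
  #include <cstdint>
  #include <vector>

  struct MLInlineModel { // hypothetical wrapper over the AOT-compiled model
    bool evaluate(const std::vector<int64_t> &Features) const;
  };

  bool adviseInline(const llvm::CallBase &CB, const MLInlineModel &Model) {
    const llvm::Function *Callee = CB.getCalledFunction();
    const llvm::Function *Caller = CB.getCaller();
    if (!Callee || Callee->isDeclaration())
      return false; // nothing to inline
    // "Features": a fixed-size vector of values describing the call site and
    // some broader context (example values; the real set is in the patch).
    std::vector<int64_t> Features = {
        static_cast<int64_t>(Callee->getInstructionCount()),
        static_cast<int64_t>(Callee->size()), // basic block count
        static_cast<int64_t>(Caller->getInstructionCount()),
        static_cast<int64_t>(CB.arg_size())};
    // Inference: the model replaces the hand-written cost/threshold check.
    return Model.evaluate(Features);
  }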
>>>>>>>  > Training/development is more complicated, and happens offline -
>>>>>>> akin to
>>>>>>>  > how, today, attempts to improve an optimizing pass also happen
>>>>>>> offline. A
>>>>>>>  > description of the high level design and the specifics we used
>>>>>>> for the
>>>>>>>  > current scope are given in Appendix.
>>>>>>>  > Current Scope
>>>>>>>  >
>>>>>>>  > The goal of our first milestone was to evaluate end to end an
>>>>>>> integration
>>>>>>>  > of ML with LLVM, and get a first promising result. To that end,
>>>>>>> we chose
>>>>>>>  > inlining for size (-Oz) as a stepping stone, as we perceived it
>>>>>>> to be
>>>>>>> more
>>>>>>>  > likely to require a simpler evaluation setup than
>>>>>>> performance-oriented
>>>>>>>  > optimizations might. At this point, we only train whether a call
>>>>>>> site may
>>>>>>>  > be inlined or not, leaving the SCC traversal order as-is.
>>>>>>>  >
>>>>>>>  > We are proposing an initial change demonstrating the inference
>>>>>>> mechanism
>>>>>>>  > using a pre-trained model, as a build-time opt-in to llvm. The
>>>>>>> compiler
>>>>>>>  > components needed to perform training are also included in this
>>>>>>> first
>>>>>>>  > change. Subsequent changes would include more training-related
>>>>>>> components.
>>>>>>>  >
>>>>>>>  > At a high level, the changes we are proposing consist of:
>>>>>>>  >
>>>>>>>  >    1. a new module analysis pass, InliningAdvisor. By default, its
>>>>>>>  >       implementation does nothing.
>>>>>>>  >    2. minimal hooks into Inliner.cpp.
>>>>>>>  >    3. the implementation of InliningAdvisor, activated when we opt
>>>>>>>  >       in to ML. This is available in Analysis/ML, together with:
>>>>>>>  >       1. Rel/Dev-specific ML model handling, also under Analysis/ML
>>>>>>>  >       2. a pre-trained model for inlining for size
>>>>>>>  >          (Analysis/ML/models/inlining)
>>>>>>>  >       3. a pre-trained model for predicting native size from IR
>>>>>>>  >          (Analysis/ML/models/ir_2_native_x86_64), used in Dev mode
>>>>>>>  >          only.
>>>>>>>  >    4. Some refactorings in PassBuilder, to allow opting into running
>>>>>>>  >       mandatory inlining first - some compilation speedup for the ML
>>>>>>>  >       case, minimal, noise-like size effect. Also simplifies testing
>>>>>>>  >       (these would be introduced as a preliminary patch).
>>>>>>>  >
>>>>>>>  > We discuss ‘rel’ mode here, and ‘dev’ mode in the Appendix, as it
>>>>>>> is more
>>>>>>>  > involved.
>>>>>>>  > Inference Opt-In Mechanism
>>>>>>>  >
>>>>>>>  > The feature is primarily controlled by the cmake flag
>>>>>>>  > LLVM_USE_ML_POLICY={“Rel”|”Dev”}. Each has different
>>>>>>> dependencies. The
>>>>>>>  > “Rel”ease case requires specifying the location of the pip
>>>>>>> tensorflow
>>>>>>>  > package (currently, that’s tf_nightly, and it should soon be
>>>>>>> available in
>>>>>>>  > tensorflow)
>>>>>>>  >
>>>>>>>  > To opt in to the ‘Rel’ case:
>>>>>>>  >
>>>>>>>  >    1. install the tensorflow pip package:
>>>>>>>  >
>>>>>>>  >       pip3 install tf_nightly --user
>>>>>>>  >
>>>>>>>  >    2. configure the llvm build:
>>>>>>>  >
>>>>>>>  >       cmake ../llvm -DLLVM_USE_ML_POLICY=Rel \
>>>>>>>  >         -DLLVM_TF_AOT_RUNTIME=~/.local/lib/python3.7/site-packages/tensorflow \
>>>>>>>  >         {-DLLVM_TF_AOT_COMPILER=<path to saved_model_cli tool, if
>>>>>>>  >          needed - it’s usually in the path>}
>>>>>>>  >
>>>>>>>  >    3. build llvm as usual.
>>>>>>>  >
>>>>>>>  >    4. pass -mllvm -enable-ml-inliner -mllvm -mandatory-inlinings-first
>>>>>>>  >       to clang.
>>>>>>>  >
>>>>>>>  > Details
>>>>>>>  >
>>>>>>>  > The ML model is captured as a TensorFlow ‘saved model’. When
>>>>>>> building
>>>>>>> llvm,
>>>>>>>  > we use TensorFlow’s  XLA native compiler (saved_model_cli) to
>>>>>>> compile the
>>>>>>>  > saved model into a native static library and a header file.
>>>>>>> Insofar
>>>>>>> as LLVM
>>>>>>>  > is concerned, there are minimal additional runtime requirements,
>>>>>>> packaged
>>>>>>>  > as source within the pip package: C++ wrappers around the
>>>>>>> compiled model.
>>>>>>>  > These will also be statically linked in the LLVM target. The
>>>>>>> compiled
>>>>>>> code
>>>>>>>  > is otherwise just a complex arithmetical computation, with no
>>>>>>> special
>>>>>>>  > requirements - it is single threaded and runs natively on the
>>>>>>> targeted
>>>>>>>  > architecture. Together with the aforementioned runtime
>>>>>>> dependencies, it
>>>>>>>  > adds ~115KB to the clang binary (0.08% increase).
>>>>>>>  >
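For illustration, driving such an AOT-compiled model typically looks like the
sketch below; the header, class name, and accessors are placeholders for
whatever the code generated for a particular saved model exposes, not the
exact API added by the patch:

  // Sketch only: the generated code owns fixed-size input/output buffers and
  // needs no TensorFlow runtime - just the thin statically linked wrappers.
  #include "InlineForSizeModel.h" // hypothetical generated header
  #include <algorithm>
  #include <cstddef>
  #include <cstdint>

  bool runCompiledModel(const int64_t *Features, std::size_t NumFeatures) {
    InlineForSizeModel M;                        // hypothetical generated class
    std::copy(Features, Features + NumFeatures,
              M.arg0_data());                    // fill the input tensor buffer
    M.Run();                                     // single-threaded arithmetic
    return M.result0_data()[0] > 0;              // inline / don't inline
  }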
>>>>>>>  > Runtime-wise, we observed a ~10% increase in the time spent in
>>>>>>> the
>>>>>>> inliner,
>>>>>>>  > for a large (33MB) binary IR module; inlining typically consumes
>>>>>>> ~10-15% of
>>>>>>>  > total compilation time, so the overall compile time overhead of
>>>>>>> the
>>>>>>>  > approach is arguably negligible. This cost is almost entirely
>>>>>>>  > attributable to feature extraction.
>>>>>>>  >
>>>>>>>  > Memory-wise, the precompiled model has a fixed size buffer for
>>>>>>> its
>>>>>>> inputs,
>>>>>>>  > and performs a fixed amount of computations, so the memory
>>>>>>> overhead
>>>>>>>  > inherent to our approach is independent from the program being
>>>>>>> optimized.
>>>>>>>  > Using a small example to avoid effects such as memory use
>>>>>>> differences due
>>>>>>>  > to different inlinings, we observed a 300KB increase in the
>>>>>>> maximum
>>>>>>>  > resident size.
>>>>>>>
>>>>>>
Aditya Kumar
What was the original resident set?

Mircea Trofin
~46MB (maximum) - but, as mentioned, the overhead depends on the internal ML
structure (fixed) and the ML input size (i.e. the tensor containing features -
also of fixed size). So the 300KB is a 'fixed cost' and shouldn't vary from
program to program.

That being said, there would be other memory utilization variations,
depending on the impact of inlining decisions (i.e. larger/smaller IR).


>  >
>>>>>>>  > A table showing effect on -Oz compiled binaries’ size is given in
>>>>>>> Appendix.
>>>>>>>  > Next directions
>>>>>>>  >
>>>>>>>  > Our next milestone has two main high level goals: detailing a
>>>>>>> systematic
>>>>>>>  > approach to driving policy effectiveness; and exploring in depth
>>>>>>> the type
>>>>>>>  > of features and training algorithms most appropriate for compiler
>>>>>>> problems,
>>>>>>>  > or at least problems like inlining. For the latter, we expect
>>>>>>> embedding
>>>>>>>  > more of the call graph structure to play an important role, as
>>>>>>> well as,
>>>>>>>  > potentially, delegating the problem space traversal to the ML
>>>>>>> model.
>>>>>>>  >
>>>>>>>  > We plan to include inlining for speed as part of our work on
>>>>>>> these goals.
>>>>>>>  > Appendix
>>>>>>>  > Training - High Level
>>>>>>>  >
>>>>>>>  > We use Reinforcement Learning (RL) to train the Inline-for-size
>>>>>>> model. At a
>>>>>>>  > high level, it is composed of 3 parts: training data collection,
>>>>>>> model
>>>>>>>  > training, and iterative data collection/model training. We use
>>>>>>> TensorFlow
>>>>>>>  > as our ML framework.
>>>>>>>  >
>>>>>>>  > Related, we also needed to learn a separate model to evaluate the
>>>>>>> native
>>>>>>>  > size of a function, given its IR, in order to calculate a more
>>>>>>> precise
>>>>>>>  > reward for the reinforcement learning algorithm (“IR2Native”). We
>>>>>>> evaluated
>>>>>>>  > ‘just counting IR’ and TargetTransformInfo, but they appeared to
>>>>>>> provide
>>>>>>>  > too noisy of a signal for the reward, insofar as the RL training
>>>>>>> algorithm
>>>>>>>  > for the inlining model was concerned. This model is only used
>>>>>>> during
>>>>>>>  > training.
>>>>>>>  >
>>>>>>>  > RL - Training data collection: the training data we need to feed
>>>>>>> into a
>>>>>>>  > reinforcement learning algorithm are sequences consisting of:
>>>>>>> state
>>>>>>> of the
>>>>>>>  > problem (i.e. features); action (inline/not inline), and reward
>>>>>>> (native
>>>>>>>  > size shrinkage after inline/not inline, using ir2native). To
>>>>>>> collect the
>>>>>>>  > sequences, we hook the logging infrastructure into LLVM Inliner
>>>>>>> that is
>>>>>>>  > able to produce logs after the inline optimization pass.
>>>>>>>  >
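A single logged decision can be pictured as the record below; the field names
are hypothetical, and the actual log format ships with the training-related
components:

  // Hypothetical shape of one training-log record.
  #include <cstdint>
  #include <vector>

  struct InlineDecisionRecord {
    std::vector<int64_t> Features; // state: feature vector at the call site
    bool Inlined;                  // action: inline / don't inline
    int64_t Reward;                // native size shrinkage, via ir2native
  };
  // A training log is a sequence of such records, one per inlining decision.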
>>>>>>>  > RL - Model training: We use DQN (Deep Q-Network) to train our
>>>>>>>  > inlining-for-size ML policy. On a high level, the DQN algorithm
>>>>>>> trains a
>>>>>>>  > neural network to predict the value of different actions --- the
>>>>>>> DQN
>>>>>>> policy
>>>>>>>  > then chooses to take the action with the highest predicted value.
>>>>>>> In our
>>>>>>>  > scenario, we have two actions: 1) inline; 2) not inline, so the
>>>>>>> neural
>>>>>>>  > network predicts the size reduction of these two actions based on
>>>>>>> features,
>>>>>>>  > and then decides to conduct inlining if the neural network
>>>>>>> believes doing
>>>>>>>  > inlining leads to higher size reduction. After the training
>>>>>>> finishes, it
>>>>>>>  > produces a TensorFlow SavedModel that takes features as input and
>>>>>>> generates
>>>>>>>  > inline decisions (whether to inline or not) as output.
>>>>>>>  >
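At decision time, the DQN policy therefore reduces to comparing the predicted
values of the two actions, roughly as in this sketch; the two prediction
functions stand in for the trained network and are purely illustrative:

  #include <cstdint>
  #include <vector>

  // Hypothetical stand-ins for the Q-network: predicted size reduction
  // ("value") of each action, given the current features.
  double predictSizeReductionIfInlined(const std::vector<int64_t> &Features);
  double predictSizeReductionIfNotInlined(const std::vector<int64_t> &Features);

  bool dqnShouldInline(const std::vector<int64_t> &Features) {
    // Take the action with the higher predicted value.
    return predictSizeReductionIfInlined(Features) >
           predictSizeReductionIfNotInlined(Features);
  }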
>>>>>>>  > The choice of the features and reward are essential in model
>>>>>>> training. The
>>>>>>>  > features are chosen with the consideration of being helpful in
>>>>>>> making the
>>>>>>>  > decision --- the input to the cost model is a good starting
>>>>>>> point.
>>>>>>> Ideally,
>>>>>>>  > the reward in the inline-for-size problem is the native size
>>>>>>> shrinkage
>>>>>>>  > after inline/not inline. It is difficult to obtain precisely, so
>>>>>>> we
>>>>>>> use the
>>>>>>>  > estimate as an alternative. This means that, for training, we
>>>>>>> also need a
>>>>>>>  > model (not necessarily machine learned) for estimating rewards.
>>>>>>>  >
>>>>>>>  > RL - Iterative data collection/model training: Reinforcement
>>>>>>> learning is
>>>>>>>  > ideally an iterative model/policy improvement process that: 1)
>>>>>>> the
>>>>>>> trained
>>>>>>>  > model is deployed to the field to collect new data; 2) newly
>>>>>>> collected data
>>>>>>>  > are used to update the model. Thus, we need to do iterative data
>>>>>>>  > collection/model training. To facilitate data collection (avoid
>>>>>>> complex
>>>>>>>  > build dependencies and time spent before/after the pass being
>>>>>>> trained), we
>>>>>>>  > isolate out IR modules captured right before the optimization we
>>>>>>> are
>>>>>>>  > interested in, and iterate on them with opt. We bootstrap the
>>>>>>> training from
>>>>>>>  > the current heuristic, and stop the process once we are happy
>>>>>>> with the
>>>>>>>  > outcome.
>>>>>>>  >
>>>>>>>  > IR2Native: We collect IR features (different from the ones used
>>>>>>> for
>>>>>>>  > inlining) for each function at the end of inlining, right before
>>>>>>> we
>>>>>>> perform
>>>>>>>  > function simplification passes, and right after. This means we
>>>>>>> have
>>>>>>> two IR
>>>>>>>  > ‘shapes’ of the same function, and we know no further inlinings
>>>>>>> will be
>>>>>>>  > performed, so whatever changes happen are based on that IR. We
>>>>>>> then
>>>>>>> extract
>>>>>>>  > the native size at the end of compilation. Together, this data
>>>>>>> forms two
>>>>>>>  > records per function that can be used in supervised learning -
>>>>>>> the
>>>>>>> features
>>>>>>>  > are those extracted from IR, and the label is the native size.
>>>>>>> Training
>>>>>>>  > IR2Native happens independently from the training of the inliner
>>>>>>> model.
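Each of the two per-function records can be pictured as follows; the field
names are hypothetical, and the IR2Native feature set differs from the
inlining features:

  // Hypothetical shape of the supervised-learning data for IR2Native: two
  // records per function, both labeled with the final native size.
  #include <cstdint>
  #include <vector>

  struct IR2NativeRecord {
    std::vector<int64_t> IRFeatures; // taken right before, or right after,
                                     // the function simplification passes
    uint64_t NativeSizeBytes;        // label: measured at end of compilation
  };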
>>>>>>>  > Training support for the current scope
>>>>>>>  >
>>>>>>>  > The initial change includes the logging mechanism, an ir2native
>>>>>>> model
>>>>>>>  > trained for x86-64, and the means to rapidly iterate over
>>>>>>> development ML
>>>>>>>  > models. For the components that will be included in subsequent
>>>>>>> changes, the
>>>>>>>  > rest of this section describes the mechanisms we employed. These
>>>>>>> components
>>>>>>>  > are detailed further below.
>>>>>>>  >
>>>>>>>  > To build LLVM with the ML policy in ‘Dev’ mode, we need the
>>>>>>> tensorflow C
>>>>>>>  > API library <https://www.tensorflow.org/install/lang_c>. We then
>>>>>>> configure
>>>>>>>  > the build:
>>>>>>>  >
>>>>>>>  > cmake .. -DLLVM_USE_ML_POLICY=Dev \
>>>>>>>  >   -DLLVM_TF_C_LIB=<path to unarchived package> \
>>>>>>>  >   {-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=True, to copy tensorflow
>>>>>>>  >    shared library over, if it’s not on LD_LIBRARY_PATH}
>>>>>>>  >
>>>>>>>  >
>>>>>>>  > To extract IR right before inlining, we hacked on top of the
>>>>>>> ThinLTO
>>>>>>>  > infrastructure, interrupting its pre-link pipeline right before
>>>>>>> inlining.
>>>>>>>  > This means clang produces binary IR files captured at that stage.
>>>>>>> We then
>>>>>>>  > built a large target, obtaining a corpus of ~25K modules. We
>>>>>>> intend to
>>>>>>>  > provide a clean mechanism in a subsequent change.
>>>>>>>  >
>>>>>>>  > To obtain features/labels for training this “IR to Native Size”
>>>>>>> model, we
>>>>>>>  > had to make some changes to the AsmPrinter (to get real native
>>>>>>> sizes) and
>>>>>>>  > llvm-readobj, as well as add an analysis pass for extracting the
>>>>>>> IR
>>>>>>>  > features for this model. We plan on upstreaming these changes
>>>>>>> subsequently.
>>>>>>>  >
>>>>>>>  > Finally, the infrastructure driving the policy training is
>>>>>>> currently
>>>>>>> built
>>>>>>>  > on proprietary APIs, as it benefits from a distributed computing
>>>>>>>  > infrastructure. We are currently evaluating options for open
>>>>>>> sourcing it.
>>>>>>>  > In the meantime, we are presenting the high level implementation
>>>>>>> details.
>>>>>>>  >
>>>>>>>  > To train a new model, the infrastructure performs 2 steps:
>>>>>>> extracting the
>>>>>>>  > logs, and using them in a training algorithm.
>>>>>>>  >
>>>>>>>  > Log extraction is highly parallelizable: for each IR module in
>>>>>>> the
>>>>>>> training
>>>>>>>  > corpus, we need to run opt once (or a few times, when we explore
>>>>>>>  > improvements). Specifically, each run is this invocation:
>>>>>>>
>>>>>>
Aditya Kumar
It would be useful to know how many iterations were required to converge to
the gains we see. Maybe a graph of #training iterations vs. size
reductions would be very helpful. I'm curious because it would help to know
the minimum #iterations needed to get to 95% of the code-size reductions.

Yundi Qian
With the current feature set and reward in the inline-for-size case, it
only takes one iteration to obtain the model (we don't find more
iterations helpful in further improving model performance). We strongly
suspect that the feature set/reward definition is not good enough. One of
our next steps is to improve them so that more iterations can further help
improve the model performance.


>  >
>>>>>>>  > opt -passes=scc-oz-module-inliner \
>>>>>>>  >   -ml-inliner-ir2native-model=<path to ir2native> \
>>>>>>>  >   -training-log=<path to training log output> \
>>>>>>>  >   -enable-ml-inliner -mandatory-inlinings-first \
>>>>>>>  >   -o <output> <module.o>
>>>>>>>  >
>>>>>>>  > Then collect the logs, and pass them to the next step.
>>>>>>>  >
>>>>>>>  >
>>>>>>>  > Experimental results
>>>>>>>
>>>>>>
Aditya Kumar
Do we have an estimate of memory pressure during the compilation?

Mircea Trofin
Yes, ~300KB for the 'rel' case. See "Inference Opt-In Mechanism", the
second-to-last paragraph in that section.


>  >
>>>>>>>  > Experimental results are available as follows:
>>>>>>>  >
>>>>>>>  >
>>>>>>>  >    - SPEC2006
>>>>>>>  >      <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=1870752756&single=true>
>>>>>>>  >      binary sizes (-Oz) and ‘test run’ scores.
>>>>>>>  >    - Size report
>>>>>>>  >      <https://docs.google.com/spreadsheets/d/e/2PACX-1vQv0bAsUlgnG114zMYy_zKR6x-lTjcXVNt7VEeSwq2-pDr5oTxdASCscPRRg6L7iYLu2AVJ44G2xEkp/pubhtml?gid=935959003&single=true>
>>>>>>>  >      from some internal benchmarks and binaries, including opt and clang
>>>>>>>  >
>>>>>>>  >
>>>>>>>  > _______________________________________________
>>>>>>>  > LLVM Developers mailing list
>>>>>>>  > llvm-dev at lists.llvm.org
>>>>>>>  > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>