[llvm-dev] RFC: Comprehensive Static Instrumentation

Fri Jun 17 13:25:36 PDT 2016

On Thu, Jun 16, 2016 at 10:16 PM, Xinliang David Li via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
>
> On Thu, Jun 16, 2016 at 3:27 PM, Mehdi Amini via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi TB,
>>
>> Thanks for you answer.
>>
>> On Jun 16, 2016, at 2:50 PM, TB Schardl <neboat at mit.edu> wrote:
>>
>> Hey Mehdi,
>>
>> Thank you for your comments.  I've CC'd the CSI mailing list with your
>> comments and put my responses inline.  Please let me know any other
>> questions you have.
>>
>> Cheers,
>> TB
>>
>> On Thu, Jun 16, 2016 at 3:48 PM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Jun 16, 2016, at 9:01 AM, TB Schardl via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hey LLVM-dev,
>>>
>>> We propose to build the CSI framework to provide a comprehensive suite
>>> of compiler-inserted instrumentation hooks that dynamic-analysis tools can
>>> use to observe and investigate program runtime behavior.  Traditionally,
>>> tools based on compiler instrumentation would each separately modify the
>>> compiler to insert their own instrumentation.  In contrast, CSI inserts a
>>> standard collection of instrumentation hooks into the program-under-test.
>>> Each CSI-tool is implemented as a library that defines relevant hooks, and
>>> the remaining hooks are "nulled" out and elided during link-time
>>> optimization (LTO), resulting in instrumented runtimes on par with custom
>>> instrumentation.  CSI allows many compiler-based tools to be written as
>>> simple libraries without modifying the compiler, greatly lowering the bar
>>> for
>>> developing dynamic-analysis tools.
>>>
>>> ================
>>> Motivation
>>> ================
>>>
>>> Key to understanding and improving the behavior of any system is
>>> visibility -- the ability to know what is going on inside the system.
>>> Various dynamic-analysis tools, such as race detectors, memory checkers,
>>> cache simulators, call-graph generators, code-coverage analyzers, and
>>> performance profilers, rely on compiler instrumentation to gain visibility
>>> into the program behaviors during execution.  With this approach, the tool
>>> writer modifies the compiler to insert instrumentation code into the
>>> program-under-test so that it can execute behind the scene while the
>>> program-under-test runs.  This approach, however, means that the
>>> development of new tools requires compiler work, which many potential tool
>>> writers are ill equipped to do, and thus raises the bar for building new
>>> and innovative tools.
>>>
>>> The goal of the CSI framework is to provide comprehensive static
>>> instrumentation through the compiler, in order to simplify the task of
>>> building efficient and effective platform-independent tools.  The CSI
>>> framework allows the tool writer to easily develop analysis tools that
>>> require
>>> compiler instrumentation without needing to understand the compiler
>>> internals or modifying the compiler, which greatly lowers the bar for
>>> developing dynamic-analysis tools.
>>>
>>> ================
>>> Approach
>>> ================
>>>
>>> The CSI framework inserts instrumentation hooks at salient locations
>>> throughout the compiled code of a program-under-test, such as function
>>> entry and exit points, basic-block entry and exit points, before and after
>>> each memory operation, etc.  Tool writers can instrument a
>>> program-under-test simply by first writing a library that defines the
>>> semantics of relevant hooks
>>> and then statically linking their compiled library with the
>>> program-under-test.
>>>
>>> At first glance, this brute-force method of inserting hooks at every
>>> salient location in the program-under-test seems to be replete with
>>> overheads.  CSI overcomes these overheads through the use of
>>> link-time-optimization (LTO), which is now readily available in most major
>>> compilers, including GCC and LLVM.  Using LTO, instrumentation hooks that
>>> are not used by a particular tool can be elided, allowing the overheads of
>>> these hooks to be avoided when the
>>>
>>>
>>> I don't understand this flow: the front-end emits all the possible
>>> instrumentation but the useless calls to the runtime will be removed during
>>> the link?
>>> It means that the final binary is specialized for a given tool right?
>>> What is the advantage of generating this useless instrumentation in the
>>> first place then? I'm missing a piece here...
>>>
>>
>> Here's the idea.  When a tool user, who has a program they want to
>> instrument, compiles their program source into an object/bitcode, he can
>> turn on the CSI compile-time pass to insert instrumentation hooks (function
>> calls to instrumentation routines) throughout the IR of the program.
>> Separately, a tool writer implements a particular tool by writing a library
>> that defines the subset of instrumentation hooks she cares about.  At link
>> time, the object/bitcode of the program source is linked with the object
>> file/bitcode of the tool, resulting in a tool-instrumented executable.
>> When LTO is used at link time, unused instrumentation is elided, and
>> additional optimizations can run on the instrumented program.  (I'm happy
>> to send you a nice picture that we have of this flow, if the mailing list
>> doesn't mind.)
>>
>>
>> Ok this is roughly what I had in mind.
>>
>> I still believe it is not great to rely on LTO, and better, it is not
>> needed to achieve this result.
>>
>> For instance, I don't see why the "library" that defines the subset of
>> instrumentation hooks used by this tool can't be fed during a regular
>> compile, and the useless hook be eliminated at this point.
>> Implementation detail, but in practice, instead of feeding the library
>> itself, the "framework" that allows to generate the library for the tool
>> writer can output a "configuration file" along side the library, and this
>> configuration file is what is fed to the compiler and tells the
>> instrumentation pass which of the hooks to generate. It sounds more
>> efficient to me, and remove the dependency on LTO.
>> I imagine there is a possible drawback that I'm missing right now...
>>
>
>
> I agree that the tool does not need to depend on full LTO. What is needed
> is essentially an option or configuration such that the compiler can find
> the bit code file(s) for the hooks during compilation time. It is pretty
> much similar to how math function inlining can be done ...
>

I agree, and I would strongly prefer that the design worked like this
rather than relying on LTO.

The flag for loading bitcode already exists, and is called
-mlink-bitcode-file. Projects such as libclc already use it, I believe.

What might be useful is if CSI improved the infrastructure around
-mlink-bitcode-file to make it more convenient to produce compatible
bitcode files. libclc for example relies on a post-processing pass to
change symbol linkage, and I think that can be avoided by changing symbol
linkages as they are imported from the bitcode file.

Peter

David
>
>
>
>
>>
>>
>>
>> The final binary is specialized to a given tool.  One advantage of CSI,
>> however, is that a single set of instrumentation covers the needs of a wide
>> variety of tools, since different tools provide different implementations
>> of the same hooks.  The specialization of a binary to a given tool happens
>> at link time.
>>
>>
>>>
>>> instrumented program-under-test is run.  Furthermore, LTO can optimize a
>>> tool's instrumentation within a program using traditional compiler
>>> optimizations.  Our initial study indicates that the use of LTO does not
>>> unduly slow down the build time
>>>
>>>
>>> This is a false claim: LTO has a very large overhead, and especially is
>>> not parallel, so the more core you have the more the difference will be. We
>>> frequently observes builds that are 3 times slower. Moreover, LTO is not
>>> incremental friendly and during debug (which is very relevant with
>>> sanitizer) rebuilding involves waiting for the full link to occur again.
>>>
>>>
>> Can you please point us towards some projects where LTO incurs a 3x
>> slowdown?  We're interested in the overhead of LTO on build times, and
>> although we've found LTO to incur more overhead on parallel build times
>> than serial build times, as you mentioned, the overheads we've measured on
>> serial or parallel builds have been less than 40% (which we saw when
>> building the Apache HTTP server).
>>
>>
>> I expect this to be reproducible on most non-trivial C/C++ programs.
>> But taking clang as an example, just running `ninja clang` on OS X a
>> not-so-recent 12-cores machine takes 970s with LTO and 252s without (and I
>> believe this is without debug info...).
>> Running just `ninja` to build all of llvm/clang here would take *a lot*
>> longer with LTO, and not so much without.
>>
>> The LTO builds without assert
>>
>> Best,
>>
>> --
>> Mehdi
>>
>>
>>
>>
>>
>> We've also designed CSI such that it does not depend on LTO for
>> correctness; the program and tool will work correctly with ordinary ld.  Of
>> course, the downside of not using LTO is that instrumentation is not
>> optimized, and in particular, unused instrumentation will incur overhead.
>>
>>
>>>
>>> --
>>> Mehdi
>>>
>>> , and the LTO can indeed optimize away unused hooks.  One of our
>>> experiments with Apache HTTP server shows that, compiling with CSI and
>>> linking with the "null" CSI-tool (which consists solely of empty hooks)
>>> slows down the build time of the Apache HTTP server by less than 40%, and
>>> the resulting tool-instrumented executable is as fast as the original
>>> uninstrumented code.
>>>
>>>
>>> ================
>>> CSI version 1
>>> ================
>>>
>>> The initial version of CSI supports a basic set of hooks that covers the
>>> following categories of program objects: functions, function exits
>>> (specifically, returns), basic blocks, call sites, loads, and stores.  We
>>> prioritized instrumenting these IR objects based on the need of seven
>>> example CSI tools, including a race detector, a cache-reuse analyzer, and a
>>> code-coverage analyzer.  We plan to evolve the CSI API over time to be more
>>> comprehensive, and we have designed the CSI API to be extensible, allowing
>>> new instrumentation to be added as needs grow.  We chose to initially
>>> implement a minimal "core" set of hooks, because we felt it was best to add
>>> new instrumentation on an as-needed basis in order to keep the interface
>>> simple.
>>>
>>> There are three salient features about the design of CSI.  First, CSI
>>> assigns each instrumented program object a unique integer identifier within
>>> one of the (currently) six program-object categories.  Within each
>>> category, the ID's are consecutively numbered from 0 up to the number of
>>> such objects minus 1.  The contiguous assignment of the ID's allows the
>>> tool writer to easily keep track of IR objects in the program and iterate
>>> through all objects in a category (whether the object is encountered during
>>> execution or not).  Second, CSI provides a platform-independent means to
>>> relate a given program object to locations in the source code.
>>> Specifically, CSI provides "front-end-data (FED)" tables, which provide
>>> file name and source lines for each program object given the object's ID.
>>> Third, each CSI hook takes in as a parameter a "property": a 64-bit
>>> unsigned integer that CSI uses to export the results of compiler analyses
>>> and other information known at compile time.  The use of properties allow
>>> the tool to rely on compiler analyses to optimize instrumentation and
>>> decrease overhead.  In particular, since the value of a property is known
>>> at compile time, LTO can constant-fold the conditional test around a
>>> property to elide unnecessary instrumentation.
>>>
>>> ================
>>> Future plan
>>> ================
>>>
>>> We plan to expand CSI in future versions by instrumenting additional
>>> program objects, such as atomic instructions, floating-point instructions,
>>> and exceptions.  We are also planning to provide additional static
>>> information to tool writers, both through information encoded in the
>>> properties passed to hooks and by other means.  In particular, we are also
>>> looking at mechanisms to present tool writers with more complex static
>>> information, such as how different program objects relate to each other,
>>> e.g., which basic blocks belong to a given function.
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>

-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160617/bd8738e3/attachment.html>