[LLVMdev] RFC: How can AddressSanitizer, ThreadSanitizer, and similar runtime libraries leverage shared library code?

Tue Jun 19 22:46:03 PDT 2012

On Wed, Jun 20, 2012 at 9:39 AM, Chandler Carruth <chandlerc at google.com> wrote:

> On Tue, Jun 19, 2012 at 9:07 PM, Kostya Serebryany <kcc at google.com> wrote:
>
>> +dvyukov
>>
>> On Wed, Jun 20, 2012 at 7:12 AM, Chandler Carruth <chandlerc at google.com> wrote:
>>
>>> Hello folks (and sorry if I've forgotten to CC anyone with particular
>>> interest to this discussion...):
>>>
>>> I've been thinking a lot about how best to build advanced runtime
>>> libraries like ASan, and scale them up. Note that this does *not* try to
>>> address any licensing issues. For now, I'll consider those orthogonal /
>>> solvable w/o technical contortions. =]
>>>
>>> My primary motivation: we really, *really* need runtime libraries to be
>>> able to use common, shared libraries.
>>>
>>
>> I am not sure you understand the problem as we do.
>>
>> In short, asan/tsan/msan/etc can not use any function which is also
>> called from the instrumented binary.
>>
>
> Well, I can't be sure, but this description certainly agrees with my
> understanding -- you need *every* part of the runtime to be completely
> separate from *every* part of the instrumented binary. I'm with you there.
>
> In particular, I think the current strategy for libc & system calls makes
> perfect sense, and I'm not trying to suggest changing it.
>
> I think the most similar situation is this one:
>
>> In the previous version of ThreadSanitizer we used a private copy of
>> STLport in a separate namespace and a custom libc (small subset).
>>
>
> My proposal is very similar except without the need to modify the C++
> standard library in use. Instead, I'm suggesting post-processing the
> library to ensure that the standard C++ library code in the runtime is kept
> completely distinct from that in the instrumented binary -- everything would
> in fact be *mangled* differently.
>
> The goal would be to avoid the maintenance overhead of a custom C++
> standard library, and instead use a normal one. My understanding is that
> both GCC's libstdc++ and LLVM's libc++ are significantly higher quality
> than STLport, and if we're doing static linking, the code bloat should be
> greatly reduced. We could reduce it still further by doing LTO of the
> runtime library, which should be very straightforward given the rest of my
> proposal.
>
> It would still require a very small subset of libc, likely not much more
> than you already have.
>
>> This worked, but had problems too (Dmitry was very angry at STLport for
>> code bloat, stack size increase and some direct libc calls).
>>
>
> I would be interested to know if the above addresses most of the problems
> or not.
>
>
>> Until recently this was not causing too much pain in asan/tsan, but our
>> attempts to use the LLVM DWARF readers made it worse.
>> When tsan finds a race, we need to symbolize it online to be able to
>> match against a suppression and decide whether we want to emit the warning.
>> Today we do it in a separate addr2line process (ugly and slow).
>> But if we start calling the LLVM dwarf reader we end up with all possible
>> dependency problems (Dmitry and Alexey will know the exact ones) because
>> the LLVM code calls malloc, memcpy, etc.
>>
>> Frankly, I don't have any solution other than to change the code such
>> that it does not call libc/libc++.
>> Some of that may be solved by a private copy of STLport + a bit of custom
>> libc (but see above about STLport)
>>
>
> I think my proposal is essentially in between these two:
>
> - Avoid the need for a low quality STL by using a normal C++ standard
> library implementation, and avoid maintenance burden by doing a link-time
> mangling of the symbols.
>

Re-linking might be too platform-specific.
How about compiling the library into LLVM bitcode and adding
namespaces/prefixes to that bitcode?

--kcc

> - Provide the minimal custom libc, and do the same to it
> - Link the LLVM libraries against these, and munge their symbols as well
> - LTO the whole thing if needed to get the code bloat down
>
> I think this is actually easier than changing the LLVM libraries to not
> use the C++ standard libraries. I also think it is easier than
> re-implementing the LLVM libraries in question. But that doesn't mean I
> think it is easy. ;] I think it is quite hard, but it is the best solution
> I can come up with.
>
>
>>
>> --kcc
>>
>>
>>
>>> This starts with libraries such as the C++ standard library -- a runtime
>>> shouldn't need to re-implement std::vector. It includes other primitive
>>> libraries that have had significant effort put into them in LLVM such as
>>> the ADT and Support libraries. But, IMO, it has even more importance as we
>>> start looking at libraries such as ELF readers, DWARF readers, symbolizers,
>>> etc. This code should be shared, and shared easily, with other LLVM projects.
>>>
>>> However, clearly the runtime must at some point be linked against a
>>> program, and indeed programs which may be using *the same set of
>>> libraries*. It is crucially important that the runtime uses a separate
>>> implementation of the libraries from the ones used by the program itself:
>>> we will often compile the program's libraries with instrumentation and
>>> other features which we explicitly wish to avoid in the runtime. Even
>>> simple name clashes can cause problems, leading to the current practice of
>>> putting all of these runtime libraries into a '__sanitizer' or other
>>> specially spelled namespace.
>>>
>>> A final unusual requirement is that at least *some* of the code for the
>>> runtime libraries must be statically linked to have reasonable efficiency.
>>> We also have several use cases where it would be very convenient to link
>>> *all* of the runtime statically, so I prefer a solution that preserves this
>>> option.
>>>
>>> So how can we effectively share code? Here is my proposal, and a few
>>> alternate strategies.
>>>
>>> I suggest that we build the runtime library as-if it were not a runtime
>>> library at all, and just a normal library. No strange namespaces, no
>>> restrictions on what other libraries it uses with one exception: they must
>>> *all* be statically linkable. We build this as a normal archive library,
>>> nothing special. One nice property is that testing the runtime library
>>> becomes the same as testing any other library.
>>>
>>> Then, we have a special build step to produce a final archive which is
>>> actually *used* as the runtime library. This step works not dissimilarly to
>>> the step to link an executable: we build the list of archive libraries
>>> depended on, but instead of linking an executable, we run a linker script
>>> over them. This script will re-link each '.o' file from the transitive
>>> closure of archives, prepending a '__asan__' (or other runtime library
>>> prefix) onto each symbol; effectively mangling each symbol. All of these
>>> processed '.o' files would go into a single, final archive that would be
>>> the installed runtime library. The only functions not processed in this
>>> manner are a white list of "exported" functions from the runtime (C-library
>>> routines provided by the runtime, and runtime entry points, etc.).
>>>
>>> The result should be a runtime library that is essentially hermetic, and
>>> should have no clashes with binaries it links against. It would be free to
>>> use standard libraries, LLVM libraries, whatever it needs. That said, there
>>> are some clear disadvantages:
>>> - Bizarre name mangling, especially for C++
>>> - Potentially incompatible with C++ EH, libunwind, or other tools (I
>>> just don't know, haven't done enough research here)
>>> - Requires "relinking" the final runtime
>>> - Definitely implementable on Linux & ELF-based BSDs, I *think* do-able
>>> on Darwin, but I have no idea about Windows.
>>> - Other downsides? I'm probably missing some big problems here... ;]
>>>
>>> However, if we can make this (possibly with tweaks/modifications) work,
>>> I think the upside is quite large -- the runtime library stops having to be
>>> written in such a strange special sub-set of the language, etc.
>>>
>>>
>>> Note that this proposal is orthogonal to the issue of minimizing the
>>> binary size and cost of the runtime library -- that is clearly still an
>>> important concern, but that can be addressed with or without using
>>> other libraries. LLVM has lots of libraries specifically engineered to be
>>> lightweight in situations like this.
>>>
>>>
>>> Other alternatives that have been discussed:
>>>
>>> - Require isolating all shared code into a shared library (.so) that is
>>> loaded as-needed. This helps some, but it doesn't seem to fully solve the
>>> issues (where does the shared code go? the .so? What happens when it is
>>> loaded into a program that already has copies of the same code? What
>>> happens when one is instrumented and the other isn't?). It also requires us
>>> to ship the '.so' with the binary to get full functionality, something that
>>> would be at least somewhat undesirable. It also requires the runtime
>>> library developers to carefully partition the code into that which can go
>>> in the .a and that which can go in the .so.
>>>
>>> - The current strategy of re-implementing everything needed from
>>> (essentially) the ground up inside the runtime library. I think that this
>>> has serious long-term maintenance problems.... but who knows, maybe?
>>>
>>> - Other ideas?
>>>
>>
>>
>
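
For illustration, here is a minimal sketch of the renaming step described in the original proposal: every symbol gets the runtime prefix unless it is on the white list of exported functions. Nothing here comes from an actual implementation. The function names (build_rename_map, format_redefine_syms) and the sample symbols are hypothetical, and emitting a file for GNU objcopy's --redefine-syms option is only one assumed way to apply the result; the thread itself does not name a tool.

```python
def build_rename_map(symbols, prefix="__asan__", exported=frozenset()):
    """Map each non-exported symbol to its prefixed name.

    `symbols` should be the union of symbols over *all* object files in
    the transitive closure of archives, so that a definition in one
    object and a reference in another are renamed identically.
    """
    rename = {}
    for name in symbols:
        if name in exported:
            continue  # white-listed: runtime entry points, provided libc routines
        rename[name] = prefix + name
    return rename


def format_redefine_syms(rename):
    """Render the map as 'old new' lines, one pair per line.

    This is the layout GNU objcopy reads with --redefine-syms=FILE
    (assumed tool; not named in the thread).
    """
    return "".join(f"{old} {new}\n" for old, new in sorted(rename.items()))


if __name__ == "__main__":
    # Hypothetical symbol names, for illustration only.
    symbols = ["_ZN4llvm9StringRef4findEcm", "internal_helper", "__asan_init", "malloc"]
    exported = {"__asan_init", "malloc"}
    print(format_redefine_syms(build_rename_map(symbols, exported=exported)), end="")
```

The resulting file would then be applied to each '.o' file before the objects are collected into the final archive. Note that a plain textual prefix on an already-mangled C++ name is exactly the "bizarre name mangling" listed among the disadvantages.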