[cfe-dev] Making MSAN Easier to Use: Providing a Sanitized Libc++

Craig, Ben via cfe-dev cfe-dev at lists.llvm.org
Wed Aug 17 12:12:08 PDT 2016


On 8/16/2016 7:35 PM, Evgenii Stepanov via cfe-dev wrote:
> So, I'd argue that proper support for sanitized shared libraries
> (primarily libc++, but not just libc++) would require loader change.
> We could start by agreeing on and specifying a way a binary would
> declare its "sanitizer type" which could be used at runtime to change
> the library lookup path.
I don't think you need loader changes if you are willing to change the
soname and use version tags.  I also don't think loader changes would fix
any pathological case that soname and version-tag trickery can't.

Here's a brief proposal for a way to do this...
* Give sanitized libraries a different file name.  For example, 
libc++-msan.so.
* Give sanitized libraries a different soname.  For example, 
libc++-msan.so.1.
* Put all symbols in the sanitized libc++ under a version tag (LIBCPP_MSAN,
perhaps).  The "regular" build of libc++ will continue to use unversioned
symbols.
* Install the sanitized libc++ in the same directory as the regular libc++.
* Change the clang driver so that -fsanitize=memory will cause the 
linker to pull in libc++-msan.so instead of libc++.so.  This will cause 
the DT_NEEDED to point at the msan version of libc++, and it will cause 
all the unresolved symbols to point to @LIBCPP_MSAN versions of the symbols.
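
To illustrate why installing both libraries in one directory under different names is convenient, here is a toy model of the file lookup only. The loader searches each directory for the name recorded in DT_NEEDED, so two differently named libraries in the same directory never shadow each other. The helper names (resolve, demo) and the temporary directory are made up for illustration, and libc++.so.1 as the regular soname is an assumption; this is a sketch, not how ld.so is implemented.

```python
import os
import tempfile


def resolve(needed, search_dirs):
    """Return the first path in search_dirs containing a file named `needed`."""
    for directory in search_dirs:
        candidate = os.path.join(directory, needed)
        if os.path.exists(candidate):
            return candidate
    return None


def demo():
    with tempfile.TemporaryDirectory() as root:
        libdir = os.path.join(root, "lib")
        os.makedirs(libdir)
        # Regular and sanitized builds live side by side, told apart by name.
        for name in ("libc++.so.1", "libc++-msan.so.1"):
            open(os.path.join(libdir, name), "w").close()

        search = [libdir]  # stands in for a single LD_LIBRARY_PATH entry
        normal = resolve("libc++.so.1", search)     # DT_NEEDED of normal_tester
        msan = resolve("libc++-msan.so.1", search)  # DT_NEEDED of msan_tester
        return os.path.basename(normal), os.path.basename(msan)


print(demo())
```

With one search directory, each executable finds the library its DT_NEEDED entry names, regardless of which one was built with msan.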

How this works in mixed environments...
Case 1 (great!):
* User builds an msan version of an executable (msan_tester).  If you 
point 'nm' at msan_tester, you will see that it has a lot of standard 
library symbols with @LIBCPP_MSAN on them.  'ldd' will tell you that 
msan_tester will pull in libc++-msan.so, but not libc++.so.
* Suppose msan_tester does a dlopen and dlsym of a non-msan'd C++ 
library.  That C++ library was built against regular libc++. Regular 
libc++ gets loaded, but none of its symbols will get put in the global 
symbol table, because libc++-msan.so got there first.
* Happy day, only one version of libc++ is getting used (though two 
different ones got loaded).
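
As a rough way to check the 'nm' observation above, the sketch below filters nm-style output for symbols carrying a given version tag. The function name versioned_symbols and the sample text are hypothetical; the sample is hand-written to resemble what 'nm -D' might print for msan_tester, not captured from a real binary.

```python
def versioned_symbols(nm_output, tag):
    """Return names of symbols in nm-style output that carry version `tag`."""
    found = []
    for line in nm_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The symbol name is the last field, optionally followed by @VERSION.
        name, sep, version = fields[-1].partition("@")
        # Default-version definitions are printed as name@@TAG; drop the extra '@'.
        if sep and version.lstrip("@") == tag:
            found.append(name)
    return found


# Hypothetical sample, hand-written for illustration.
sample = """\
                 U _ZNSt3__14coutE@LIBCPP_MSAN
                 U _ZNSt3__14cerrE@LIBCPP_MSAN
                 U printf@GLIBC_2.2.5
0000000000401136 T main
"""

print(versioned_symbols(sample, "LIBCPP_MSAN"))
```

An empty result for an executable built with -fsanitize=memory would suggest it was linked against the regular libc++ instead.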

Case 2 (boo!):
* User builds a regular version of an executable (normal_tester)
* normal_tester does a dlopen and dlsym of an msan'd C++ library. The 
msan'd C++ library is still going to bind against the libc++-msan.so 
version of the symbols.
* Sad day.  Two versions of libc++ are being used.

Note that changing the loader wouldn't fix case 2 either, at least as I 
understand the proposal.

>
> Also, we can solve this for the case of -static-libstdc++ easily in
> the clang driver by looking under /msan/ subdirectory first. With
> that, we could replace the whole msan bootstrap instruction [1] with
> just "use -static-libstdc++".
>
> [1] https://github.com/google/sanitizers/wiki/MemorySanitizerBootstrappingClang
>
>
> On Mon, Aug 15, 2016 at 2:34 PM, Jonathan Roelofs
> <jonathan at codesourcery.com> wrote:
>>
>> On 8/15/16 1:51 PM, Hal Finkel wrote:
>>> ----- Original Message -----
>>>> From: "Jonathan Roelofs" <jonathan at codesourcery.com>
>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>> Cc: "Eric Fiselier" <eric at efcs.ca>, "clang developer list"
>>>> <cfe-dev at lists.llvm.org>, "Chandler Carruth"
>>>> <chandlerc at gmail.com>, "Kostya Serebryany" <kcc at google.com>,
>>>> "Evgenii Stepanov" <eugenis at google.com>
>>>> Sent: Monday, August 15, 2016 9:24:17 AM
>>>> Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a
>>>> Sanitized Libc++
>>>>
>>>>
>>>>
>>>> On 8/14/16 7:31 PM, Hal Finkel wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Jonathan Roelofs via cfe-dev" <cfe-dev at lists.llvm.org>
>>>>>> To: "Eric Fiselier" <eric at efcs.ca>, "clang developer list"
>>>>>> <cfe-dev at lists.llvm.org>, "Chandler Carruth"
>>>>>> <chandlerc at gmail.com>, "Kostya Serebryany" <kcc at google.com>,
>>>>>> "Evgenii Stepanov" <eugenis at google.com>
>>>>>> Sent: Sunday, August 14, 2016 7:07:00 PM
>>>>>> Subject: Re: [cfe-dev] Making MSAN Easier to Use: Providing a
>>>>>> Sanitized Libc++
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/14/16 4:05 PM, Eric Fiselier via cfe-dev wrote:
>>>>>>> Sanitizers such as MSAN require the entire program to be
>>>>>>> instrumented, anything less leads to plenty of false
>>>>>>> positives. Unfortunately this can be difficult to achieve,
>>>>>>> especially for the C and C++ standard libraries. To work
>>>>>>> around this the sanitizers provide interceptors for common C
>>>>>>> functions, but the same solution doesn't work as well for the
>>>>>>> C++ STL. Instead users are forced to manually build and link
>>>>>>> a custom sanitized libc++. This is a huge PITA and I would
>>>>>>> like to improve the situation, not just for MSAN but all
>>>>>>> sanitizers. I'm working on a proposal to change this. The
>>>>>>> basis of my proposal is:
>>>>>>>
>>>>>>> Clang should install/provide multiple sanitized versions of
>>>>>>> Libc++ and a mechanism to easily link them, as if they were
>>>>>>> a Compiler-RT runtime.
>>>>>>>
>>>>>>> The goal of this proposal is:
>>>>>>>
>>>>>>> (1) Greatly reduce the number of false positives caused by
>>>>>>> using an un-sanitized STL.
>>>>>>> (2) Allow sanitizers to catch user bugs that occur within the
>>>>>>> STL library, not just its headers.
>>>>>>>
>>>>>>> The basic steps I would like to take to achieve this are:
>>>>>>>
>>>>>>> (1) Teach the compiler-rt CMake how to build and install
>>>>>>> each sanitized libc++ version alongside its other runtimes.
>>>>>>> (2) Add options to the Clang driver to support linking/using
>>>>>>> these libraries.
>>>>>>>
>>>>>>> I think this proposal is likely to be contentious, so I
>>>>>>> would like to focus on its details. Once I have some
>>>>>>> feedback on these details I'll put together a formal
>>>>>>> proposal, including a plan for implementing it. The details I
>>>>>>> would like input on are:
>>>>>>>
>>>>>>> (A) What kind and how many sanitized versions of libc++
>>>>>>> should we provide?
>>>>>>> -----------------------------------------------------------------
>>>>>>>
>>>>>>> I think the minimum set would be Address (which includes Leak),
>>>>>>> Memory (With origin tracking?), Thread, and Undefined. Once
>>>>>>> we get into combinations of sanitizers things get more
>>>>>>> complicated. What other sanitizer combinations should we
>>>>>>> provide?
>>>>>>>
>>>>>>> (B) How should we handle UBSAN?
>>>>>>> ---------------------------------------------------
>>>>>>>
>>>>>>> UBSAN is really just a collection of sanitizers and
>>>>>>> providing sanitized versions of libc++ for every possible
>>>>>>> configuration is out of the question. Instead we should
>>>>>>> figure out what subset of UBSAN checks we want to enable in
>>>>>>> sanitized libc++ versions. I suspect we want to disable the
>>>>>>> following checks.
>>>>>>>
>>>>>>> * -fsanitize=vptr
>>>>>>> * -fsanitize=function
>>>>>>> * -fsanitize=float-divide-by-zero
>>>>>>>
>>>>>>> Additionally UBSAN can be combined with every other
>>>>>>> sanitizer group (i.e. Address, Memory, Thread). Do we want to
>>>>>>> provide a combination of UBSAN on/off for every group, or can
>>>>>>> we simply provide an over-sanitized version with UBSAN on?
>>>>>>>
>>>>>>> (C) How should the Clang driver expose the sanitized
>>>>>>> libraries to the users?
>>>>>>> -----------------------------------------------------------------
>>>>>>>
>>>>>>> I would like to propose the driver option '-fsanitize-stdlib' and
>>>>>>> '-fsanitize-stdlib=<sanitizer>'. The first version deduces
>>>>>>> the best sanitized version to use, the second allows it to
>>>>>>> be explicitly specified.
>>>>>>>
>>>>>>> A couple of other options are:
>>>>>>>
>>>>>>> * -fsanitize=foo: Implicitly turn on a sanitized STL. Clang
>>>>>>>   deduces which version.
>>>>>>> * -stdlib=libc++-<sanitizer>: Explicitly turn on and choose a
>>>>>>>   sanitized STL.
>>>>>>>
>>>>>>> (D) Should sanitized libc++ versions override libc++.so?
>>>>>>> -----------------------------------------------------------------
>>>>>>>
>>>>>>> For example, what happens when a program links to both a sanitized
>>>>>>> and non-sanitized libc++ version? Does the sanitized version
>>>>>>> replace the non-sanitized version, or should both versions
>>>>>>> be loaded into the program?
>>>>>>>
>>>>>>> Essentially I'm asking if the sanitized versions of libc++
>>>>>>> should have the "soname" libc++ so they can replace
>>>>>>> non-sanitized version, or if they should have a different
>>>>>>> "soname" so the linker treats them as a separate library.
>>>>>>>
>>>>>>> I haven't looked into the consequences of either approach in
>>>>>>> depth, but any input is appreciated.
>>>>>>
>>>>>> In a sense, these are /just/ multilibs, so my inclination would
>>>>>> be to make all the sonames the same, and just stick them in
>>>>>> appropriately named subfolders relative to their normal
>>>>>> location.
>>>>>
>>>>> I'm not sure that's true; there's no property of the environment
>>>>> that determines which library path you need. As a practical
>>>>> matter, I can't set $PLATFORM and/or $LIB in my rpath and have
>>>>> ld.so do the right thing in this context. Moreover, it is really
>>>>> a property of how you compiled, so I think using an alternate
>>>>> library name is natural.
>>>>
>>>> Multilibs solve exactly the problem of "it's a property of how you
>>>> compiled". The thing that's subtly different here is that the
>>>> usual thing that people do with multilibs is to provide ABI
>>>> incompatible versions of the same library (which are made
>>>> incompatible via compiler flags, -msoft-float, for example),
>>>> whereas these libraries just so happen to be ABI compatible with
>>>> their non-instrumented variants.
>>>>
>>>> I'm not sure I understand what you're saying about $PLATFORM and
>>>> $LIB, but I /think/ it's a red herring: the compiler takes care of
>>>> adding in the multilib suffixes where appropriate, so shouldn't the
>>>> answer to "which library do I stick in the rpath?" include said
>>>> suffix (when compiled with Eric's proposed flag)?
>>>
>>> I'm not sure what color herring it is ;) -- I'm trying to understand
>>> the system you're proposing:
>>>
>>> 1. User A compiles/installs Clang/LLVM/libc++ on system A in
>>> /local/clang, and so we get a /local/clang/lib/libc++.so and a
>>> /local/clang/lib/msan/libc++.so. User A compiles a program, foo, with
>>> msan enabled, and foo gets an rpath of /local/clang/lib/msan. User A
>>> also compiles another program, prod, without any sanitizers, and
>>> those get an rpath of /local/clang/lib.
>>>
>>> 2. User B compiles/installs Clang/LLVM/libc++ on system B in
>>> /soft/clang, and so we get a /soft/clang/lib/libc++.so and a
>>> /soft/clang/lib/msan/libc++.so. User A sends User B the executables
>>> foo and prod. Those executables have rpaths with /local/clang/...,
>>> but those don't help User B. User B has an environment with
>>> LD_LIBRARY_PATH=/soft/clang/lib so that the executables compiled by
>>> User A will run.
>>>
>>> 3. User B has no good option, because if LD_LIBRARY_PATH is set to
>>> /soft/clang/lib, then prod will behave as expected (i.e. not be
>>> sanitized), but foo will not. If LD_LIBRARY_PATH is set to
>>> /soft/clang/lib/msan, then foo will be sanitized as expected, but
>>> prod will run slower than usual.
>>
>> Ahhh, I see. I was imagining this sort of use case:
>>
>> first_guy$ cat lib.h
>> extern void lib_func();
>>
>> first_guy$ cat lib.c
>> #include "lib.h"
>>
>> #include <stdio.h>
>>
>> void lib_func() {
>>    printf("In %s\n", MESSAGE);
>> }
>> first_guy$ cat bin.c
>> #include "lib.h"
>>
>> int main() {
>>    lib_func();
>> }
>> first_guy$ mkdir -p lib/sanitized
>> first_guy$ clang lib.c -shared -DMESSAGE="\"sanitized\"" -o lib/sanitized/library.so
>> first_guy$ clang lib.c -shared -DMESSAGE="\"production\"" -o lib/library.so
>> first_guy$ clang bin.c -lrary -Wl,-rpath,$PWD/lib -L./lib/sanitized/ -o sanitized
>> first_guy$ clang bin.c -lrary -Wl,-rpath,$PWD/lib -L./lib/ -o production
>> first_guy$ ./sanitized
>> In sanitized
>> first_guy$ ./production
>> In production
>> first_guy$ mkdir ../other_guy
>> first_guy$ cd ../other_guy/
>> other_guy$ cp ../first_guy/sanitized .
>> other_guy$ cp ../first_guy/production .
>> other_guy$ cp -r ../first_guy/lib .
>> other_guy$ ./sanitized
>> In sanitized
>> other_guy$ ./production
>> In production
>> other_guy$ rm lib/library.so
>> other_guy$ ln -s ../lib/sanitized/library.so lib/library.so
>> other_guy$ ./production
>> In sanitized
>> other_guy$ ./sanitized
>> In sanitized
>>
>>
>> Jon
>>
>>
>>> 4. User B compiles programs to send to User A. User A then sets
>>> LD_LIBRARY_PATH to /local/clang/lib. User A has the same problem as
>>> User B. Moreover, if User A compiles using -Wl,--enable-new-dtags,
>>> then the linker will use DT_RUNPATH (instead of, or in addition to,
>>> DT_RPATH; the effect is the same), which is the recommended default
>>> on many systems, and the rpath scheme won't even work for User A on
>>> User A's own executables (because LD_LIBRARY_PATH overrides
>>> DT_RUNPATH).
>>>
>>> There are a few things, other than pure directory paths, that can
>>> appear in, or otherwise affect, LD_LIBRARY_PATH and
>>> DT_RPATH/DT_RUNPATH, but I don't think any of them help us here:
>>>
>>> 1. Pseudo variables $ORIGIN, $LIB and $PLATFORM - These are expanded
>>> by ld.so based on properties of the current execution environment
>>> (e.g. whether you're loading a 32-bit or 64-bit executable, the
>>> hardware architecture).
>>>
>>> 2. Hardware-capability strings - There is a fixed set of hardware
>>> capabilities, such as sse, sse2, altivec, etc. that are appended to
>>> the directory name to form alternate search paths.
>>>
>>> 3. The multilib suffix. This, AFAIK, is baked into the dynamic
>>> loader. The path to the loader itself has the multilib suffix, and
>>> that's specified in PT_INTERP.
>>>
>>> Unfortunately, I don't think that any of these help us.
>>>
>>> -Hal
>>>
>>>> Jon
>>>>
>>>>> -Hal
>>>>>
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>>> Conclusion
>>>>>>> -----------------
>>>>>>>
>>>>>>> I hope my proposal and questions have made sense. Any and
>>>>>>> all input is appreciated. Please let me know if anything
>>>>>>> needs clarification.
>>>>>>>
>>>>>>> /Eric
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________ cfe-dev
>>>>>>> mailing list cfe-dev at lists.llvm.org
>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>>>
>>>>>> -- Jon Roelofs jonathan at codesourcery.com CodeSourcery / Mentor
>>>>>> Embedded
>>>>>>
>>>> -- Jon Roelofs jonathan at codesourcery.com CodeSourcery / Mentor
>>>> Embedded
>>>>
>> --
>> Jon Roelofs
>> jonathan at codesourcery.com
>> CodeSourcery / Mentor Embedded

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


