[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Fri Mar 30 11:49:53 PDT 2012

On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote:
> Dear LLVMers,
>
>    My name is Raphael Ernani, and I am doing my MsC at the Federal
> University of Minas Gerais, Brazil. I have been using LLVM for a
> while, and I would like to participate in this year's Summer of Code.
> One particular idea, in your "open projects" page caught my eye, and I
> decided to write a proposal about it. The line that I liked in the
> page was "Create an LLVM pass that adds memory safety checks to code,
> like Valgrind does for binaries, or like mudflap does for gcc compiled
> code.", and my proposal is written below:

Hi Raphael!

I haven't read your proposal in detail, but there are actually three 
projects that implement memory safety by transforming LLVM bitcode:

1) ASAN is integrated into LLVM.  It is meant as a low-overhead 
debugging tool in the spirit of Valgrind.  It's primary technique is to 
place guard memory in between memory objects and check loads and stores 
to see if they return a poisoned bit-patter from a guard.  It isn't 
sound (certain dangling pointer errors and array bound violations can 
get past it), but it's easy to implement.

2) SAFECode (my project: http://safecode.cs.illinois.edu) aims to 
provide sound memory safety checks and is a sub-project of LLVM.  It 
inserts function pointer checks, precise structure and array bounds 
checks, load/store checks, and checks on C library function calls into 
code.  It is also designed to work with automatic pool allocation to 
provide sound points-to analysis and partial type-safety, but these 
features are currently disabled because they are not sufficiently 
robust.  While initially designed to protect applications during 
production, SAFECode has a generic pass to add debug information to its 
run-time checks, essentially make it a valgrind replacement like ASAN.

3) SoftBound and its CETS extension 
(http://www.cis.upenn.edu/acg/softbound) have been integrated into the 
SAFECode compiler and can be enabled with options in SAFECode's clang.  
It provides array bounds checking and, with CETS, optional dangling 
pointer detection.

While I don't think we need another memory safety checker for LLVM, I do 
think that there's *a lot* of interesting projects in making each of 
these tools better.  Some ideas that come to mind:

o Creating common instrumentation passes for ASAN, SAFECode, and 
SoftBound to permit code reuse.  All three essentially place checks on 
loads, stores, and atomic operations.  If they use the same run-time 
check names, then the same optimizations can improve the speed of all three.

o Creating or improving optimizations for memory safety checks.  Static 
array bounds checking alone provides numerous options, such as 
implementing ABCD, using value-range analysis (mentioned in another GSoC 
post), investigating SMT solver-based techniques, etc, etc.  Other 
optimizations include hoisting checks out of loops, polishing and 
re-enabling the SAFECode type-safety optimization, and removing 
redundant load/store checks.

o Polishing and enabling the Baggy Bounds Checking (BBC) run-time.  This 
run-time can speed up the run-time checks and should be faster than the 
splay-tree approach that SAFECode currently uses.

I'll go ahead and create a list of open projects on the SAFECode web 
page (http://sva.cs.illinois.edu), and I'll remove the "Valgrind" tool 
project off the open projects page as there are now several memory 
safety tools built using LLVM.  If you want my opinion, I think the 
static array bounds checker or the monotonic loop optimization make 
nice, self-contained projects.

Finally, you might be interested in the Memory Safety Menagerie 
(http://sva.cs.illinois.edu/menagerie/).  This web page contains a whole 
list of papers on the subject of memory safety transforms.

-- John T.

>
>
> ================================================
> Adding memory safety checks to the LLVM bitcodes
> ================================================
>
>
> Objective
> ---------
>
> The objective of this project is to create an LLVM pass that adds
> memory safety checks to code, like Valgrind does for binaries, or like
> mudflap does for gcc compiled code.
>
>
> How it will be done
> -------------------
>
> We will achieve our goal by creating a LLVM pass that does the following:
> 1 - Propagate the array size to the sites in the program where the
> array is accessed.
> 2 - Insert code in the program to make a bound-check and throw an
> exception if the program tries to access a position out of the array
> bounds.
> 3 - Eliminate redundant bound-checks.
>
>
> Background
> ----------
>
> Invalid memory access is a major source of program failures. If a
> program statement dereferences a pointer that points to an invalid
> memory cell, the program is either aborted by the operating system or,
> often worse, the program continues to run with an undefined behavior.
> To avoid the latter, one can perform checks before every memory access
> at run time. For some programming languages (e.g., Java) this is done
> automatically by the compiler/run-time environment. For the language
> C, neither the compiler nor the run-time environment enforces
> memory-safety policies [1].
>
> An example of access violation is given by the program below. When
> this program runs, probably no visible error happens. However, the
> program violates the semantics of well defined C code: it is accessing
> the 11-th cell of an array that was declared with only 10 cells. Thus,
> the behavior of this program is undefined.
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++)
>                A[i] = i;
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> This type of behavior can be the source of some very hard-to-find
> bugs. If no explicit error occurs, other variables can have their
> values changed without any reasonable explanation. Finding and fixing
> this type of bug can be really hard, especially if the program runs
> with more than one active thread. There exists tools that have been
> built specially to track access violations like the one in the program
> above. Valgrind is probably the best known example. These tools are
> dynamic: they are used during the execution of the program, either
> emulating it, or instrumenting it to catch errors at runtime. Tools
> like Valgrind and mudflap are very useful; however, LLVM does not have
> any similar function yet. Our goal is to help filling this gap.
>
>
> How we plan to improve the code
> -------------------------------
>
> Before every read or write in a position of an array, our pass will
> add instructions to check if the index of the array is outside the
> bounds.
> The secured program, without optimizations, is shown below:
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i, tmp0;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                if (i<0 or i>10) raise("Array out of bounds.");
>                A[i] = i;
>        }
>        tmp0 = 11;
>        if (tmp0<0 or tmp0>10) raise("Array out of bounds.");
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> In order to accomplish this object, the new LLVM pass will have to
> perform a number of actions:
> 1) Collect the allocation sizes of each array. This can be done at
> runtime, simply reading the arguments passed to the allocation
> function.
> 2) Store the size of the array together with the allocated area. This
> can also be done dynamically, increasing in 32 bits the size of the
> allocated area that is given to the array base address.
> 3) Prefix any array access code with a bounds check to ensure that the
> used index is inside the correct limits.
>
> After this, to improve the performance of the "safe programs", I can
> remove redundant checks, by not adding new checks where already exist
> a safety check. I can use two different abstract domains to accomplish
> this goal, a simpler, and another more complex one:
>
> 1) The domain of "less-than" names. This is the domain used in the
> ABCD array bounds check algorithm [2]. For each variable v is gives
> the set of variables that are provably less than v. It is relatively
> simple - in terms of implementation - to compute this domain; however,
> this computation is costly: it is cubic on the number of variables in
> the program. Nevertheless, I can implement the domain, to see if it
> pays off its costs. There was an implementation of ABCD in LLVM 2.7,
> and I can reuse a substantial part of that code.
>
> 2) The domain of intervals. For each variable v, this domain gives an
> interval [l, u] with the minimum and maximum values that v can assume
> during the program execution. I can use the implementation of range
> analysis that we have developed in the Federal University of Minas
> Gerais to compute this domain. The implementation is quite mature now,
> and I was one of the developers that worked on it.
>
> Linked programs
> ----------------
>
> In order to be effective, the instrumentation pass should be a whole
> program analysis. However, even in this case, only the code that has
> been compiled with the pass will be secure. It is possible that
> library calls still lead to undefined behavior, like in the program
> below:
>
> // We do not have access to this code, because it is an external library:
> void foo(int* A, int i){
>        A[i] = i;
> }
>
> // We have secured this code with our instrumentation:
> #include <stdlib.h>
> #include <stdio.h>
> #include "A.h"
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                foo(A,i);
>        }
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> Timeline
> --------
>
> This is a three months project, and we would like to schedule the time
> in the following way (schedule given in weeks):
>
> [1-4]: Change the allocation scheme used by LLVM to store the array
> size next to the area that is allocated to the array.
>
> [5-6] Write the pass to insert access checks before array accesses.
>
> [7-10] Improve the performance of safe programs.
> - Avoid inserting bound-checks where already exists a memory check.
> - Avoid inserting bound-checks where the "less-than" domain indicates
> that the memory access is safe.
>
> [11-12] Final tests and report
> - Correctly compile the whole LLVM test suite.
> - Generate performance numbers.
> - Write the final report.
>
>
> My background
> -------------
>
> I am currently a first year master student at the Federal University
> of Minas Gerais, under orientation of professor Fernando Pereira. I
> have graduated in this school on the year of 2011. I've worked with
> software development for five years, in different projects. I built a
> Data Warehouse viewer and Business Intelligence graphics and gauges
> using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
> a logistics planning system, in Excel VBA. I have five years of
> experience with SQL. Now, I'm working with LLVM, writing passes to
> support scientifical researchers.
>
>
> Why I can do well in this project
> ---------------------------------
>
> I believe that I am a good candidate for this Summer of Code for four
> reasons. Firstly, I am a master student of computer science in the
> Federal
> University of Minas Gerais (UFMG), which is one of the best schools of
> computer science in Brazil. To support this claim, notice that UFMG
> has the maximum grade in the ranking of CAPES (Coordenação de
> Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
> under the Ministry of Education). I am working in the compiler's lab,
> under the advice of Fernando Pereira, who was a Summer of Coder
> himself in 2006. Second, I have good experience with programming, with
> almost six years of experience in many languages (Java, C, C++, C#,
> VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
> LLVM passes. I already know some LLVM tools and I already have the
> basic knowledge to start this project. I have written a pass to
> instrument LLVM bitcodes, in order to find the minimum and maximum
> values that each variable assumes at runtime. This code has been used
> to check the precision of our range analysis. It is publicly available
> at 
> (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation). 
>
> Finally, in the lab where I work there are six other people
> who work everyday with LLVM; thus, in addition to getting help in the
> forum, I can easily talk about my project to my colleagues.
>
> References
> ----------
>
> [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
> Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
> Springer-Verlag, 2005
>
> [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik
> and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120330/cc95419e/attachment.html>