[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Tue Apr 3 08:20:16 PDT 2012

On 4/3/12 9:54 AM, Raphael Ernani Rodrigues wrote:
> Dear LLVMers,
>
> I wrote a new proposal, to improve the static array bounds checking in
> SAFEcode, as follows:

I haven't had time to read over this carefully; hopefully I can do that 
tomorrow.  However, one quick comment is that SAFECode already has a 
pass that instruments GEPs (the array indexing instruction in LLVM) with 
run-time checks.  You can see the code at the URL below (or you can 
check out SAFECode and examine the source code):

http://llvm.org/viewvc/llvm-project/safecode/trunk/lib/InsertPoolChecks/GEPChecks.cpp?revision=136382&view=markup

SAFECode is moving to a design in which the Clang front-end inserts 
run-time checks and then later passes (including passes within libLTO) 
optimize those checks away.  It makes the code easier to maintain and 
understand, and it makes experiments easier.

Since your analysis will need to be run within libLTO *after* SAFECode 
has inserted its run-time checks, I recommend that you write a pass that 
*removes* run-time checks using your analysis.

I think you should read the (very early draft of the) SAFECode Software 
Architecture Manual located within the SAFECode source tree (you'll need 
LaTeX to compile it to a PDF file).  It describes the source code layout 
and provides an overview of how the analysis and instrumentation passes 
are integrated into Clang/libLTO.

-- John T.

>
>
>
> Improving static array bounds checking in SAFEcode
> ==================================================
>
>
> Objective
> ---------
>
> the main objective of this project is to improve the static array
> bounds checking engine used in SAFECode. It was written after the open
> project at http://safecode.cs.illinois.edu/projects.html, which reads
> as follows:
>
> "Improve static array bounds checking: SAFECode used to have an
> inter-procedural static array bounds checking pass, but it suffered
> serious scalability issues and was eventually disabled. Improving the
> old code or writing a new pass would be extremely beneficial."
>
> I plan to build a new array bounds checking algorithm from scratch.
>
> How it will be done
> -------------------
>
> I will build an array bounds checking algorithm based on the ABCD
> algorithm that was described by Bodik in the paper [3]. The key
> insight in the ABCD was to use a lattice that some researchers, e.g.,
> [2], call the "less-than" lattice. Instead of determining precise
> ranges for the variables using for instance, the interval lattice,
> Bodik's less-than lattice finds, for each variable v, the set of other
> variables that are less-than it.
>     It is a requirement that the algorithm must be relient against the
> wrapping semantics of 2-complement's arithmetic. The original
> algorithm proposed by Bodik is not. However, it is possible to limit
> Bodik's algorithm in such a way that only overflow resilient
> equalities are taken into consideration. In other words, if the source
> program contains a comparison such as:
>
> if (x<  y) {
>   // p1
> } else {
>   // p2
> }
>
>    then we are guaranteed that inside p1 we have that x<  y.
>
>     I propose an analysis that, contrary to the original ABCD
> algorithm is inter-procedural. To obtain context sensitiveness, which
> is very important, I will initially use function inlining, which is
> available in LLVM. If time allows it, I will implement true context
> sensitiveness using an "on-demand" approach on the call graph of
> functions. Context sensitiveness is remarkably important in this type
> of domain because it is necessary to match array sizes and identifiers
> across function calls in order to have a precise analysis, as
> described in the paper [1].
>     I do not plan to couple my analysis with the interval domain, but,
> if the LLVM community judge that it should be done, than I can
> implement my algorithm after the pentagon domain, proposed by Logozzo
> and Fahndrich in 2010 [2]. The pentagon domain is a different approach
> to get an analysis that is overflow resilient. It combines the
> interval domain {-inf, ..., -1, 0, 1, ... +inf} with the "less-than"
> domain. If some variable is bound to an interval like [l, u], where l
> -inf and u<  +inf, then we know that operations that add up to
> MAX_INT - u to it will not produce overflow. Same for the lower limit.
> I am reticent to use the pentagon domain because Logozzo et al report
> some relatively long runtimes (in the order of minutes) and so this
> approach might not attack the scalability issue that the SAFECode
> maintainers would like to handle.
>
>
> Background
> ----------
>
> Invalid memory access is a major source of program failures. If a
> program statement dereferences a pointer that points to an invalid
> memory cell, the program is either aborted by the operating system or,
> often worse, the program continues to run with an undefined behavior.
> To avoid the latter, one can perform checks before every memory access
> at run time. For some programming languages (e.g., Java) this is done
> automatically by the compiler/run-time environment. For the language
> C, neither the compiler nor the run-time environment enforces
> memory-safety policies [1].
>
> An example of access violation is given by the program below. When
> this program runs, probably no visible error happens. However, the
> program violates the semantics of well defined C code: it is accessing
> the 11-th cell of an array that was declared with only 10 cells. Thus,
> the behavior of this program is undefined.
>
> #include<stdlib.h>
> #include<stdio.h>
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++)
>                A[i] = i;
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> This type of behavior can be the source of some very hard-to-find
> bugs. If no explicit error occurs, other variables can have their
> values changed without any reasonable explanation. Finding and fixing
> this type of bug can be really hard, especially if the program runs
> with more than one active thread. There exists tools that have been
> built specially to track access violations like the one in the program
> above. Valgrind is probably the best known example. These tools are
> dynamic: they are used during the execution of the program, either
> emulating it, or instrumenting it to catch errors at runtime. We would
> like to implement the detection of array bounds violation statically.
>
>
> Which results the analysis will generate
> -------------------------------------------
>
> This analysis will classify each array access site into one of two categories:
> (i) always safe
> (ii) possibly unsafe
>
> The locations classified as always safe will never be able to cause an
> array access violation, even if we take the wrapping semantics of
> integer arithmetics of C/C++.
>
> Additionally, I would like to implement an algorithm that inserts
> safety checks in some possibly unsafe locations. The less-than domain
> keeps, for each array, the variables that are equal to its allocated
> size. Thus, if this information is available in the sites that the
> array is used, we can insert a safety check there. For instance,
> consider the program below:
>
> #include<stdlib.h>
> #include<string.h>
> #include<stdio.h>
>
> char* alloca2(int size) {
>   return (char*) malloc(size * sizeof(char));
> }
>
> int main() {
>   char* c = alloca2(10);
>   char* x;
>   strcpy(c, "hello!");
>   x = c;
>   while (*x != '\0') {
>     putc(*x, stdout);
>     x++;
>   }
> }
>
> The inter-procedural analysis should be able to relate 10 with the
> size of array c. Then it could insert the following test in the code:
>
> int main() {
>   char* c = alloca2(10);
>   char* x;
>   strcpy(c, "hello!");
>   x = c;
>   while (*x != '\0') {
>     if (x>  c + 10) {
>       abort();
>     }
>     putc(*x, stdout);
>     x++;
>   }
> }
>
> It is not possible to secure statically every array access site, but
> we should be able to secure at least a good percentage of them.
>
>
> Timeline
> --------
>
> This is a three months project, and we would like to schedule the time
> in the following way (schedule given in weeks):
>
> [1-4]: Implement the less-than relational domain.
> [5-7] Implement static array bounds checking. The analysis should
> point that a violation might happen in a given array access site.
> [8-10] Implement a diagnosis tool, that inserts access checks whenever
> we have the information of the limits of an array available.
> [11-12] Final tests and report
> - Correctly compile the whole LLVM test suite.
> - Generate performance numbers.
> - Write the final report.
>
>
> My background
> -------------
>
> I am currently a first year master student at the Federal University
> of Minas Gerais, under orientation of professor Fernando Pereira. I
> have graduated in this school on the year of 2011. I've worked with
> software development for five years, in different projects. I built a
> Data Warehouse viewer and Business Intelligence graphics and gauges
> using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
> a logistics planning system, in Excel VBA. I have five years of
> experience with SQL. Now, I'm working with LLVM, writing passes to
> support scientifical researchers.
>
>
> Why I can do well in this project
> ---------------------------------
>
> I believe that I am a good candidate for this Summer of Code for four
> reasons. Firstly, I am a master student of computer science in the
> Federal
> University of Minas Gerais (UFMG), which is one of the best schools of
> computer science in Brazil. To support this claim, notice that UFMG
> has the maximum grade in the ranking of CAPES (Coordenação de
> Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
> under the Ministry of Education). I am working in the compiler's lab,
> under the advice of Fernando Pereira, who was a Summer of Coder
> himself in 2006. Second, I have good experience with programming, with
> almost six years of experience in many languages (Java, C, C++, C#,
> VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
> LLVM passes. I already know some LLVM tools and I already have the
> basic knowledge to start this project. I have written a pass to
> instrument LLVM bitcodes, in order to find the minimum and maximum
> values that each variable assumes at runtime. This code has been used
> to check the precision of our range analysis. It is publicly available
> at (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation).
> Finally, in the lab where I work there are six other people
> who work everyday with LLVM; thus, in addition to getting help in the
> forum, I can easily talk about my project to my colleagues.
>
>
> References
> ----------
>
> [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
> Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
> Springer-Verlag, 2005
>
> [2] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly
> relational abstract domain for the efficient validation of array
> accesses. Sci. Comput. Program. 75(9): 796-807 (2010)
>
> [3] Rastislav Bodík, Rajiv Gupta, Vivek Sarkar: ABCD: eliminating
> array bounds checks on demand. PLDI 2000: 321-333