[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Fri Mar 30 11:08:49 PDT 2012

Dear LLVMers,

   My name is Raphael Ernani, and I am doing my MsC at the Federal
University of Minas Gerais, Brazil. I have been using LLVM for a
while, and I would like to participate in this year's Summer of Code.
One particular idea, in your "open projects" page caught my eye, and I
decided to write a proposal about it. The line that I liked in the
page was "Create an LLVM pass that adds memory safety checks to code,
like Valgrind does for binaries, or like mudflap does for gcc compiled
code.", and my proposal is written below:

================================================
Adding memory safety checks to the LLVM bitcodes
================================================

Objective
---------

The objective of this project is to create an LLVM pass that adds
memory safety checks to code, like Valgrind does for binaries, or like
mudflap does for gcc compiled code.

How it will be done
-------------------

We will achieve our goal by creating a LLVM pass that does the following:
1 - Propagate the array size to the sites in the program where the
array is accessed.
2 - Insert code in the program to make a bound-check and throw an
exception if the program tries to access a position out of the array
bounds.
3 - Eliminate redundant bound-checks.

Background
----------

Invalid memory access is a major source of program failures. If a
program statement dereferences a pointer that points to an invalid
memory cell, the program is either aborted by the operating system or,
often worse, the program continues to run with an undefined behavior.
To avoid the latter, one can perform checks before every memory access
at run time. For some programming languages (e.g., Java) this is done
automatically by the compiler/run-time environment. For the language
C, neither the compiler nor the run-time environment enforces
memory-safety policies [1].

An example of access violation is given by the program below. When
this program runs, probably no visible error happens. However, the
program violates the semantics of well defined C code: it is accessing
the 11-th cell of an array that was declared with only 10 cells. Thus,
the behavior of this program is undefined.

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
       int i;
       int* A;
       A = (int*)calloc(10,sizeof(int));
       for(i=0;i<=11;i++)
               A[i] = i;
       printf("%d\n",A[11]);
       return 0;
}

This type of behavior can be the source of some very hard-to-find
bugs. If no explicit error occurs, other variables can have their
values changed without any reasonable explanation. Finding and fixing
this type of bug can be really hard, especially if the program runs
with more than one active thread. There exists tools that have been
built specially to track access violations like the one in the program
above. Valgrind is probably the best known example. These tools are
dynamic: they are used during the execution of the program, either
emulating it, or instrumenting it to catch errors at runtime. Tools
like Valgrind and mudflap are very useful; however, LLVM does not have
any similar function yet. Our goal is to help filling this gap.

How we plan to improve the code
-------------------------------

Before every read or write in a position of an array, our pass will
add instructions to check if the index of the array is outside the
bounds.
The secured program, without optimizations, is shown below:

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
       int i, tmp0;
       int* A;
       A = (int*)calloc(10,sizeof(int));
       for(i=0;i<=11;i++){
               if (i<0 or i>10) raise("Array out of bounds.");
               A[i] = i;
       }
       tmp0 = 11;
       if (tmp0<0 or tmp0>10) raise("Array out of bounds.");
       printf("%d\n",A[11]);
       return 0;
}

In order to accomplish this object, the new LLVM pass will have to
perform a number of actions:
1) Collect the allocation sizes of each array. This can be done at
runtime, simply reading the arguments passed to the allocation
function.
2) Store the size of the array together with the allocated area. This
can also be done dynamically, increasing in 32 bits the size of the
allocated area that is given to the array base address.
3) Prefix any array access code with a bounds check to ensure that the
used index is inside the correct limits.

After this, to improve the performance of the "safe programs", I can
remove redundant checks, by not adding new checks where already exist
a safety check. I can use two different abstract domains to accomplish
this goal, a simpler, and another more complex one:

1) The domain of "less-than" names. This is the domain used in the
ABCD array bounds check algorithm [2]. For each variable v is gives
the set of variables that are provably less than v. It is relatively
simple - in terms of implementation - to compute this domain; however,
this computation is costly: it is cubic on the number of variables in
the program. Nevertheless, I can implement the domain, to see if it
pays off its costs. There was an implementation of ABCD in LLVM 2.7,
and I can reuse a substantial part of that code.

2) The domain of intervals. For each variable v, this domain gives an
interval [l, u] with the minimum and maximum values that v can assume
during the program execution. I can use the implementation of range
analysis that we have developed in the Federal University of Minas
Gerais to compute this domain. The implementation is quite mature now,
and I was one of the developers that worked on it.

Linked programs
----------------

In order to be effective, the instrumentation pass should be a whole
program analysis. However, even in this case, only the code that has
been compiled with the pass will be secure. It is possible that
library calls still lead to undefined behavior, like in the program
below:

// We do not have access to this code, because it is an external library:
void foo(int* A, int i){
       A[i] = i;
}

// We have secured this code with our instrumentation:
#include <stdlib.h>
#include <stdio.h>
#include "A.h"
int main(int argc, char** argv){
       int i;
       int* A;
       A = (int*)calloc(10,sizeof(int));
       for(i=0;i<=11;i++){
               foo(A,i);
       }
       printf("%d\n",A[11]);
       return 0;
}

Timeline
--------

This is a three months project, and we would like to schedule the time
in the following way (schedule given in weeks):

[1-4]: Change the allocation scheme used by LLVM to store the array
size next to the area that is allocated to the array.

[5-6] Write the pass to insert access checks before array accesses.

[7-10] Improve the performance of safe programs.
- Avoid inserting bound-checks where already exists a memory check.
- Avoid inserting bound-checks where the "less-than" domain indicates
that the memory access is safe.

[11-12] Final tests and report
- Correctly compile the whole LLVM test suite.
- Generate performance numbers.
- Write the final report.

My background
-------------

I am currently a first year master student at the Federal University
of Minas Gerais, under orientation of professor Fernando Pereira. I
have graduated in this school on the year of 2011. I've worked with
software development for five years, in different projects. I built a
Data Warehouse viewer and Business Intelligence graphics and gauges
using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
a logistics planning system, in Excel VBA. I have five years of
experience with SQL. Now, I'm working with LLVM, writing passes to
support scientifical researchers.

Why I can do well in this project
---------------------------------

I believe that I am a good candidate for this Summer of Code for four
reasons. Firstly, I am a master student of computer science in the
Federal
University of Minas Gerais (UFMG), which is one of the best schools of
computer science in Brazil. To support this claim, notice that UFMG
has the maximum grade in the ranking of CAPES (Coordenação de
Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
under the Ministry of Education). I am working in the compiler's lab,
under the advice of Fernando Pereira, who was a Summer of Coder
himself in 2006. Second, I have good experience with programming, with
almost six years of experience in many languages (Java, C, C++, C#,
VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
LLVM passes. I already know some LLVM tools and I already have the
basic knowledge to start this project. I have written a pass to
instrument LLVM bitcodes, in order to find the minimum and maximum
values that each variable assumes at runtime. This code has been used
to check the precision of our range analysis. It is publicly available
at (
http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation
).
Finally, in the lab where I work there are six other people
who work everyday with LLVM; thus, in addition to getting help in the
forum, I can easily talk about my project to my colleagues.

References
----------

[1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
Springer-Verlag, 2005

[2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik
and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120330/f1bc10b9/attachment.html>