[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Tue Apr 3 07:54:38 PDT 2012

Dear LLVMers,

I wrote a new proposal, to improve the static array bounds checking in
SAFEcode, as follows:

Improving static array bounds checking in SAFEcode
==================================================

Objective
---------

the main objective of this project is to improve the static array
bounds checking engine used in SAFECode. It was written after the open
project at http://safecode.cs.illinois.edu/projects.html, which reads
as follows:

"Improve static array bounds checking: SAFECode used to have an
inter-procedural static array bounds checking pass, but it suffered
serious scalability issues and was eventually disabled. Improving the
old code or writing a new pass would be extremely beneficial."

I plan to build a new array bounds checking algorithm from scratch.

How it will be done
-------------------

I will build an array bounds checking algorithm based on the ABCD
algorithm that was described by Bodik in the paper [3]. The key
insight in the ABCD was to use a lattice that some researchers, e.g.,
[2], call the "less-than" lattice. Instead of determining precise
ranges for the variables using for instance, the interval lattice,
Bodik's less-than lattice finds, for each variable v, the set of other
variables that are less-than it.
   It is a requirement that the algorithm must be relient against the
wrapping semantics of 2-complement's arithmetic. The original
algorithm proposed by Bodik is not. However, it is possible to limit
Bodik's algorithm in such a way that only overflow resilient
equalities are taken into consideration. In other words, if the source
program contains a comparison such as:

if (x < y) {
 // p1
} else {
 // p2
}

  then we are guaranteed that inside p1 we have that x < y.

   I propose an analysis that, contrary to the original ABCD
algorithm is inter-procedural. To obtain context sensitiveness, which
is very important, I will initially use function inlining, which is
available in LLVM. If time allows it, I will implement true context
sensitiveness using an "on-demand" approach on the call graph of
functions. Context sensitiveness is remarkably important in this type
of domain because it is necessary to match array sizes and identifiers
across function calls in order to have a precise analysis, as
described in the paper [1].
   I do not plan to couple my analysis with the interval domain, but,
if the LLVM community judge that it should be done, than I can
implement my algorithm after the pentagon domain, proposed by Logozzo
and Fahndrich in 2010 [2]. The pentagon domain is a different approach
to get an analysis that is overflow resilient. It combines the
interval domain {-inf, ..., -1, 0, 1, ... +inf} with the "less-than"
domain. If some variable is bound to an interval like [l, u], where l
-inf and u < +inf, then we know that operations that add up to
MAX_INT - u to it will not produce overflow. Same for the lower limit.
I am reticent to use the pentagon domain because Logozzo et al report
some relatively long runtimes (in the order of minutes) and so this
approach might not attack the scalability issue that the SAFECode
maintainers would like to handle.

Background
----------

Invalid memory access is a major source of program failures. If a
program statement dereferences a pointer that points to an invalid
memory cell, the program is either aborted by the operating system or,
often worse, the program continues to run with an undefined behavior.
To avoid the latter, one can perform checks before every memory access
at run time. For some programming languages (e.g., Java) this is done
automatically by the compiler/run-time environment. For the language
C, neither the compiler nor the run-time environment enforces
memory-safety policies [1].

An example of access violation is given by the program below. When
this program runs, probably no visible error happens. However, the
program violates the semantics of well defined C code: it is accessing
the 11-th cell of an array that was declared with only 10 cells. Thus,
the behavior of this program is undefined.

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
      int i;
      int* A;
      A = (int*)calloc(10,sizeof(int));
      for(i=0;i<=11;i++)
              A[i] = i;
      printf("%d\n",A[11]);
      return 0;
}

This type of behavior can be the source of some very hard-to-find
bugs. If no explicit error occurs, other variables can have their
values changed without any reasonable explanation. Finding and fixing
this type of bug can be really hard, especially if the program runs
with more than one active thread. There exists tools that have been
built specially to track access violations like the one in the program
above. Valgrind is probably the best known example. These tools are
dynamic: they are used during the execution of the program, either
emulating it, or instrumenting it to catch errors at runtime. We would
like to implement the detection of array bounds violation statically.

Which results the analysis will generate
-------------------------------------------

This analysis will classify each array access site into one of two categories:
(i) always safe
(ii) possibly unsafe

The locations classified as always safe will never be able to cause an
array access violation, even if we take the wrapping semantics of
integer arithmetics of C/C++.

Additionally, I would like to implement an algorithm that inserts
safety checks in some possibly unsafe locations. The less-than domain
keeps, for each array, the variables that are equal to its allocated
size. Thus, if this information is available in the sites that the
array is used, we can insert a safety check there. For instance,
consider the program below:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

char* alloca2(int size) {
 return (char*) malloc(size * sizeof(char));
}

int main() {
 char* c = alloca2(10);
 char* x;
 strcpy(c, "hello!");
 x = c;
 while (*x != '\0') {
   putc(*x, stdout);
   x++;
 }
}

The inter-procedural analysis should be able to relate 10 with the
size of array c. Then it could insert the following test in the code:

int main() {
 char* c = alloca2(10);
 char* x;
 strcpy(c, "hello!");
 x = c;
 while (*x != '\0') {
   if (x > c + 10) {
     abort();
   }
   putc(*x, stdout);
   x++;
 }
}

It is not possible to secure statically every array access site, but
we should be able to secure at least a good percentage of them.

Timeline
--------

This is a three months project, and we would like to schedule the time
in the following way (schedule given in weeks):

[1-4]: Implement the less-than relational domain.
[5-7] Implement static array bounds checking. The analysis should
point that a violation might happen in a given array access site.
[8-10] Implement a diagnosis tool, that inserts access checks whenever
we have the information of the limits of an array available.
[11-12] Final tests and report
- Correctly compile the whole LLVM test suite.
- Generate performance numbers.
- Write the final report.

My background
-------------

I am currently a first year master student at the Federal University
of Minas Gerais, under orientation of professor Fernando Pereira. I
have graduated in this school on the year of 2011. I've worked with
software development for five years, in different projects. I built a
Data Warehouse viewer and Business Intelligence graphics and gauges
using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
a logistics planning system, in Excel VBA. I have five years of
experience with SQL. Now, I'm working with LLVM, writing passes to
support scientifical researchers.

Why I can do well in this project
---------------------------------

I believe that I am a good candidate for this Summer of Code for four
reasons. Firstly, I am a master student of computer science in the
Federal
University of Minas Gerais (UFMG), which is one of the best schools of
computer science in Brazil. To support this claim, notice that UFMG
has the maximum grade in the ranking of CAPES (Coordenação de
Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
under the Ministry of Education). I am working in the compiler's lab,
under the advice of Fernando Pereira, who was a Summer of Coder
himself in 2006. Second, I have good experience with programming, with
almost six years of experience in many languages (Java, C, C++, C#,
VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
LLVM passes. I already know some LLVM tools and I already have the
basic knowledge to start this project. I have written a pass to
instrument LLVM bitcodes, in order to find the minimum and maximum
values that each variable assumes at runtime. This code has been used
to check the precision of our range analysis. It is publicly available
at (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation).
Finally, in the lab where I work there are six other people
who work everyday with LLVM; thus, in addition to getting help in the
forum, I can easily talk about my project to my colleagues.

References
----------

[1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
Springer-Verlag, 2005

[2] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly
relational abstract domain for the efficient validation of array
accesses. Sci. Comput. Program. 75(9): 796-807 (2010)

[3] Rastislav Bodík, Rajiv Gupta, Vivek Sarkar: ABCD: eliminating
array bounds checks on demand. PLDI 2000: 321-333