[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Wed Apr 4 17:21:50 PDT 2012

On 4/3/12 9:54 AM, Raphael Ernani Rodrigues wrote:
> Dear LLVMers,
>
> I wrote a new proposal, to improve the static array bounds checking in
> SAFEcode, as follows:
>
>
>
> Improving static array bounds checking in SAFEcode
> ==================================================
>
>
> Objective
> ---------
>
> the main objective of this project is to improve the static array
> bounds checking engine used in SAFECode. It was written after the open
> project athttp://safecode.cs.illinois.edu/projects.html, which reads
> as follows:
>
> "Improve static array bounds checking: SAFECode used to have an
> inter-procedural static array bounds checking pass, but it suffered
> serious scalability issues and was eventually disabled. Improving the
> old code or writing a new pass would be extremely beneficial."
>
> I plan to build a new array bounds checking algorithm from scratch.

I think what you want to say here is that you're going to build a new 
static array bounds checking pass for SAFECode/LLVM but that the code 
may also be useful for LLVM sub-projects like VMKit and DragonEgg's Ada 
support.

As far as organization goes,  think you should provide the background 
section first and then describe how you will solve the problem.

> How it will be done
> -------------------
>
> I will build an array bounds checking algorithm based on the ABCD
> algorithm that was described by Bodik in the paper [3]. The key
> insight in the ABCD was to use a lattice that some researchers, e.g.,
> [2], call the "less-than" lattice. Instead of determining precise
> ranges for the variables using for instance, the interval lattice,
> Bodik's less-than lattice finds, for each variable v, the set of other
> variables that are less-than it.
>     It is a requirement that the algorithm must be relient against the

Replace reliant with resilient.

> wrapping semantics of 2-complement's arithmetic. The original
> algorithm proposed by Bodik is not. However, it is possible to limit
> Bodik's algorithm in such a way that only overflow resilient
> equalities are taken into consideration. In other words, if the source
> program contains a comparison such as:
>
> if (x<  y) {
>   // p1
> } else {
>   // p2
> }
>
>    then we are guaranteed that inside p1 we have that x<  y.

This example is not clear, does not demonstrate integer overflow, and is 
not related to memory safety.  I would use an example like this:

p = malloc ((x * sizeof (struct foo)) + 1);
p[x] =  ...

It is not necessarily the case that 0 <= x <= ((x * sizeof (struct foo)) 
+ 1).  Therefore, this array access, which looks safe at first glance, 
may not be safe.

>     I propose an analysis that, contrary to the original ABCD

s/contrary to/unlike

> algorithm is inter-procedural. To obtain context sensitiveness, which

s/sensitiveness/sensitivity
> is very important, I will initially use function inlining, which is
> available in LLVM. If time allows it, I will implement true context

If time permits

> sensitiveness using an "on-demand" approach on the call graph of
> functions. Context sensitiveness is remarkably important in this type
> of domain because it is necessary to match array sizes and identifiers
> across function calls in order to have a precise analysis, as
> described in the paper [1].
>     I do not plan to couple my analysis with the interval domain, but,
> if the LLVM community judge that it should be done, than I can
> implement my algorithm after the pentagon domain, proposed by Logozzo
> and Fahndrich in 2010 [2]. The pentagon domain is a different approach
> to get an analysis that is overflow resilient. It combines the
> interval domain {-inf, ..., -1, 0, 1, ... +inf} with the "less-than"
> domain. If some variable is bound to an interval like [l, u], where l
> -inf and u<  +inf, then we know that operations that add up to
> MAX_INT - u to it will not produce overflow. Same for the lower limit.
> I am reticent to use the pentagon domain because Logozzo et al report
> some relatively long runtimes (in the order of minutes) and so this
> approach might not attack the scalability issue that the SAFECode
> maintainers would like to handle.

I think you should remove the last paragraph; it doesn't seem relevant.  
Chances are good that you know more about these static analyses than the 
judges do.  There's no point in letting them select which algorithm you 
implement.

>
> Background
> ----------
>
> Invalid memory access is a major source of program failures. If a
> program statement dereferences a pointer that points to an invalid
> memory cell, the program is either aborted by the operating system or,
> often worse, the program continues to run with an undefined behavior.

You can make an even stronger argument: out of bounds pointers are used 
to subvert system security.

> To avoid the latter, one can perform checks before every memory access
> at run time. For some programming languages (e.g., Java) this is done
> automatically by the compiler/run-time environment. For the language
> C, neither the compiler nor the run-time environment enforces
> memory-safety policies [1].
>
> An example of access violation is given by the program below. When
> this program runs, probably no visible error happens. However, the
> program violates the semantics of well defined C code: it is accessing
> the 11-th cell of an array that was declared with only 10 cells. Thus,
> the behavior of this program is undefined.
>
> #include<stdlib.h>
> #include<stdio.h>
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++)
>                A[i] = i;
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> This type of behavior can be the source of some very hard-to-find
> bugs. If no explicit error occurs, other variables can have their
> values changed without any reasonable explanation. Finding and fixing
> this type of bug can be really hard, especially if the program runs
> with more than one active thread. There exists tools that have been
> built specially to track access violations like the one in the program
> above. Valgrind is probably the best known example. These tools are
> dynamic: they are used during the execution of the program, either
> emulating it, or instrumenting it to catch errors at runtime. We would
> like to implement the detection of array bounds violation statically.

First, Valgrind does not do bounds checking in its default mode; it 
merely finds invalid loads and stores.  Its ptrcheck tool does bounds 
checking but is slow and can sometimes be unreliable.

Second, you should actually mention that tools for dynamic array bounds 
checking do exist (primarily SAFECode and SoftBound).  You should 
emphasize that you're trying to optimize the speed at which 
SAFECode/SoftBound instrumented programs can run by optimizing away 
run-time array bounds checks.

Third, I don't think you need a C code example to demonstrate your 
point.  Everyone knows what a buffer overflow is.  I think you want to 
argue that buffer overflows cause security vulnerabilities and 
hard-to-fix bugs, there are tools that catch these errors at run-time, 
and you're going to try to optimize the run-time checks by using static 
array bounds checking.

Fourth, I think you should point out that your work can easily be 
modified to work with other tools such as VMKit, SoftBound, and the 
Dragonegg Ada front-end even if it targets SAFECode first.

>
> Which results the analysis will generate
> -------------------------------------------
>
> This analysis will classify each array access site into one of two categories:
> (i) always safe
> (ii) possibly unsafe
>
> The locations classified as always safe will never be able to cause an
> array access violation, even if we take the wrapping semantics of
> integer arithmetics of C/C++.
>
> Additionally, I would like to implement an algorithm that inserts
> safety checks in some possibly unsafe locations. The less-than domain
> keeps, for each array, the variables that are equal to its allocated
> size. Thus, if this information is available in the sites that the
> array is used, we can insert a safety check there. For instance,
> consider the program below:

As I mentioned before, SAFECode and SoftBound already insert the 
checks.  Your analysis would be used to remove those that are unneeded.

>
> #include<stdlib.h>
> #include<string.h>
> #include<stdio.h>
>
> char* alloca2(int size) {
>   return (char*) malloc(size * sizeof(char));
> }
>
> int main() {
>   char* c = alloca2(10);
>   char* x;
>   strcpy(c, "hello!");
>   x = c;
>   while (*x != '\0') {
>     putc(*x, stdout);
>     x++;
>   }
> }
>
> The inter-procedural analysis should be able to relate 10 with the
> size of array c. Then it could insert the following test in the code:
>
> int main() {
>   char* c = alloca2(10);
>   char* x;
>   strcpy(c, "hello!");
>   x = c;
>   while (*x != '\0') {
>     if (x>  c + 10) {
>       abort();
>     }
>     putc(*x, stdout);
>     x++;
>   }
> }
>
> It is not possible to secure statically every array access site, but
> we should be able to secure at least a good percentage of them.

Remove the code example and the last sentence.

>
>
> Timeline
> --------
>
> This is a three months project, and we would like to schedule the time
> in the following way (schedule given in weeks):
>
> [1-4]: Implement the less-than relational domain.

Will you be reusing infrastructure from your research group?  If so, 
make sure to say that.  Also specify the open-source license under which 
that infrastructure is released.  Finally, be sure to explain how that 
infrastructure is going to help you build the less-than analysis more 
quickly.
> [5-7] Implement static array bounds checking. The analysis should
> point that a violation might happen in a given array access site.
> [8-10] Implement a diagnosis tool, that inserts access checks whenever
> we have the information of the limits of an array available.

You could do that, but then you're not helping make SAFECode faster.  
Your static array bounds checking pass should be used to remove run-time 
checks on GEPs that your analysis shows are always safe.

> [11-12] Final tests and report
> - Correctly compile the whole LLVM test suite.
> - Generate performance numbers.
> - Write the final report.
>
>
> My background
> -------------
>
> I am currently a first year master student at the Federal University
> of Minas Gerais, under orientation of professor Fernando Pereira. I
> have graduated in this school on the year of 2011.

I assume you have graduated with a Bachelor's.  If so, you should state 
that explicitly.

Also, it's "in the year" and not "on the year."

>   I've worked with
> software development for five years, in different projects. I built a
> Data Warehouse viewer and Business Intelligence graphics and gauges
> using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
> a logistics planning system, in Excel VBA. I have five years of
> experience with SQL. Now, I'm working with LLVM, writing passes to
> support scientifical researchers.

s/scientifical researchers/scientific research

There are two problems with the above paragraph:

1) Your work in industry doesn't really argue for your strengths.  Can 
you tell us how large these programs are?  Are they still used today?  
Providing a few key details can make this stronger.

2) You don't highlight your C++ and LLVM programming skills, and nothing 
in your experience suggests that you have a strong compiler background.  
Can you explain more about what work you've done on the range analysis 
with LLVM?  Do you have any other C++ programming experience?

>
> Why I can do well in this project
> ---------------------------------
>
> I believe that I am a good candidate for this Summer of Code for four
> reasons. Firstly, I am a master student of computer science in the
> Federal
> University of Minas Gerais (UFMG), which is one of the best schools of
> computer science in Brazil. To support this claim, notice that UFMG
> has the maximum grade in the ranking of CAPES (Coordenação de
> Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
> under the Ministry of Education). I am working in the compiler's lab,

"working in the compiler lab"

> under the advice of Fernando Pereira, who was a Summer of Coder

"under the supervision"

> himself in 2006.

"who was a Google Summer of Code participant in 2006."

>   Second, I have good experience with programming, with
> almost six years of experience in many languages (Java, C, C++, C#,
> VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
> LLVM passes. I already know some LLVM tools and I already have the
> basic knowledge to start this project. I have written a pass to
> instrument LLVM bitcodes, in order to find the minimum and maximum
> values that each variable assumes at runtime.

Did this pass find the dynamic minimum and maximum values when the 
program was executed?

>   This code has been used
> to check the precision of our range analysis. It is publicly available
> at (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation).
> Finally, in the lab where I work there are six other people
> who work everyday with LLVM; thus, in addition to getting help in the
> forum, I can easily talk about my project to my colleagues.

I would drop the last sentence.  Having smart colleagues is great, but 
it's a weak argument when trying to convince someone to hire you.

>
> References
> ----------
>
> [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
> Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
> Springer-Verlag, 2005
>
> [2] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly
> relational abstract domain for the efficient validation of array
> accesses. Sci. Comput. Program. 75(9): 796-807 (2010)
>
> [3] Rastislav Bodík, Rajiv Gupta, Vivek Sarkar: ABCD: eliminating
> array bounds checks on demand. PLDI 2000: 321-333

You didn't provide any references for SAFECode.  See 
http://sva.cs.illinois.edu/pubs.html.

-- John T.