[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
John Criswell
criswell at illinois.edu
Fri Mar 30 11:49:53 PDT 2012
On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote:
> Dear LLVMers,
>
> My name is Raphael Ernani, and I am doing my MsC at the Federal
> University of Minas Gerais, Brazil. I have been using LLVM for a
> while, and I would like to participate in this year's Summer of Code.
> One particular idea, in your "open projects" page caught my eye, and I
> decided to write a proposal about it. The line that I liked in the
> page was "Create an LLVM pass that adds memory safety checks to code,
> like Valgrind does for binaries, or like mudflap does for gcc compiled
> code.", and my proposal is written below:
Hi Raphael!
I haven't read your proposal in detail, but there are actually three
projects that implement memory safety by transforming LLVM bitcode:
1) ASAN is integrated into LLVM. It is meant as a low-overhead
debugging tool in the spirit of Valgrind. It's primary technique is to
place guard memory in between memory objects and check loads and stores
to see if they return a poisoned bit-patter from a guard. It isn't
sound (certain dangling pointer errors and array bound violations can
get past it), but it's easy to implement.
2) SAFECode (my project: http://safecode.cs.illinois.edu) aims to
provide sound memory safety checks and is a sub-project of LLVM. It
inserts function pointer checks, precise structure and array bounds
checks, load/store checks, and checks on C library function calls into
code. It is also designed to work with automatic pool allocation to
provide sound points-to analysis and partial type-safety, but these
features are currently disabled because they are not sufficiently
robust. While initially designed to protect applications during
production, SAFECode has a generic pass to add debug information to its
run-time checks, essentially make it a valgrind replacement like ASAN.
3) SoftBound and its CETS extension
(http://www.cis.upenn.edu/acg/softbound) have been integrated into the
SAFECode compiler and can be enabled with options in SAFECode's clang.
It provides array bounds checking and, with CETS, optional dangling
pointer detection.
While I don't think we need another memory safety checker for LLVM, I do
think that there's *a lot* of interesting projects in making each of
these tools better. Some ideas that come to mind:
o Creating common instrumentation passes for ASAN, SAFECode, and
SoftBound to permit code reuse. All three essentially place checks on
loads, stores, and atomic operations. If they use the same run-time
check names, then the same optimizations can improve the speed of all three.
o Creating or improving optimizations for memory safety checks. Static
array bounds checking alone provides numerous options, such as
implementing ABCD, using value-range analysis (mentioned in another GSoC
post), investigating SMT solver-based techniques, etc, etc. Other
optimizations include hoisting checks out of loops, polishing and
re-enabling the SAFECode type-safety optimization, and removing
redundant load/store checks.
o Polishing and enabling the Baggy Bounds Checking (BBC) run-time. This
run-time can speed up the run-time checks and should be faster than the
splay-tree approach that SAFECode currently uses.
I'll go ahead and create a list of open projects on the SAFECode web
page (http://sva.cs.illinois.edu), and I'll remove the "Valgrind" tool
project off the open projects page as there are now several memory
safety tools built using LLVM. If you want my opinion, I think the
static array bounds checker or the monotonic loop optimization make
nice, self-contained projects.
Finally, you might be interested in the Memory Safety Menagerie
(http://sva.cs.illinois.edu/menagerie/). This web page contains a whole
list of papers on the subject of memory safety transforms.
-- John T.
>
>
> ================================================
> Adding memory safety checks to the LLVM bitcodes
> ================================================
>
>
> Objective
> ---------
>
> The objective of this project is to create an LLVM pass that adds
> memory safety checks to code, like Valgrind does for binaries, or like
> mudflap does for gcc compiled code.
>
>
> How it will be done
> -------------------
>
> We will achieve our goal by creating a LLVM pass that does the following:
> 1 - Propagate the array size to the sites in the program where the
> array is accessed.
> 2 - Insert code in the program to make a bound-check and throw an
> exception if the program tries to access a position out of the array
> bounds.
> 3 - Eliminate redundant bound-checks.
>
>
> Background
> ----------
>
> Invalid memory access is a major source of program failures. If a
> program statement dereferences a pointer that points to an invalid
> memory cell, the program is either aborted by the operating system or,
> often worse, the program continues to run with an undefined behavior.
> To avoid the latter, one can perform checks before every memory access
> at run time. For some programming languages (e.g., Java) this is done
> automatically by the compiler/run-time environment. For the language
> C, neither the compiler nor the run-time environment enforces
> memory-safety policies [1].
>
> An example of access violation is given by the program below. When
> this program runs, probably no visible error happens. However, the
> program violates the semantics of well defined C code: it is accessing
> the 11-th cell of an array that was declared with only 10 cells. Thus,
> the behavior of this program is undefined.
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
> int i;
> int* A;
> A = (int*)calloc(10,sizeof(int));
> for(i=0;i<=11;i++)
> A[i] = i;
> printf("%d\n",A[11]);
> return 0;
> }
>
> This type of behavior can be the source of some very hard-to-find
> bugs. If no explicit error occurs, other variables can have their
> values changed without any reasonable explanation. Finding and fixing
> this type of bug can be really hard, especially if the program runs
> with more than one active thread. There exists tools that have been
> built specially to track access violations like the one in the program
> above. Valgrind is probably the best known example. These tools are
> dynamic: they are used during the execution of the program, either
> emulating it, or instrumenting it to catch errors at runtime. Tools
> like Valgrind and mudflap are very useful; however, LLVM does not have
> any similar function yet. Our goal is to help filling this gap.
>
>
> How we plan to improve the code
> -------------------------------
>
> Before every read or write in a position of an array, our pass will
> add instructions to check if the index of the array is outside the
> bounds.
> The secured program, without optimizations, is shown below:
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
> int i, tmp0;
> int* A;
> A = (int*)calloc(10,sizeof(int));
> for(i=0;i<=11;i++){
> if (i<0 or i>10) raise("Array out of bounds.");
> A[i] = i;
> }
> tmp0 = 11;
> if (tmp0<0 or tmp0>10) raise("Array out of bounds.");
> printf("%d\n",A[11]);
> return 0;
> }
>
> In order to accomplish this object, the new LLVM pass will have to
> perform a number of actions:
> 1) Collect the allocation sizes of each array. This can be done at
> runtime, simply reading the arguments passed to the allocation
> function.
> 2) Store the size of the array together with the allocated area. This
> can also be done dynamically, increasing in 32 bits the size of the
> allocated area that is given to the array base address.
> 3) Prefix any array access code with a bounds check to ensure that the
> used index is inside the correct limits.
>
> After this, to improve the performance of the "safe programs", I can
> remove redundant checks, by not adding new checks where already exist
> a safety check. I can use two different abstract domains to accomplish
> this goal, a simpler, and another more complex one:
>
> 1) The domain of "less-than" names. This is the domain used in the
> ABCD array bounds check algorithm [2]. For each variable v is gives
> the set of variables that are provably less than v. It is relatively
> simple - in terms of implementation - to compute this domain; however,
> this computation is costly: it is cubic on the number of variables in
> the program. Nevertheless, I can implement the domain, to see if it
> pays off its costs. There was an implementation of ABCD in LLVM 2.7,
> and I can reuse a substantial part of that code.
>
> 2) The domain of intervals. For each variable v, this domain gives an
> interval [l, u] with the minimum and maximum values that v can assume
> during the program execution. I can use the implementation of range
> analysis that we have developed in the Federal University of Minas
> Gerais to compute this domain. The implementation is quite mature now,
> and I was one of the developers that worked on it.
>
> Linked programs
> ----------------
>
> In order to be effective, the instrumentation pass should be a whole
> program analysis. However, even in this case, only the code that has
> been compiled with the pass will be secure. It is possible that
> library calls still lead to undefined behavior, like in the program
> below:
>
> // We do not have access to this code, because it is an external library:
> void foo(int* A, int i){
> A[i] = i;
> }
>
> // We have secured this code with our instrumentation:
> #include <stdlib.h>
> #include <stdio.h>
> #include "A.h"
> int main(int argc, char** argv){
> int i;
> int* A;
> A = (int*)calloc(10,sizeof(int));
> for(i=0;i<=11;i++){
> foo(A,i);
> }
> printf("%d\n",A[11]);
> return 0;
> }
>
> Timeline
> --------
>
> This is a three months project, and we would like to schedule the time
> in the following way (schedule given in weeks):
>
> [1-4]: Change the allocation scheme used by LLVM to store the array
> size next to the area that is allocated to the array.
>
> [5-6] Write the pass to insert access checks before array accesses.
>
> [7-10] Improve the performance of safe programs.
> - Avoid inserting bound-checks where already exists a memory check.
> - Avoid inserting bound-checks where the "less-than" domain indicates
> that the memory access is safe.
>
> [11-12] Final tests and report
> - Correctly compile the whole LLVM test suite.
> - Generate performance numbers.
> - Write the final report.
>
>
> My background
> -------------
>
> I am currently a first year master student at the Federal University
> of Minas Gerais, under orientation of professor Fernando Pereira. I
> have graduated in this school on the year of 2011. I've worked with
> software development for five years, in different projects. I built a
> Data Warehouse viewer and Business Intelligence graphics and gauges
> using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
> a logistics planning system, in Excel VBA. I have five years of
> experience with SQL. Now, I'm working with LLVM, writing passes to
> support scientifical researchers.
>
>
> Why I can do well in this project
> ---------------------------------
>
> I believe that I am a good candidate for this Summer of Code for four
> reasons. Firstly, I am a master student of computer science in the
> Federal
> University of Minas Gerais (UFMG), which is one of the best schools of
> computer science in Brazil. To support this claim, notice that UFMG
> has the maximum grade in the ranking of CAPES (Coordenação de
> Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
> under the Ministry of Education). I am working in the compiler's lab,
> under the advice of Fernando Pereira, who was a Summer of Coder
> himself in 2006. Second, I have good experience with programming, with
> almost six years of experience in many languages (Java, C, C++, C#,
> VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
> LLVM passes. I already know some LLVM tools and I already have the
> basic knowledge to start this project. I have written a pass to
> instrument LLVM bitcodes, in order to find the minimum and maximum
> values that each variable assumes at runtime. This code has been used
> to check the precision of our range analysis. It is publicly available
> at
> (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation).
>
> Finally, in the lab where I work there are six other people
> who work everyday with LLVM; thus, in addition to getting help in the
> forum, I can easily talk about my project to my colleagues.
>
> References
> ----------
>
> [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
> Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
> Springer-Verlag, 2005
>
> [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik
> and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120330/cc95419e/attachment.html>
More information about the llvm-dev
mailing list