<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote:
<blockquote
cite="mid:CAKbogR0VYmKeHFgfoD880J+aAB452O_kYSNpeBCpvRRA8kyO3w@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
<span style="">Dear LLVMers,</span><br style="">
<br style="">
<span style=""> My name is Raphael Ernani, and I am doing my MsC
at the Federal</span><br style="">
<span style="">University of Minas Gerais, Brazil. I have been
using LLVM for a</span><br style="">
<span style="">while, and I would like to participate in this
year's Summer of Code.</span><br style="">
<span style="">One particular idea, in your "open projects" page
caught my eye, and I</span><br style="">
<span style="">decided to write a proposal about it. The line that
I liked in the</span><br style="">
<span style="">page was "Create an LLVM pass that adds memory
safety checks to code,</span><br style="">
<span style="">like Valgrind does for binaries, or like mudflap
does for gcc compiled</span><br style="">
<span style="">code.", and my proposal is written below:</span></blockquote>
<br>
Hi Raphael!<br>
<br>
I haven't read your proposal in detail, but there are actually three
projects that implement memory safety by transforming LLVM bitcode:<br>
<br>
1) ASAN is integrated into LLVM. It is meant as a low-overhead
debugging tool in the spirit of Valgrind. It's primary technique is
to place guard memory in between memory objects and check loads and
stores to see if they return a poisoned bit-patter from a guard. It
isn't sound (certain dangling pointer errors and array bound
violations can get past it), but it's easy to implement.<br>
<br>
2) SAFECode (my project: <a class="moz-txt-link-freetext" href="http://safecode.cs.illinois.edu">http://safecode.cs.illinois.edu</a>) aims to
provide sound memory safety checks and is a sub-project of LLVM. It
inserts function pointer checks, precise structure and array bounds
checks, load/store checks, and checks on C library function calls
into code. It is also designed to work with automatic pool
allocation to provide sound points-to analysis and partial
type-safety, but these features are currently disabled because they
are not sufficiently robust. While initially designed to protect
applications during production, SAFECode has a generic pass to add
debug information to its run-time checks, essentially make it a
valgrind replacement like ASAN.<br>
<br>
3) SoftBound and its CETS extension
(<a class="moz-txt-link-freetext" href="http://www.cis.upenn.edu/acg/softbound">http://www.cis.upenn.edu/acg/softbound</a>) have been integrated into
the SAFECode compiler and can be enabled with options in SAFECode's
clang. It provides array bounds checking and, with CETS, optional
dangling pointer detection.<br>
<br>
While I don't think we need another memory safety checker for LLVM,
I do think that there's *a lot* of interesting projects in making
each of these tools better. Some ideas that come to mind:<br>
<br>
o Creating common instrumentation passes for ASAN, SAFECode, and
SoftBound to permit code reuse. All three essentially place checks
on loads, stores, and atomic operations. If they use the same
run-time check names, then the same optimizations can improve the
speed of all three.<br>
<br>
o Creating or improving optimizations for memory safety checks.
Static array bounds checking alone provides numerous options, such
as implementing ABCD, using value-range analysis (mentioned in
another GSoC post), investigating SMT solver-based techniques, etc,
etc. Other optimizations include hoisting checks out of loops,
polishing and re-enabling the SAFECode type-safety optimization, and
removing redundant load/store checks.<br>
<br>
o Polishing and enabling the Baggy Bounds Checking (BBC) run-time.
This run-time can speed up the run-time checks and should be faster
than the splay-tree approach that SAFECode currently uses.<br>
<br>
I'll go ahead and create a list of open projects on the SAFECode web
page (<a class="moz-txt-link-freetext" href="http://sva.cs.illinois.edu">http://sva.cs.illinois.edu</a>), and I'll remove the "Valgrind"
tool project off the open projects page as there are now several
memory safety tools built using LLVM. If you want my opinion, I
think the static array bounds checker or the monotonic loop
optimization make nice, self-contained projects.<br>
<br>
Finally, you might be interested in the Memory Safety Menagerie
(<a class="moz-txt-link-freetext" href="http://sva.cs.illinois.edu/menagerie/">http://sva.cs.illinois.edu/menagerie/</a>). This web page contains a
whole list of papers on the subject of memory safety transforms.<br>
<br>
-- John T.<br>
<br>
<blockquote
cite="mid:CAKbogR0VYmKeHFgfoD880J+aAB452O_kYSNpeBCpvRRA8kyO3w@mail.gmail.com"
type="cite">
<div><span style=""><br>
</span></div>
<div><span style=""><br>
</span></div>
<div><span style="">==============================</span><span
style="">==================</span> <br style="">
<span style="">Adding memory safety checks to the LLVM bitcodes</span><br
style="">
<span style="">==============================</span><span
style="">==================</span><br style="">
<br style="">
<br style="">
<span style="">Objective</span><br style="">
<span style="">---------</span><br style="">
<br style="">
<span style="">The objective of this project is to create an
LLVM pass that adds</span><br style="">
<span style="">memory safety checks to code, like Valgrind does
for binaries, or like</span><br style="">
<span style="">mudflap does for gcc compiled code.</span><br
style="">
<br style="">
<br style="">
<span style="">How it will be done</span><br style="">
<span style="">-------------------</span><br style="">
<br style="">
<span style="">We will achieve our goal by creating a LLVM pass
that does the following:</span><br style="">
<span style="">1 - Propagate the array size to the sites in the
program where the</span><br style="">
<span style="">array is accessed.</span><br style="">
<span style="">2 - Insert code in the program to make a
bound-check and throw an</span><br style="">
<span style="">exception if the program tries to access a
position out of the array</span><br style="">
<span style="">bounds.</span><br style="">
<span style="">3 - Eliminate redundant bound-checks.</span><br
style="">
<br style="">
<br style="">
<span style="">Background</span><br style="">
<span style="">----------</span><br style="">
<br style="">
<span style="">Invalid memory access is a major source of
program failures. If a</span><br style="">
<span style="">program statement dereferences a pointer that
points to an invalid</span><br style="">
<span style="">memory cell, the program is either aborted by the
operating system or,</span><br style="">
<span style="">often worse, the program continues to run with an
undefined behavior.</span><br style="">
<span style="">To avoid the latter, one can perform checks
before every memory access</span><br style="">
<span style="">at run time. For some programming languages
(e.g., Java) this is done</span><br style="">
<span style="">automatically by the compiler/run-time
environment. For the language</span><br style="">
<span style="">C, neither the compiler nor the run-time
environment enforces</span><br style="">
<span style="">memory-safety policies [1].</span><br style="">
<br style="">
<span style="">An example of access violation is given by the
program below. When</span><br style="">
<span style="">this program runs, probably no visible error
happens. However, the</span><br style="">
<span style="">program violates the semantics of well defined C
code: it is accessing</span><br style="">
<span style="">the 11-th cell of an array that was declared with
only 10 cells. Thus,</span><br style="">
<span style="">the behavior of this program is undefined.</span><br
style="">
<br style="">
<span style="">#include <stdlib.h></span><br style="">
<span style="">#include <stdio.h></span><br style="">
<span style="">int main(int argc, char** argv){</span><br
style="">
<span style=""> int i;</span><br style="">
<span style=""> int* A;</span><br style="">
<span style=""> A = (int*)calloc(10,sizeof(int));</span><br
style="">
<span style=""> for(i=0;i<=11;i++)</span><br style="">
<span style=""> A[i] = i;</span><br style="">
<span style=""> printf("%d\n",A[11]);</span><br style="">
<span style=""> return 0;</span><br style="">
<span style="">}</span><br style="">
<br style="">
<span style="">This type of behavior can be the source of some
very hard-to-find</span><br style="">
<span style="">bugs. If no explicit error occurs, other
variables can have their</span><br style="">
<span style="">values changed without any reasonable
explanation. Finding and fixing</span><br style="">
<span style="">this type of bug can be really hard, especially
if the program runs</span><br style="">
<span style="">with more than one active thread. There exists
tools that have been</span><br style="">
<span style="">built specially to track access violations like
the one in the program</span><br style="">
<span style="">above. Valgrind is probably the best known
example. These tools are</span><br style="">
<span style="">dynamic: they are used during the execution of
the program, either</span><br style="">
<span style="">emulating it, or instrumenting it to catch errors
at runtime. Tools</span><br style="">
<span style="">like Valgrind and mudflap are very useful;
however, LLVM does not have</span><br style="">
<span style="">any similar function yet. Our goal is to help
filling this gap.</span><br style="">
<br style="">
<br style="">
<span style="">How we plan to improve the code</span><br
style="">
<span style="">------------------------------</span><span
style="">-</span><br style="">
<br style="">
<span style="">Before every read or write in a position of an
array, our pass will</span><br style="">
<span style="">add instructions to check if the index of the
array is outside the</span><br style="">
<span style="">bounds.</span><br style="">
<span style="">The secured program, without optimizations, is
shown below:</span><br style="">
<br style="">
<span style="">#include <stdlib.h></span><br style="">
<span style="">#include <stdio.h></span><br style="">
<span style="">int main(int argc, char** argv){</span><br
style="">
<span style=""> int i, tmp0;</span><br style="">
<span style=""> int* A;</span><br style="">
<span style=""> A = (int*)calloc(10,sizeof(int));</span><br
style="">
<span style=""> for(i=0;i<=11;i++){</span><br style="">
<span style=""> if (i<0 or i>10)
raise("Array out of bounds.");</span><br style="">
<span style=""> A[i] = i;</span><br style="">
<span style=""> }</span><br style="">
<span style=""> tmp0 = 11;</span><br style="">
<span style=""> if (tmp0<0 or tmp0>10) raise("Array
out of bounds.");</span><br style="">
<span style=""> printf("%d\n",A[11]);</span><br style="">
<span style=""> return 0;</span><br style="">
<span style="">}</span><br style="">
<br style="">
<span style="">In order to accomplish this object, the new LLVM
pass will have to</span><br style="">
<span style="">perform a number of actions:</span><br style="">
<span style="">1) Collect the allocation sizes of each array.
This can be done at</span><br style="">
<span style="">runtime, simply reading the arguments passed to
the allocation</span><br style="">
<span style="">function.</span><br style="">
<span style="">2) Store the size of the array together with the
allocated area. This</span><br style="">
<span style="">can also be done dynamically, increasing in 32
bits the size of the</span><br style="">
<span style="">allocated area that is given to the array base
address.</span><br style="">
<span style="">3) Prefix any array access code with a bounds
check to ensure that the</span><br style="">
<span style="">used index is inside the correct limits.</span><br
style="">
<br style="">
<span style="">After this, to improve the performance of the
"safe programs", I can</span><br style="">
<span style="">remove redundant checks, by not adding new checks
where already exist</span><br style="">
<span style="">a safety check. I can use two different abstract
domains to accomplish</span><br style="">
<span style="">this goal, a simpler, and another more complex
one:</span><br style="">
<br style="">
<span style="">1) The domain of "less-than" names. This is the
domain used in the</span><br style="">
<span style="">ABCD array bounds check algorithm [2]. For each
variable v is gives</span><br style="">
<span style="">the set of variables that are provably less than
v. It is relatively</span><br style="">
<span style="">simple - in terms of implementation - to compute
this domain; however,</span><br style="">
<span style="">this computation is costly: it is cubic on the
number of variables in</span><br style="">
<span style="">the program. Nevertheless, I can implement the
domain, to see if it</span><br style="">
<span style="">pays off its costs. There was an implementation
of ABCD in LLVM 2.7,</span><br style="">
<span style="">and I can reuse a substantial part of that code.</span><br
style="">
<br style="">
<span style="">2) The domain of intervals. For each variable v,
this domain gives an</span><br style="">
<span style="">interval [l, u] with the minimum and maximum
values that v can assume</span><br style="">
<span style="">during the program execution. I can use the
implementation of range</span><br style="">
<span style="">analysis that we have developed in the Federal
University of Minas</span><br style="">
<span style="">Gerais to compute this domain. The implementation
is quite mature now,</span><br style="">
<span style="">and I was one of the developers that worked on
it.</span><br style="">
<br style="">
<span style="">Linked programs</span><br style="">
<span style="">----------------</span><br style="">
<br style="">
<span style="">In order to be effective, the instrumentation
pass should be a whole</span><br style="">
<span style="">program analysis. However, even in this case,
only the code that has</span><br style="">
<span style="">been compiled with the pass will be secure. It is
possible that</span><br style="">
<span style="">library calls still lead to undefined behavior,
like in the program</span><br style="">
<span style="">below:</span><br style="">
<br style="">
<span style="">// We do not have access to this code, because it
is an external library:</span><br style="">
<span style="">void foo(int* A, int i){</span><br style="">
<span style=""> A[i] = i;</span><br style="">
<span style="">}</span><br style="">
<br style="">
<span style="">// We have secured this code with our
instrumentation:</span><br style="">
<span style="">#include <stdlib.h></span><br style="">
<span style="">#include <stdio.h></span><br style="">
<span style="">#include "A.h"</span><br style="">
<span style="">int main(int argc, char** argv){</span><br
style="">
<span style=""> int i;</span><br style="">
<span style=""> int* A;</span><br style="">
<span style=""> A = (int*)calloc(10,sizeof(int));</span><br
style="">
<span style=""> for(i=0;i<=11;i++){</span><br style="">
<span style=""> foo(A,i);</span><br style="">
<span style=""> }</span><br style="">
<span style=""> printf("%d\n",A[11]);</span><br style="">
<span style=""> return 0;</span><br style="">
<span style="">}</span><br style="">
<br style="">
<span style="">Timeline</span><br style="">
<span style="">--------</span><br style="">
<br style="">
<span style="">This is a three months project, and we would like
to schedule the time</span><br style="">
<span style="">in the following way (schedule given in weeks):</span><br
style="">
<br style="">
<span style="">[1-4]: Change the allocation scheme used by LLVM
to store the array</span><br style="">
<span style="">size next to the area that is allocated to the
array.</span><br style="">
<br style="">
<span style="">[5-6] Write the pass to insert access checks
before array accesses.</span><br style="">
<br style="">
<span style="">[7-10] Improve the performance of safe programs.</span><br
style="">
<span style="">- Avoid inserting bound-checks where already
exists a memory check.</span><br style="">
<span style="">- Avoid inserting bound-checks where the
"less-than" domain indicates</span><br style="">
<span style="">that the memory access is safe.</span><br
style="">
<br style="">
<span style="">[11-12] Final tests and report</span><br style="">
<span style="">- Correctly compile the whole LLVM test suite.</span><br
style="">
<span style="">- Generate performance numbers.</span><br
style="">
<span style="">- Write the final report.</span><br style="">
<br style="">
<br style="">
<span style="">My background</span><br style="">
<span style="">-------------</span><br style="">
<br style="">
<span style="">I am currently a first year master student at the
Federal University</span><br style="">
<span style="">of Minas Gerais, under orientation of professor
Fernando Pereira. I</span><br style="">
<span style="">have graduated in this school on the year of
2011. I've worked with</span><br style="">
<span style="">software development for five years, in different
projects. I built a</span><br style="">
<span style="">Data Warehouse viewer and Business Intelligence
graphics and gauges</span><br style="">
<span style="">using Delphi; Commercial programs in MS Visual
FoxPro; Improvements in</span><br style="">
<span style="">a logistics planning system, in Excel VBA. I have
five years of</span><br style="">
<span style="">experience with SQL. Now, I'm working with LLVM,
writing passes to</span><br style="">
<span style="">support scientifical researchers.</span><br
style="">
<br style="">
<br style="">
<span style="">Why I can do well in this project</span><br
style="">
<span style="">------------------------------</span><span
style="">---</span><br style="">
<br style="">
<span style="">I believe that I am a good candidate for this
Summer of Code for four</span><br style="">
<span style="">reasons. Firstly, I am a master student of
computer science in the</span><br style="">
<span style="">Federal</span><br style="">
<span style="">University of Minas Gerais (UFMG), which is one
of the best schools of</span><br style="">
<span style="">computer science in Brazil. To support this
claim, notice that UFMG</span><br style="">
<span style="">has the maximum grade in the ranking of CAPES
(Coordenação de</span><br style="">
<span style="">Aperfeiçoamento de Pessoal de Nivel Superior -
the Evaluation Body</span><br style="">
<span style="">under the Ministry of Education). I am working in
the compiler's lab,</span><br style="">
<span style="">under the advice of Fernando Pereira, who was a
Summer of Coder</span><br style="">
<span style="">himself in 2006. Second, I have good experience
with programming, with</span><br style="">
<span style="">almost six years of experience in many languages
(Java, C, C++, C#,</span><br style="">
<span style="">VB, VBA, Object Pascal, SQL, etc...). Third, I'm
already working with</span><br style="">
<span style="">LLVM passes. I already know some LLVM tools and I
already have the</span><br style="">
<span style="">basic knowledge to start this project. I have
written a pass to</span><br style="">
<span style="">instrument LLVM bitcodes, in order to find the
minimum and maximum</span><br style="">
<span style="">values that each variable assumes at runtime.
This code has been used</span><br style="">
<span style="">to check the precision of our range analysis. It
is publicly available</span><br style="">
<span style="">at (</span><a moz-do-not-send="true"
href="http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation">http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation</a><span
style="">). </span></div>
<div><span style="">Finally, in the lab where I work there are six
other people</span></div>
<div><span style="">who work everyday with LLVM; thus, in addition
to getting help in the</span><br style="">
<span style="">forum, I can easily talk about my project to my
colleagues.</span><br style="">
<br style="">
<span style="">References</span><br style="">
<span style="">----------</span><br style="">
<br style="">
<span style="">[1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala
and Rupak Majumdar.</span><br style="">
<span style="">Checking Memory Safety with Blast. FASE 2005,
LNCS 3442, pages 2-18,</span><br style="">
<span style="">Springer-Verlag, 2005</span><br style="">
<br style="">
<span style="">[2] ABCD: eliminating array bounds checks on
demand. Rastislav Bodik</span><br style="">
<span style="">and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM,
pages 321-333</span>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
LLVM Developers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a class="moz-txt-link-freetext" href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a>
<a class="moz-txt-link-freetext" href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a>
</pre>
</blockquote>
<br>
</body>
</html>