[cfe-dev] Purpose of GenericTaintChecker

Fri Jun 3 11:02:56 PDT 2016

> What I'm trying to achieve is to check if any tainted variables have
> been passed into sensitive functions.

The first "Aha!" here would be to realize that taint is not a property 
of a variable - it is a property of the value stored in it, and the 
analyzer's core engine allows you to easily work with values directly, 
without spending any effort to compute these values.

The analyzer denotes values which are not known during static analysis
(such as values coming from user input) with *symbols* and performs
algebraic operations on symbols. During program execution (or,
equivalently, during analysis, a.k.a. "symbolic execution"), those
symbols are passed around from one variable to another through
assignments and the like: for instance, after the declaration statement
"int a = b;" both variables 'a' and 'b' hold the same symbol. Results
of algebraic operations on tainted symbols are also considered to be
tainted, and so are symbols read through tainted pointers.
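
To make the "variables hold symbols" idea concrete, here is a minimal
toy sketch. It is not the analyzer's real data structures: ToyStore,
bindFresh() and assign() are hypothetical names invented purely for
illustration, and a "symbol" is reduced to an integer id.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model, not the analyzer's real classes: a "symbol" is just an
// integer id standing for a value that is unknown at analysis time.
using SymbolId = int;

// Hypothetical store mapping variable names to the symbol they hold.
struct ToyStore {
  std::map<std::string, SymbolId> bindings;
  SymbolId nextSymbol = 0;

  // Bind a variable to a brand-new symbol (e.g. an unknown input value).
  SymbolId bindFresh(const std::string &var) {
    SymbolId sym = nextSymbol++;
    bindings[var] = sym;
    return sym;
  }

  // Model "dst = src;": the destination receives the very same symbol.
  void assign(const std::string &dst, const std::string &src) {
    bindings[dst] = bindings.at(src);
  }
};

// Usage: after "int a = b;" both variables hold the same symbol.
void copyExample() {
  ToyStore store;
  store.bindFresh("b");
  store.assign("a", "b");
  assert(store.bindings.at("a") == store.bindings.at("b"));
}
```

Since taint is attached to the symbol rather than to 'a' or 'b', in this
model nothing extra has to happen on assignment for the taint to follow
the value.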

GenericTaintChecker, aka alpha.security.taint.TaintPropagation as it's
called in Checkers.td, subscribes to certain function call events -
such as, say, calls to getc(). The values such functions produce (the
return value for getc(); for scanf(), the values written through the
pointers passed as arguments) are denoted as symbols by the core.
GenericTaintChecker takes these symbols and marks them as tainted.

Then the analyzer core models how these symbols move around during
execution. No checker is responsible for that - it's done automagically.
The core doesn't, most of the time, care whether these symbols are
tainted or not - it simply models operations on them. No additional
effort is needed to mark results of algebraic operations on tainted
values as tainted - the taint of an algebraic symbolic expression can be
computed by simply looking at the expression and checking whether it
references any tainted symbols. The same happens to symbols loaded from
tainted pointers - *the hierarchy of symbols is designed to remember
each symbol's origins out of the box*, so it's easy to see if any
composite symbol comes from a tainted source.
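
A rough sketch of why no extra marking is needed for derived values:
if every composite symbol remembers what it was built from, taint can
be answered by walking up to its origins. The types below (ToySymbol,
TaintSet, makeSymbol) are hypothetical and only mimic the idea; they
are not the analyzer's actual symbol classes.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Toy model only: a symbol with an identity and the symbols it derives from.
struct ToySymbol {
  int id;
  std::vector<std::shared_ptr<ToySymbol>> parents;
};
using SymRef = std::shared_ptr<ToySymbol>;

// The only taint bookkeeping: ids of symbols explicitly marked as tainted.
using TaintSet = std::set<int>;

// A symbol is tainted if it is marked itself or derives from a marked one.
bool isTainted(const SymRef &sym, const TaintSet &tainted) {
  if (tainted.count(sym->id))
    return true;
  for (const SymRef &parent : sym->parents)
    if (isTainted(parent, tainted))
      return true;
  return false;
}

SymRef makeSymbol(int id, std::vector<SymRef> parents = {}) {
  return std::make_shared<ToySymbol>(ToySymbol{id, std::move(parents)});
}

void derivedTaintExample() {
  SymRef input = makeSymbol(0);                   // e.g. a value read from the user
  SymRef constant = makeSymbol(1);                // an unrelated value
  SymRef sum = makeSymbol(2, {input, constant});  // models "input + constant"
  TaintSet tainted = {input->id};                 // only the source is marked
  assert(isTainted(sum, tainted));                // found by walking to the origin
  assert(!isTainted(constant, tainted));
}
```

Note that only the source symbol ever enters the taint set; the derived
symbol is never marked, yet the query still finds its tainted origin.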

Whenever the core encounters calls to other functions which it doesn't
model (say, because their bodies aren't available), their return values
are not tainted even if the arguments of the call are tainted, because
otherwise we'd get a lot of false positives. So when we need to mark
return values of functions as tainted depending on the taintedness of
their arguments, GenericTaintChecker is responsible for modeling that.
This is the "taint propagation" thing. For instance, taint propagates
through strcat(), which theoretically allows us to catch SQL injections.
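
The split of responsibilities described above could be sketched like
this. It is a toy model under stated assumptions: ToyState,
evalOpaqueCall() and the rule table are hypothetical, "unknown_function"
is a made-up callee, and for brevity only the return value is modeled.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using SymbolId = int;

// Toy analysis state: the set of tainted symbols plus a symbol counter.
struct ToyState {
  std::set<SymbolId> tainted;
  SymbolId nextSymbol = 0;
  SymbolId freshSymbol() { return nextSymbol++; }
};

// Hypothetical propagation rule table: functions whose result becomes
// tainted when any of their arguments is tainted.
const std::set<std::string> kPropagatingFunctions = {"strcat"};

// Evaluate a call to a function whose body is not available.
SymbolId evalOpaqueCall(ToyState &state, const std::string &callee,
                        const std::vector<SymbolId> &args) {
  // "Core" part: the result is a fresh symbol, untainted by default.
  SymbolId result = state.freshSymbol();
  // "Checker" part: apply a propagation rule if one exists for the callee.
  if (kPropagatingFunctions.count(callee)) {
    for (SymbolId arg : args) {
      if (state.tainted.count(arg)) {
        state.tainted.insert(result);
        break;
      }
    }
  }
  return result;
}

void propagationExample() {
  ToyState state;
  SymbolId userInput = state.freshSymbol();
  state.tainted.insert(userInput);  // pretend a taint source marked it
  SymbolId viaStrcat = evalOpaqueCall(state, "strcat", {userInput});
  SymbolId viaOther = evalOpaqueCall(state, "unknown_function", {userInput});
  assert(state.tainted.count(viaStrcat) == 1);  // a rule exists: tainted
  assert(state.tainted.count(viaOther) == 0);   // no rule: stays untainted
}
```

Defaulting to "untainted" and opting functions in one by one is the
conservative choice here: it trades some missed flows for fewer false
positives.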

Finally, tainted symbols may reach sensitive functions. For example, a
tainted input string in a call to system() allows execution of arbitrary
code. This is the *third* kind of function to which GenericTaintChecker
subscribes - upon noticing tainted arguments passed to such functions,
it issues warnings.
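
The sink side can be sketched in the same toy style. Again, this is an
illustration only: checkCall(), the kSinks table and the message text
are hypothetical and are not how GenericTaintChecker is implemented.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using SymbolId = int;

// Hypothetical list of sensitive functions (sinks).
const std::set<std::string> kSinks = {"system"};

// Returns a warning message, or an empty string if nothing is reported.
std::string checkCall(const std::string &callee,
                      const std::vector<SymbolId> &args,
                      const std::set<SymbolId> &tainted) {
  if (!kSinks.count(callee))
    return "";  // not a sensitive function: nothing to check
  for (SymbolId arg : args)
    if (tainted.count(arg))
      return "tainted data passed to " + callee + "()";
  return "";  // sensitive function, but all arguments are clean
}

void sinkExample() {
  std::set<SymbolId> tainted = {7};  // symbol 7 came from a taint source
  assert(!checkCall("system", {7}, tainted).empty());  // reported
  assert(checkCall("system", {3}, tainted).empty());   // clean argument
  assert(checkCall("puts", {7}, tainted).empty());     // not a sink
}
```

The check only consults the set of tainted symbols, which is why it
makes no difference in this model which component marked them.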

If you want to extend this functionality by adding your own:
(1) Taint sources,
(2) Taint propagation rules,
(3) Warnings for tainted value usage,
then you can either extend the relevant section of GenericTaintChecker,
or write your own checker - it doesn't really matter, because taint 
information is visible to all checkers. It might be more comfortable to 
extend GenericTaintChecker because it allows some code re-use. If you 
write your own taint checker, you can either use it together with 
GenericTaintChecker (its work on taint sources and taint propagation may 
be of use) or disable GenericTaintChecker completely (say, if you don't 
want to see its warnings).