[cfe-dev] [analyzer] Problem tracking taint applied to regions

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Tue Mar 20 15:05:06 PDT 2018


Whoops +cfe-dev.

On 3/20/18 3:01 PM, Artem Dergachev wrote:
> Hello Julian,
>
> You seem to have made your way very far on your own, but it doesn't 
> look to me like you're on the right track.
>
> On 3/20/18 2:34 AM, Julian Ganz via cfe-dev wrote:
>> Hello cfe-dev,
>>
>> I am currently trying to write a custom checker for the static analyzer.
>>
>> The idea is to perform taint analysis on C++ code with sources, sinks 
>> and filters being provided by configuration. A checker which does 
>> something similar exist already,e.g. [1]. However, in our use-case, 
>> we want to mark data/values (rather than functions) as sources and 
>> sinks.
>> Consider for example a piece of security-relevant configuration data 
>> which must not be writable by an potential attacker. For now, I use 
>> the following example for testing:
>>
>> | class Foo {
>> | public: int x; // Foo::x marked as tainted (e.g. source)
>> | };
>> | static int y; // Marked as sink
>> |
>> | int main() {
>> |   Foo f;
>> |   y = f.x;
>> |   return y;
>> | }
>>
>> Now, clang already provides some infrastructure for taint analysis, 
>> e.g. `ProgramState::addTaint()` and `ProgramState::isTainted()`.The 
>> internal taint map used in the store tracks symbolic expressions. I 
>> decided to use this existing infrastructure for obvious reasons, e.g. 
>> benefitting from existing and future taint propagation 
>> infrastructure. However, it bit me and now I'm stuck.
>>
>> The taint is introduced in a `checkLocation()` implementation. The 
>> first thing I do is unwrapping a `MemberExpr` (e.g. removing casts) 
>> and decide whether to taint it or not based on the referred 
>> declaration. This far, it works as intended. However, it appears that 
>> the `SVal loc` provided as a parameter to this handler does not refer 
>> to a `SymExpr`, neither does `loc.getAsRegion()` return a 
>> `SymbolicRegion`. In order to taint the piece of data, I create my 
>> own symbolic expression as well as an associated `SVal` and bind it 
>> to the member expression `S`:
>> | // The taint map can only be used for tracking symbolic expressions.
>> | auto symExpr = loc.getAsSymExpr();
>> |
>> | if (!symExpr) {
>> |   // The current `loc` is not suitable for carrying taint. 
>> Construct a new one.
>> |   if (const auto region = 
>> dyn_cast_or_null<TypedValueRegion>(loc.getAsRegion())) {
>> |     if (symExpr = C.getSymbolManager().getRegionValueSymbol(region)) {
>
> You should not create your own symbols in the checker. At least, not 
> until you 100% know what you are doing.
>
> It's like the analyzer says "i know that there are two sheep" and you 
> later say "let $x denote the number of sheep" - you should just take 
> the value that's already there instead. Just worse because you use a 
> specific notation (SymbolRegionValue) rather than a nameless "x", 
> which matters (unlike school algebra).
>
> There is already a correct SVal (which is not necessarily a symbol) in 
> the program state that denotes the value of the region; you should 
> retrieve this value by using the State->getSVal(Region, QualType) API. 
> If there isn't such value yet, it will be denoted with a symbol 
> automatically and presented to you.
>
>> |       loc = C.getSValBuilder().makeLoc(symExpr);
>
> The symbol you've created is an integer. It's not a loc-type symbol, 
> i.e. not a pointer or a reference or an lvalue thing. This operation 
> should ideally fail.
>
>> |       state = state->BindExpr(S, C.getLocationContext(), loc);
>
> You are changing the value of the syntactic expression (in the 
> "Environment"), but not changing the value in the symbolic memory 
> model (the "Store"). Well, ideally, the checker should do neither.
>
> Replacing values of expressions in the Environment is a catastrophic 
> thing to do - they're essentially immutable by design (though 
> unfortunately we have a few places in the analyzer where we still do 
> that). Environment is "spacial", not "temporal": it captures a single 
> moment in time and explores values of expressions in that moment of 
> time, and these values are not allowed to change because any change 
> assumes some sort of flow of time.
>
> Replacing the value stored in a region in the Store is an equivalent 
> of saying "the program has modified contents of memory at this point", 
> which is also not what you're looking for, and it's risky to do unless 
> you take full responsibility of modeling the statement (for now 
> checkers can only take such responsibility for call expressions and 
> for branch conditions); otherwise checkers that subscribed on the 
> callback before you would see the old value, and checkers that go 
> later would see the new value.
>
>> |     }
>> |   }
>> | }
>> |
>> | // Add taint if possible
>> | if (symExpr) {
>> |   state = state->addTaint(symExpr);
>> |   C.addTransition(state);
>> | }
>>
>> Invoking `dumpTaint()` on ` state` shows that the expression did 
>> indeed make it into the taint map, `state->isTainted(S, 
>> C.getLocationContext())` does return `true` and `state->dump()` 
>> reveals that `f.x` is now bound to a symbolic expression looking just 
>> like I'd expect (`f.x : &SymRegion{reg_$1<int f->x>}SVal `).
>
> "SymRegion{$x}" means "an unidentified segment of memory that starts 
> at address $x". That's definitely not what you want because 
> "reg_$1<int f->x>" is an unknown value of type "int" that can't denote 
> any reasonable memory address.
>
> I added an assertion recently that would have informed you of this 
> problem if you used a more recent clang.
>
> Note that symbols, much like expressions, always have a type which is 
> a valid QualType from the AST of the program.
>
>> Btw: I did not use a `check::PreStmt` specialized to a `MemberExpr` 
>> because I do not (yet) understand the exact semantics of the ` SVal 
>> `s and `Loc`s which I can get from the state (partly because of the 
>> utter lack of higher-level documentation on these things). They 
>> appear to be different: the code above does fail in a 
>> `checkPreStmt()`, one appears to refer to the location where the 
>> value is originating while some other appears to refer to something 
>> completely different although the method's name suggests its just the 
>> same.
>>
>> Using the information in subsequent post-statement checks is where 
>> I'm stuck. First, I noted that the taint is not propagated anywhere. 
>> I assumed that simple assignments were, for whatever reason, not yet 
>> considered (I also did not find anything in the generic taint 
>> checker), so I started implementing a 
>> `check::PostStmt<BinaryOperator>` checker. Since `isTainted()` on the 
>> statement behaved as expected on the last handler, I assumed the 
>> taint detection to be somewhat easier. Sure, I have to strip both the 
>> LHS and RHS of casts and paranthesis, but detecting taint should be 
>> easy, I thought. Turns out `isTainted()` never returns true.
>>
>> Now, dumping both the taint and the overall state using 
>> `ProgramState::dump()` and `ProgramState::dumpTaint()` reveals that 
>> the symbolic expression which was previously added to the taint map 
>> is still tainted, but the binding of this expression to `f.x` is 
>> gnone (`f.x: Undefined`).
>
> "Undefined" is not "gone"; it's an indication that something is indeed 
> undefined, as in, the analyzer has identified some undefined behavior 
> that definitely occurs in the program under analysis on the current 
> execution path which leads to something (in our case, a value of an 
> expression) being potentially equal to any arbitrary value.
>
> The lack of specific value would have been represented as "Unknown", 
> not "Undefined". Or, in case of Environment, the binding would have 
> disappear completely.
>
> In your example f.x is indeed undefined because the class has no 
> constructor and therefore the local variable will have uninitialized 
> contents. Undefined values are used by the analyzer for finding 
> related bugs - for instance you would have seen the bug in your 
> example if you ran the vanilla (i.e. unmodified) analyzer on it:
>
> $ clang --analyze test.cpp
> test.cpp:8:5: warning: Assigned value is garbage or undefined
>   y = f.x;
>     ^ ~~~
> 1 warning generated.
>
> So the UndefinedVal here is the "correct value" i've been talking 
> about in the beginning. You'd have obtained it by calling getSVal(). 
> The analyzer can prove to you that the field has undefined contents, 
> just by looking at your code. Taint analysis is not necessary in this 
> case: you already know that the value is completely undefined. Which 
> is why it's not even a symbol: it is pointless to denote an undefined 
> value with a symbol because any code that deals with such value would 
> be buggy and we need to terminate the analysis and emit a warning as 
> soon as the value is used at all.
>
> So you'd need a better test case to work with.
>
> Generally the whole idea of "certain fields are the source of 
> taint"doesn't sound good to me . In C++ fields are not properties; if 
> you put a value into them and don't touch the object otherwise, you're 
> guaranteed to read back the same value. It means that fields are not a 
> good source of taint: they merely store whatever you put into them. 
> You should not report taint if someone writes 0 to the field and then 
> reads it.
>
> I'd strongly suggest considering identifying sources of the values 
> stored in these fields that cause these fields to contain tainted 
> values - instead of blaming fields themselves. For example:
>
> - A function returned an object of type Foo. Upon return, immediately 
> mark the symbolic value of its field x as symbolic. Don't wait until 
> the read because it may be preceded with a write.
>
> - A method of Foo sets the object's field x to a tainted value. Upon 
> method call, immediately mark the symbolic value this object's field x 
> as tainted.
>
> - A function is known to touch a specific global object of type Foo 
> and set its field x to a tainted value. Upon call of that function, 
> immediately mark the symbolic value of this object's field x as tainted.
>
> This is the intended/supported workflow, and i strongly suspect that 
> the field-based approach wouldn't work well (i.e. will have 
> unavoidable false positives).
>
>
>> Does anybody on this list have an idea or an explanation why the 
>> binding vanishes? Or can naybody point me to move exhaustive 
>> resources on the excact semantics of the different kinds of `SVal`s 
>> turning up in different contexts? Help would be much appreciated.
>>
>> Btw: I'm stuck with LLVM/clang 4.0 provided by the package manager at 
>> the moment. Self-compiled clang 6.0 refuses to load modules (cannot 
>> resolve the `CheckerBase` symbol for some reason) although I ran 
>> `cmake` with `-DCLANG_PLUGIN_SUPPORT=ON`, 
>> `-DLLVM_BUILD_LLVM_DYLIB=ON` and `-DLLVM_ENABLE_MODULES=ON`.
>>
>> Greetings and thanks in advance,
>> Julian
>>
>> [1] https://github.com/franchiotta/taintchecker
>>
>> PS: Sorry for the missing line-wrapping. I did not yet manage to make 
>> Outlook behave like a decent eMail-Client
>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>




More information about the cfe-dev mailing list