[cfe-dev] [analyzer] Looking up previously stored custom state for ParmVarDecls that have gone out of scope

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Thu May 3 19:30:03 PDT 2018



On 5/2/18 6:22 PM, Timothy J. Wood wrote:
>
>> On May 2, 2018, at 4:35 PM, Artem Dergachev <noqnoqneo at gmail.com> wrote:
>>
> […]
>> Yep, that's because the underscored _outError is dead as a variable, i.e. no longer used in the program, so we can forget about it. The checker is not prevented from reincarnating the variable (because currently checkers are omnipotent), but that's not the right thing to do. This isn't "renaming" - the underscored reg_$2 should have never been created and is essentially created artificially by the checker due to API misuse that we can't easily catch (though we're working on it).
> That makes sense, but when _outError goes out of scope, checkDeadSymbols wasn’t getting called (though I can see there is a node in the exploded graph for dead symbol processing on that path).
>

PreStmtPurgeDeadSymbols is essentially a garbage collection pass. It is 
not called every time something actually dies, but it is called often 
enough to (1) guarantee that all leaks are reported on pretty much the 
same line of code on which they actually occur, and (2) ensure that the 
analyzer keeps its state small enough, which is critical for performance 
(lookup speed).

Now, checkDeadSymbols may fail to be called at PreStmtPurgeDeadSymbols 
for exactly one reason: no symbols actually die. (That is, well, a bug, 
but nothing really relies on the callback firing in that case yet, so 
it's omitted for performance.)
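Concretely, a checkDeadSymbols implementation is usually just a sweep 
over the checker's own program-state trait. A minimal sketch (the 
checker name and the trait name are hypothetical):

```cpp
// Hypothetical checker state: symbols of NSError values we track.
REGISTER_SET_WITH_PROGRAMSTATE(TrackedErrorSymbols, SymbolRef)

void NSErrorContractChecker::checkDeadSymbols(SymbolReaper &SR,
                                              CheckerContext &C) const {
  ProgramStateRef State = C.getState();
  // Drop entries whose symbols the SymbolReaper has declared dead.
  // Per the caveat above, this callback may simply not fire when
  // nothing dies, so don't rely on it being called at a given point.
  for (SymbolRef Sym : State->get<TrackedErrorSymbols>())
    if (SR.isDead(Sym))
      State = State->remove<TrackedErrorSymbols>(Sym);
  C.addTransition(State);
}
```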

>> Now note that even though _outError is dead, *symbol* &SymRegion{reg_$1<NSError ** outError>} is alive, because it is still stored in another variable, namely the original non-underscored outError. Therefore the constraint on the symbol is also preserved.
>>
>> Additionally, argument values are removed from the Store (memory model) because parameter variables become dead, but they are still present in the Environment (syntactic expression values) because the call-expression is not yet fully evaluated, and therefore its sub-expressions that represent argument values are still considered to be important to keep.
>>
>> With that reasoning in mind, it is generally advised to use the Environment to obtain argument values, rather than the Store. The convenient way of doing that is by using the CallEvent API. A CallEvent object is readily available in checkPreCall and checkPostCall callbacks. If it's not available, you can easily construct it via CallEventManager::getCaller(). For obtaining the argument values you don't necessarily need to pass an old ProgramStateRef (the one that was current when we just entered the call) to getCaller(). The current ProgramStateRef will be just fine, because all we need from it is expression values, and expression values don't mutate.
>>
>> What you won't be able to do with the current program state is obtain the old value of **_outError. *_outError is in the Environment, but **_outError is in the old Store only.
>>
>> We've been thinking about adding a way of retrieving an old ProgramState that corresponds to the beginning of previous stack frames. This would have helped and we wanted it for similar reasons.
>>
>> ---
>>
>> I guess that was too much lecturing; I should focus on the task at hand.
> Heh — I’ll spend some time trying to understand the above. As you can tell, I’m still stumbling around bumping into things in the dark.

Well, you picked up a relatively advanced problem. The checker you're 
trying to make should be totally possible, but it's not something that 
has been done before.

>> Am I understanding correctly that you are trying to compare **_outError at the beginning of the function and at the end of the function and then emit the warning based on the return value?

Just to make sure I put the right number of asterisks:

(0) There's the lvalue of the _outError variable, namely &_outError.
(1) There's the argument rvalue of _outError, namely 
&SymRegion{reg_$1<NSError ** outError>}. We can retrieve it from the 
CallEvent at any time during the call.
(1') There's the ill-formed value &SymRegion{reg_$2<NSError ** 
_outError>} which we should really avoid producing.
(2) There's the old value of *outError/*_outError, which in our 
example will be something like &SymRegion{reg_$N<NSError * 
SymRegion{reg_$1<NSError ** outError>}>}. You can get that by 
dereferencing (1) at the beginning of the function. We're interested in 
whether the new value obtained by dereferencing (1) at the end of the 
function would be any different.
(3) There's the value that represents contents of the actual NSError 
object. It's opaque and immutable, and therefore not super interesting.

>
> The goal is to check adherence to the NSError contract more strongly. Given such a function or method (has a non-void return and a outError parameter), then:
>
> - If the return value is known to be non-zero, outError does not need to be filled.
> - If a return value is known to be zero, then:
> 	- If outError is known to be null, no *outError need be written
> 	- If outError is known to be non-null, *outError must be written to with a non-null value

Yup, so at the end of the path just check whether (1) can be 
State->assume()'d to be non-null. If not, skip the check: this path is fine.
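A minimal sketch of that check (the checker name and the error-argument 
index are assumptions, and in recent clang versions checkEndFunction 
also receives the ReturnStmt):

```cpp
// At the end of the analyzed function, test whether the outError
// argument -- value (1) above -- can be non-null on the current path.
void NSErrorContractChecker::checkEndFunction(const ReturnStmt *RS,
                                              CheckerContext &C) const {
  ProgramStateRef State = C.getState();
  // Re-derive the CallEvent; the current state is fine because
  // expression values don't mutate. Assumes a non-top stack frame.
  CallEventRef<> Call = C.getStateManager().getCallEventManager()
                            .getCaller(C.getStackFrame(), State);
  if (!Call)
    return;
  SVal Err = Call->getArgSVal(ErrArgIdx); // hypothetical argument index
  auto ErrDV = Err.getAs<DefinedOrUnknownSVal>();
  if (!ErrDV)
    return;
  // If outError cannot be non-null on this path, the function was
  // allowed to skip filling the error: nothing to check.
  if (!State->assume(*ErrDV, /*Assumption=*/true))
    return;
  // ... otherwise verify that *outError was actually written.
}
```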

> - A write to *outError can occur:
> 	- Directly via *outError = /* some expression guaranteed to be non-null */;
> 	- Indirectly by calling another error returning API, but only on the branch where the callee returned a zero result
>
> - In all cases, the original outError passed should be the one filled. That is, this is not valid:
>
> 	- (BOOL)bad:(NSError **)outError; {
> 		NSError *e;
> 		outError = &e;
> 		*outError = [NSError errorWithDomain…];
> 		return NO;
> 	}
>
> Some subtle bugs can be present in NSError returning code that I’d like to also warn about:
>
> - After a call to a error returning function or method that returns non-zero (success), the outError passed to the callee should be considered clobbered even if it was previously filled

Hmm, am I understanding correctly that you want to actually discriminate 
between different *ways* of writing into *outError? I think we have the 
following cases:

1. Direct assignment. We treat it as "job done right".
2. Direct assignment in an inlined error-returning call. We also treat 
it as "job done right".
3. Invalidation by passing outError into an error-returning call that we 
cannot inline.

Both 1. and 2. sound like we don't need to do anything - inlining 
magically does things for us... right?

In case of 3., we will have no knowledge about the return value of the 
function. Therefore we have no data to conclude that the function has 
actually filled *outError for us.

The checker is allowed to split the program state artificially into two 
paths: one path on which the return value is assumed to indicate an 
error and *outError is updated to contain that error, and another path 
on which the return value is assumed not to indicate an error and 
*outError is untouched. But I believe that such a state split is the 
wrong thing to do, because a state split doesn't simply express the 
potential possibility of two different situations: it expresses the 
*actual* possibility of two different situations. That is, by 
introducing a state split we declare that infeasibility of either of the 
states is an error on its own (e.g., when we split states on "if (x)", 
infeasibility of one of the branches means dead code). This is not the 
case for error-returning functions, because sometimes it is known that 
the function would not return an error (e.g., imagine a parseURL() method 
that parses a concrete string literal "https://xxx.yyy/" - it should 
always succeed). So I believe the state split should not be introduced, 
though it might be an interesting experiment.

And without performing state splits, it is impossible to decide whether 
*outError should be filled or empty after an external API call on the 
current path. So most likely we'd choose to treat *outError as initialized.
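For illustration only, here is roughly what the artificial split the 
text advises against would look like for a conservatively evaluated 
error-returning call. The zero-return-means-failure convention, the 
checker name, and markErrorFilled() are all assumptions:

```cpp
void NSErrorContractChecker::checkPostCall(const CallEvent &Call,
                                           CheckerContext &C) const {
  auto Ret = Call.getReturnValue().getAs<DefinedOrUnknownSVal>();
  if (!Ret)
    return;
  // assume() without an assumption argument returns both worlds at once.
  ProgramStateRef StSucceeded, StFailed;
  std::tie(StSucceeded, StFailed) = C.getState()->assume(*Ret);
  if (StFailed)    // zero return: declare *outError filled
    C.addTransition(markErrorFilled(StFailed)); // hypothetical helper
  if (StSucceeded) // non-zero return: *outError untouched
    C.addTransition(StSucceeded);
}
```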

If I understood the contract you're trying to check correctly, in the 
end it should be irrelevant how exactly *outError is initialized. It 
only matters that the value within it has changed.

---

Hmm, well, actually, in case 3. it may be useful to store the return 
symbol of the conservatively evaluated call and see whether the program 
later checks if that symbol indicates an error. Once the program itself 
introduces the state split, we can make a deferred decision on whether 
*outError was initialized by the call or not. If the symbol dies before 
such a state split is encountered, we're back where we started; otherwise 
we can extract the info we wanted. This trick is currently used by the 
alpha PthreadLockChecker to figure out if a pthread_mutex_destroy() call 
actually destroys the mutex. It tracks "Schrödinger mutexes" that are 
both destroyed and alive simultaneously until a state split is performed 
over the return value of their respective pthread_mutex_destroy() - see 
https://reviews.llvm.org/D32449
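The deferred-decision trick can be sketched via the eval::Assume 
callback. The map name, the helper, and the zero-means-failure 
convention are assumptions:

```cpp
// Hypothetical map: return symbols of conservatively evaluated calls,
// mapped to the outError region they may have filled.
REGISTER_MAP_WITH_PROGRAMSTATE(PendingErrorWrites, SymbolRef,
                               const MemRegion *)

ProgramStateRef NSErrorContractChecker::evalAssume(ProgramStateRef State,
                                                   SVal Cond,
                                                   bool Assumption) const {
  SymbolRef Sym = Cond.getAsSymbol();
  if (!Sym)
    return State;
  const MemRegion *const *ErrRegion = State->get<PendingErrorWrites>(Sym);
  if (!ErrRegion)
    return State;
  // The program itself has now split over the call's return value.
  // With zero meaning failure, !Assumption is the failure branch,
  // on which the callee must have filled *outError.
  if (!Assumption)
    State = markErrorFilled(State, *ErrRegion); // hypothetical helper
  return State->remove<PendingErrorWrites>(Sym);
}
```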

Another thing about state splits is that every state split doubles the 
remaining work: we need to re-analyze all remaining paths twice. That's 
one more reason (apart from correctness concerns) to try to reduce them.


> - Reading *outError before it is known to contain a valid value should be an error

This should be easy because we know what value it contained initially.

> - A call to a nullable receiver that happens to be nil might accidentally return zero status w/o really writing to *outError

I think that's exactly how we would model a message to a nil receiver. 
It shouldn't look like it's changing *outError in any way, so I hope 
it'll work automagically.

> To do all this I need to track whether outError arguments have been checked to be null, whether *outError has been written to, and be able to check those things at the point of return. In some case I can’t know the exact return value:
>
> 	- (BOOL)foo:(NSError **)outError; {
> 		return [self bar:outError];
> 	}
>
> but the state splits make this work out nicely; on the side where -bar: returns zero, I can assume *outError was written if outError isn’t null, and on the side were -bar: returns non-zero, we are returning non-zero and don’t need a write to *outError.
>
> Along with all this, not only do I need to track the nullness of outErrors and *outErrors, but of local NSError variables too. A common pattern is:
>
> 	- (BOOL)foo:(NSError **)outError; {
>
> 		NSError *localError;
> 		if (![self bar:&localError]) {

The analyzer knows that &localError is non-null, simply because it's a 
concrete variable address. Not sure if that was the question.

> 			// maybe look at localError’s domain and code and try something else
>
> 			// … but then eventually fail
> 			if (outError) {
> 				*outError = localError;
> 			}
> 			return NO;
> 		}
> 	}
>
> These are the “simple” cases, but there are some cases that will probably be infeasible w/o making assumptions about specific API in Foundation (asynchronous operations writing to __block variables, NSFileCoordinator taking a outError w/o a BOOL result, and probably others). It will probably be more reasonable to skip those.

For a lot of APIs that are "always known to possibly fail", state splits 
might be a good thing to try. And, well, yeah, some may require more 
specific modeling.

>> I think you should do a:
>>
>>    REGISTER_MAP_WITH_PROGRAMSTATE(InitialErrorValue, const StackFrameContext *, SVal)
>>
>> And put the original SVal for **_outError into it. Then at the end of the function see if the value has changed, by retrieving *_outError via
>>
>>    CheckerContext.getStateManager().getCallEventManager()
>>    .getCaller(CheckerContext.getCurrentStackFrame(), CheckerContext.getState())
>>    .getArgSVal(error argument number)
>>
>> and dereferencing it.
> I was somehow under the assumption that SVals were transient and not to be stored, but if that’s a faulty assumption it could be a useful tool.

Storing SVals is not problematic at all, but in most cases it wasn't the 
thing that anybody actually wanted to do. In most cases checkers are 
interested in specifically symbols or specifically regions.

Your checker is not like other checkers, so my intuition isn't 
immediately working here. If you can prove that you'll always be 
interested in symbols only, you should store symbols. If you suspect 
that you'll need to store a concrete value on a certain event, SVals are 
the way to go. They are simple value types, and I think they should have 
all (or almost all) the boilerplate necessary to use them as immutable 
map values.
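Putting the map suggestion and the CallEvent retrieval together, a 
sketch of the entry-side bookkeeping might look like this (checker 
name, map name, and ErrArgIdx are all hypothetical):

```cpp
// Per-frame storage of the initial pointee of outError -- value (2)
// in the numbering above.
REGISTER_MAP_WITH_PROGRAMSTATE(InitialErrorValue, const StackFrameContext *,
                               SVal)

void NSErrorContractChecker::checkBeginFunction(CheckerContext &C) const {
  ProgramStateRef State = C.getState();
  const StackFrameContext *SFC = C.getStackFrame();
  CallEventRef<> Call =
      C.getStateManager().getCallEventManager().getCaller(SFC, State);
  if (!Call)
    return; // e.g. top frame: no caller to derive arguments from
  // Dereference the outError argument once to get the stored NSError*.
  auto ErrLoc = Call->getArgSVal(ErrArgIdx).getAs<Loc>();
  if (!ErrLoc)
    return;
  SVal Initial = State->getSVal(*ErrLoc);
  C.addTransition(State->set<InitialErrorValue>(SFC, Initial));
}

// Later, in checkEndFunction, dereference the same argument again and
// compare the result against State->get<InitialErrorValue>(SFC); if the
// two SVals are equal, *outError was never changed on this path.
```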

>
> Thanks again!
>
> -tim
>



