[PATCH] D69813: [analyzer] CERTStrChecker: Model gets()

Wed Nov 13 19:51:01 PST 2019

Charusso marked an inline comment as done.
Charusso added inline comments.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:201-207
+void CERTStrChecker::evalGets(const CallEvent &Call, const CallContext &CallC,
+                              CheckerContext &C) const {
+  unsigned DestPos = *CallC.DestinationPos;
+  const Expr *DestArg = Call.getArgExpr(DestPos)->IgnoreImpCasts();
+  SVal DestV = Call.getArgSVal(DestPos);
+
+  auto Report = getReport(*BT, Call, CallC, C);
----------------
NoQ wrote:
> Charusso wrote:
> > NoQ wrote:
> > > Charusso wrote:
> > > > Charusso wrote:
> > > > > NoQ wrote:
> > > > > > NoQ wrote:
> > > > > > > All right, so basically what you're saying is that literally every invocation of `gets()` deserves a warning. This means that for all practical purposes your checker //is// an AST-based checker, just implemented with path-sensitive callbacks. A path-sensitive checker emits warnings based on multiple events that happen sequentially along the path (use-after-free: "memory deallocated - that same memory used", division by zero: "value constrained to zero - something is being divided by that same value", etc.) but your checker emits the warning by looking at only one statement: "`gets()` is invoked".
> > > > > > > 
> > > > > > > Do i understand correctly that your plan is to use path-sensitive analysis for fixits only? But you can't emit fixits for any truly path-sensitive warning anyway. Fixits must work correctly on all execution paths, so you cannot emit a correct fixit by looking at only one execution path. In order to emit fixits, you need to either resort to a pure AST check anyway ("this expression refers to an array of fixed size"), or maybe implement auxiliary data-flow analysis for a certain must-problem (eg., "the buffer argument may have exactly one possible value across all paths that reach `gets()`"). But in both cases the path-sensitive engine does literally nothing to help you; all the data that you'll need for your fixit will be available from the AST.
> > > > > > Like, i think this was an interesting investigation and i was genuinely curious about how this turns out to be, but for now it seems that the problem you're trying to solve cannot be solved this way. Path sensitive analysis is fundamentally applicable to only 50% of the problems (to "may-problems" but not to "must-problems"), and the problem you're trying to solve is in the latter category. I believe you'll have to fall back to the relatively boring task of adding fixits to `security.insecureAPI.gets`; but then, again, if you manage to employ use-def chains for this problem, that might be quite an inspiring start.
> > > > > Most of the time the given allocation to hold the arbitrary string happens in a local scope. After that we see `fscanf(dst)`, `gets(dst)`, `memcpy(dst)`, `strncpy(dst)`, stuff... which pushes new data into that memory block, and then the cool developers write that down: `dst[42] = '\0'` which means all the reports should be thrown away in a path-sensitive manner on `dst`. Reallocations, re-bindings, non-AST stuff could handled very easily with the non-AST checker, like that one.
> > > > > 
> > > > > Sometimes we are work with destination-array like `memcpy(Foo[Bar.Baz]->Qux, ...)` which could not really handled with just a simple AST-based checker. I could not say at the moment we could handle it with symbols, but we have a much larger scope of information by symbols.
> > > > > 
> > > > > Most of the time because of the Analyzer is much smarter than the Tidy we could emit fix-its with the help of flow-sensitiveness very easily. I would create huge white-lists what we want to fix-it, and what we could not, but at some point if we model the symbols better, we can.
> > > > > 
> > > > > Other than that easy false-positive suppression and tiny flow-sensitive rebinding stuff, we could be sure what is going on by each string-manipulation. The `gets()` is a toy example where at most a `grep -rn 'gets('` could do better analysis than us.
> > > > > 
> > > > > The real world looks like that:
> > > > > ```
> > > > > 1	encryptedpasswordlen = ((strlen(passwd) + RADIUS_VECTOR_LENGTH - 1)
> > > > >  	/ RADIUS_VECTOR_LENGTH) * RADIUS_VECTOR_LENGTH;
> > > > > 2	cryptvector = palloc(strlen(secret) + RADIUS_VECTOR_LENGTH);
> > > > > 3	memcpy(cryptvector, secret, strlen(secret));
> > > > > ...
> > > > > 4	for (i = 0; i < encryptedpasswordlen; i += RADIUS_VECTOR_LENGTH) {
> > > > > 5	  memcpy(cryptvector + strlen(secret), md5trailer, RADIUS_VECTOR_LENGTH);
> > > > > ...
> > > > > ```
> > > > > - from `postgresql/src/backend/libpq/auth.c`
> > > > > 
> > > > > At `3` we would emit a warning, because the null-termination left by the wrong size of the string, but at `5` we see that, it left, because at that offset the string continues, and dunno, on `6` when we model every flow-sensitive information, the string left non-terminated.
> > > > > 
> > > > > Of course each of that stuff is local and AST-based (with huge overhead of rebindings and impossible false-positive suppression), but when you have two of it, that is when the fun begins.
> > > > > if you manage to employ use-def chains for this problem, that might be quite an inspiring start.
> > > > 
> > > > We have regions so we do not need to rely on such chains in the AST-world, if I get your idea right by the "Use-define chain" wiki [1]. Btw. it is not that difficult problem in the AST-world, you need to create a recursive AST-matcher on the `DeclRefExpr` with `std::function`.
> > > > 
> > > > Basically, I want to implement all the STR rules in a logical order, here is one of the examples from STR32-C [2] which is my last planned project at the moment:
> > > > 
> > > > ```
> > > > void lessen_memory_usage(void) {
> > > >   wchar_t *temp;
> > > >   size_t temp_size;
> > > >  
> > > >   /* ... */
> > > >  
> > > >   if (cur_msg != NULL) {
> > > >     temp_size = cur_msg_size / 2 + 1;
> > > >     temp = realloc(cur_msg, temp_size * sizeof(wchar_t));
> > > >     /* temp &and cur_msg may no longer be null-terminated */
> > > >     if (temp == NULL) {
> > > >       /* Handle error */
> > > >     }
> > > >  
> > > >     cur_msg = temp;
> > > >     cur_msg_size = temp_size;
> > > >     cur_msg_len = wcslen(cur_msg);
> > > >   }
> > > > }
> > > > ```
> > > > 
> > > > They really want to represent the wild, and please think of that problem in terms of an AST-checker versus in terms of `getAsRegion` and `getDynamicElementCount` to compare the size of the allocated memory block and inject that: `cur_msg[temp_size - 1] = L'\0';` because the array would overflow. How cool is the Analyzer and how smart to do so.
> > > > 
> > > > It would took at most 10 minutes to implement if the `evalBinOp` would work or the main 10 years old implementation of obtaining the element-count would work. I am on the way to fixing the latter, but it will be more path-sensitive info, than you could imagine, like reusing the "zombie" size-expression turned out to be a hard problem. And it will be a lot easier to solve such problems, I believe.
> > > > 
> > > > The local scope is the key, and that checker at the moment only tries to rewrite destination-arrays which are local. I think we could see if we emit multiple reports on a given call so due to ambiguity we would drop such fix-its. With an AST checker I could not imagine how difficult it would be.
> > > > 
> > > > It is rather a research at the moment, because I have encountered dozens of silly stuff, beginning with the `getExtent()`, so I cannot say this direction is the 100% future, but I have picked the Analyzer over Tidy, for that reason, it is smarter.
> > > > 
> > > > [1] https://en.wikipedia.org/wiki/Use-define_chain
> > > > [2] https://wiki.sei.cmu.edu/confluence/display/c/STR32-C.+Do+not+pass+a+non-null-terminated+character+sequence+to+a+library+function+that+expects+a+string
> > > > It would took at most 10 minutes to implement if the `evalBinOp` would work or the main 10 years old implementation of obtaining the element-count would work.
> > > 
> > > You have multiple different checkers that you want to implement and for each of them there are two parts of the problem:
> > > (1) Emit the warning,
> > > (2) Emit a fixit for that warning.
> > > 
> > > Path-sensitive analysis is perfect for (1), at least for some checkers (not for the one in this patch).
> > > 
> > > But if you try to rely on path-sensitive analysis for (2), your result will simply be incorrect for the reason stated above: there may be other execution paths that you haven't taken into account. And when you say
> > > 
> > > > I would create huge white-lists what we want to fix-it
> > > 
> > > , this literally means repeating a lot of the work you did in D45050, as you have to write down matchers (or a CFG-based data flow analysis) for the cases that you really can fix. At this point you will not only be aware of the allocation site from which you can extract the size-expression, but also of all other possible allocation sites.
> > > 
> > > So i honestly believe that you should drop the idea of using path sensitive analysis to help you with fixits, and instead focus on the two more-or-less-independent tasks of (a) developing path-sensitive checkers for the CERT problems you're interested in and (b) developing fixits for such checkers in a syntactic manner without relying on the information obtained via the path-sensitive analysis.
> > >> It would took at most 10 minutes to implement if the evalBinOp would work or the main 10 years old implementation of obtaining the element-count would work.
> > > 
> > > You have multiple different checkers that you want to implement and for each of them there are two parts of the problem:
> > > (1) Emit the warning,
> > > (2) Emit a fixit for that warning.
> > This string-handling misuse of the language C consists of thousands of entries across random open source projects, that is why I would like to suggest fix-its. With the help of path-sensitive stuff I would like prove the fix-its are being fine.
> > 
> > > Path-sensitive analysis is perfect for (1), at least for some checkers (not for the one in this patch).
> > > 
> > > But if you try to rely on path-sensitive analysis for (2), your result will simply be incorrect for the reason stated above: there may be other execution paths that you haven't taken into account.
> > I know that the path-sensitive analysis means that: "There is a path". But I also know that there is one single function body where the `(allocation, string manipulation, return data)` triplet takes place. Here we should take every of the execution paths because we are in a local scope. We could be 100% sure when the obtained size-expression's region is not safe to reuse. I do not think about function-boundaries because the peoples write code like that triplet. Sometimes the size-expression is coming from a function-call, but that should be fine to obtain.
> > 
> > > And when you say
> > > 
> > >> I would create huge white-lists what we want to fix-it
> > > 
> > > , this literally means repeating a lot of the work you did in D45050, as you have to write down matchers (or a CFG-based data flow analysis) for the cases that you really can fix. At this point you will not only be aware of the allocation site from which you can extract the size-expression, but also of all other possible allocation sites.
> > There is one allocation site 99,99% of the time, where the 0,01% is your counter example in this review. Of course there must be such case in the wild, somewhere, deep in the woods, and of course the most error-prone case is the 0,01%, but the other 99,99% has the same issue, and we could provide fix-its easily if we do not focus mostly on the 0,01%. I want to make my stuff non-alpha, even none of the non-alpha checkers or not the Tidy can provide 100% accuracy, that stuff with fix-its should be 100% accurate. So in case of the 0,01% we detect it and we do not fix-it, that is it. And we are hoping someone creates summary-based analysis, so that we can rewrite the 0,01% as well.
> > 
> > > So i honestly believe that you should drop the idea of using path sensitive analysis to help you with fixits, and instead focus on the two more-or-less-independent tasks of (a) developing path-sensitive checkers for the CERT problems you're interested in and (b) developing fixits for such checkers in a syntactic manner without relying on the information obtained via the path-sensitive analysis.
> > 
> > Plot twist: nor the AST-checkers and nor the path-sensitive-checkers could solve this issue. You cannot state out that my approach is bad, so I should change it. I cannot state out my approach is good, so we should go for it.
> > 
> > I believe in that when I dropped my Tidy-career then I have picked the right tool which I can improve to create 100% accuracy and model the impossible-to-model stuffs, like that STR rules. Of course, in a path-sensitive manner, to drop every heuristic, to model everything which necessary with offsets and allocations, and for false positive suppression. The key here, to modify the Analyzer from that statement "There is a path" to "There is only a path", according to the use-case of the string manipulation being a single path.
> > There is one allocation site 99,99% of the time, where the 0,01% is your counter example in this review.
> 
> It might be 0,01% for this checker (and if so, a simple AST-based solution will work equally well), but for other checkers there will be a lot more problems: like, how do you null-terminate a string if the size of the string depends on the execution path?
> 
> > The key here, to modify the Analyzer from that statement "There is a path" to "There is only a path", according to the use-case of the string manipulation being a single path.
> 
> This is a possible approach to one of our [[ http://clang-analyzer.llvm.org/open_projects.html | open projects ]], "Implement a dataflow flamework", which would allow us to solve must-problems as well as may-problems. Like, it sounds easy: if we simply were able to figure out whether the analyzer has explored all paths or not in the current analysis, we will be able to solve may-problems in all cases in which the analyzer has actually explored all paths, by post-processing the `ExplodedGraph`. I've been advocating for this approach for some time in the past, but this approach is clearly a dead end because in most of the interesting cases (eg., in presence of any sort of loops) it'll reply "i don't know" ("we were clearly unable to explore all paths").
> 
> The more principled solution is to make an actual data flow framework in the usual sense (i.e., an API that would allow us to write "must-problem" checks in the same way as we currently write path-sensitive checks but with the extra join operation). I heard that @gribozavr has some plans in this area.
> how do you null-terminate a string if the size of the string depends on the execution path?

Here is your example a little-bit modified:
```
char *x = 0;
char y[10];

if (coin()) {
  x = malloc(20);
} else {
  x = y;
}

char src[] = "Foo";
strcpy(x, src);
```

Because the source string is only available after the allocation we cannot make sure the allocation could hold the entire source object which appropriate allocation would be: `malloc(strlen(src) + 1)` or something similar. Also the `free()` is another hard problem immediately.

It is that 0,01% case which you do not encounter in the wild. The usual approach: having a memory block - fill it - return it. But the Static Analyzer made for that purpose, so it can detect such path-sensitive information and prevent to emit a fix-it.

-------------

> "Implement a dataflow flamework"

That is a cool idea, but for the first time I want to think about local scopes. Within a single function body, that is the scope I can safely measure, I believe. If it turns out to be a good idea, we could improve on it step-by-step. I really wanted to measure my stuff, but every time when I have finished I had to rewrite my stuff near from scratch, so I cannot say this checker is somewhat special to rewrite half of the Analyzer. May it will be.

-------------

I have learnt nothing from Tidy, but that: prove that the fix-it is valid, if it is, emit, otherwise throw away. The Tidy cannot measure such information, and that is where the flow-sensitive stuff kicks in.

If our engine would be perfect, with dataflow-ctu-summary-dunnowhat analysis collaborating together may we would consider to rewrite such crazy difficult to rewrite code. I think we are that far away of that level, we have to avoid to think of that kind of fix-its. But the 99,99% totally worth and easy to measure with the help of dynamic size information, like I have implemented mostly of the STR31-C and STR32-C examples based on that idea with fix-its.

I cannot emphasize enough well how simple is the way the string is manipulated and how easy it should be modeled with memory block regions or the dynamic "used size" of the memory block, and stuff like that. It is just a sequence of data-flow, from one given allocation to one given return, with the craziest offsets on the destination-array and with the craziest custom allocators.

If you would like to cover more, we are not interested in multiple-buffers, but rather one single buffer with a custom allocator where the size expression should be easy to obtain. I have taken that heuristic and it worked out quite well. Like the Git's allocator injects the null terminator, they say, in a comment, safely - when they pass a wrong size to it: `strlen(src)`. So it would find possible false positives, but the commenting is not really the way we null-terminate.

-------------

This is one of the most needed checkers, so I understand why you want to make it as precise as possible with the largest set of possibilities, but incremental development, eh? Peoples are lame and cannot count so they write `memcpy(dest, "Foobar", 6)`, where the next statements are telling us if the programmer was lame, or not (and on the previous line the `memcpy()` has an appropriate size expression). Please think about that kind of vulnerability when there is a Windows update every week. That was my base case when I started this project 2 years ago.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69813/new/

https://reviews.llvm.org/D69813