[PATCH] D69813: [analyzer][WIP] CERTStrChecker: Model gets()

Tue Nov 5 07:05:37 PST 2019

Charusso added inline comments.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:124
+  if (const SymbolicRegion *SR = DestMR->getSymbolicBase())
+    if (const Expr *SizeExpr = getDynamicSizeExpr(C.getState(), SR))
+      return exprToStr(SizeExpr, C);
----------------
NoQ wrote:
> Charusso wrote:
> > NoQ wrote:
> > > Charusso wrote:
> > > > Charusso wrote:
> > > > > NoQ wrote:
> > > > > > Again, you will have to highlight the allocation site with a note. Therefore you will have to write a bug visitor that traverses the size expression at some point (or, equivalently, a note tag when the size expression is evaluated). Therefore you don't need to store the expression in the program state.
> > > > > Yes, you have pointed out the necessary visitor, but it needs more thinking.
> > > > > 
> > > > > I have a memory region which could be any kind of "memory block region" therefore I have no idea where is the size expression. We are supporting ~20 different allocations, which is nothing compared to the wild with the not so uncommon 5+ parameter allocators. Therefore I still do not want to reverse engineer a small MallocChecker + ExprEngine + BuiltinFunctionChecker inside my checker. They provide the necessary `DynamicSizeInfo` easily, which could be used in at least 4 checkers at the moment (which I have pointed out earlier in D68725).
> > > > > 
> > > > > If I have the size expression in the dynamic size map, and I can clearly point out the destination buffer, it is a lot more simplified to traverse the graph where the buffer and its size comes from.
> > > > Well, you really do not want to store `SizeExpr` of `malloc(SizeExpr)` and you are right I will have to traverse from it to see whether the `SizeExpr` is ambiguous or not, where it comes from.
> > > > 
> > > > I want to rely on the `trackExpressionValue()` as the `SizeExpr` is available by `getDynamicSizeExpr()`, so it is one or two lines of code.
> > > > 
> > > > Would you create your own switch-case to see where is the size expression goes in the allocation and use `trackExpressionValue()` on it? So that you do not store information in the global state which results in better run-time / less memory.
> > > > 
> > > > At first I really wanted to model `malloc()` and `realloc()` and stuff, then I realized the `MallocChecker` provides every information I need. Would it be a better idea to create my own tiny `MallocChecker` inside my checker which does nothing but marks the size expression being interesting with `NoteTags`?
> > > > 
> > > > Also I am thinking of a switch-case on the `DefinedOrUnknownSVal Size` which somewhere has an expression inside it which I could `trackExpressionValue()` on.
> > > > 
> > > > Basically we are missing the rules what to use and I have picked the easiest solution. Could you share please which would be the right direction for such a simple task?
> > > > I want to rely on the `trackExpressionValue()` as the `SizeExpr` is available by `getDynamicSizeExpr()`, so it is one or two lines of code.
> > > 
> > > This won't work. `trackExpressionValue()` can only track an active expression (that has, or at least should have, a value in the bug-node's environment). You'll have to make a visitor or a note tag.
> > > 
> > > You can either make your own visitor (which will detect the node in which the extent information becomes present), or convert `MallocChecker` to use note tags and then inter-operate with those tags (though the interestingness map - "i mark the symbol as interesting so i'm interested in highlighting the allocation site" - or a similar mechanism). The second approach is more work because no such interoperation has ever been implemented yet, but it should be pretty rewarding for the future.
> > > This won't work. trackExpressionValue() can only track an active expression (that has, or at least should have, a value in the bug-node's environment). You'll have to make a visitor or a note tag.
> > So because most likely after the `malloc()` the `size` symbol dies, the `trackExpressionValue()` cannot track dead symbols? Because we could make the `size` dying base on the `buffer`, we have some dependency logic for that. It also represents the truth, the size is part of that memory block's region. After that we could track the expression of the `size`?
> > So because most likely after the malloc() the size symbol dies...?
> 
> After the `malloc()` is consumed, the size //expression// dies and gets cleaned up from the //Environment//. The symbol will only die if the value wasn't put into the //Store// in the process of modeling the statement that consumes the `malloc()` expression (such as an assignment). But `trackExpressionValue()` can only track live (active) expressions.
I see. Now I have tried out what we have. The `trackExpressionValue()` has a lookup to see where is the expression available:

```lang=c++
/// Find the ExplodedNode where the lvalue (the value of 'Ex')                   
/// was computed.                                                                
static const ExplodedNode* findNodeForExpression(const ExplodedNode *N,          
                                                 const Expr *Inner) {            
  while (N) {                                                                    
    if (N->getStmtForDiagnostics() == Inner)                                     
      return N;                                                                  
    N = N->getFirstPred();                                                       
  }                                                                              
  return N;                                                                      
}
```

from that point the expression was alive, and tracking is fine.

-------------

The `InnerPointerChecker` has introduced a place: `AllocationState.h` to communicate with the `MallocBugVisitor`. I believe this is the simplest way to communicate.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:184
+  if (IsFix) {
+    if (Optional<std::string> SizeStr = getSizeExprAsString(Call, CallC, C)) {
+      renameFunctionFix(UseSafeFunctions ? "gets_s" : "fgets", Call, *Report);
----------------
NoQ wrote:
> Charusso wrote:
> > NoQ wrote:
> > > Also, which is probably more important, you will never be able to provide a fixit for the malloced memory case, because there may be multiple execution paths that reach the current point with different size expressions (in fact, not necessarily all of them are malloced).
> > > 
> > > Eg.:
> > > ```lang=c
> > > char *x = 0;
> > > char y[10];
> > > 
> > > if (coin()) {
> > >   x = malloc(20);
> > > } else {
> > >   x = y;
> > > }
> > > 
> > > gets(x);
> > > ```
> > > 
> > > If you suggest replacing `gets(x)` with `gets_s(x, 20)`, you'll still have a buffer overflow on the else-branch on which `x` points to an array of 10 bytes.
> > This checker going to evolve a lot, and all of the checked function calls have issues like that. I do not even think what else issues they have. I would like to cover the false alarm suppression when we are about to alarm. Is it would be okay? I really would like to see alarms first.
> > 
> > For example, I have seen stuff in the wild so that I can state out 8-param allocators and we need to rely on the checkers provide information about allocation.
> *summons @Szelethus*
> 
> Apart from the obviously syntactic cases, you might actually be able to implement fixits for the situation when the reaching-definitions analysis displays exactly one definition for `x`, which additionally coincides with the allocation site. If that definition is a simple assignment, you'll be able to re-run the reaching definitions analysis for the RHS of that assignment. If that definition comes from a function call, you might be able to re-run the reaching definitions analysis on the return statement(s) of that function (note that this function must have been inlined during path-sensitive analysis, otherwise no definition in it would coincide with the allocation site). And so on.
> 
> This problem sheds some light on how much do we want to make the reaching definitions analysis inter-procedural. My current guess is that we probably don't need to; we'd rather have this guided by re-running the reaching-definitions analysis based on the path-sensitive report data, than have the reaching-definitions analysis be inter-procedural on our own.
That is a cool idea! I hope @Szelethus has time for his project.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69813/new/

https://reviews.llvm.org/D69813