[PATCH] D70411: [analyzer] CERT STR rule checkers: STR31-C

Wed Jul 15 21:16:10 PDT 2020

Charusso added a comment.

Thanks for the reviews!

In D70411#2153562 <https://reviews.llvm.org/D70411#2153562>, @Szelethus wrote:

> Please do not bypass the previous comments that hadn't reached a conclusion -- littering inlines about miscellaneous stuff at this stage does more harm then good, and derails the discussion.

Ugh, it is a long story. I really wanted to create the best and trendiest checker of all time with Artem so all the comments are made for him at first. We have had the exact opposite sense what we are trying to achieve here, that is why we left the project dead.

Hopefully I get your suggestion right and you want to make sure I have addressed all the comments. Let me begin with a suggestion as well:

> If you are interested in the solution, please do not care about the problem, but the solution.

- non-literal advice from //It's Not About the Shark: How to Solve Unsolvable Problems//

-------------

Let me rewind / summarize what problems happened in chunks if you would Ctrl+F5 for certain quotes and what we did in chronological order:

> You're bringing in a completely brand-new machinery here, could you explain how it works and why do you need it?

I have attached multiple bug reports to the program state (by registering a map) and invalidated all the non-fatal reports in chain on the bug-path to a given fatal-report to support two functionality of the checker.

That two functionality was a complete stupidity it turned out, I have left only the necessary one.

> I think it would really help if you draw a state machine for the checker

Because I have removed the complexity of the checker now it is easy to understand and it does not require an appendix.

> But it doesn't necessarily mean that static analysis tools can be built by generalizing over the examples from the CERT wiki.

Artem refuse the usefulness of the CERT rules from the static analysis perspective, which I agree with.

> "if string length metadata is tainted (in particular, if the string itself is tainted) and is potentially larger than the destination buffer size (eg., unconstrained), warn"

Artem suggests to develop taint analysis, which I do not want to develop, but his concerns are real.

> Traditionally checkers either warn immediately when they can detect an error, or assume that the error has not happened.

That was the near the point when I have realized we need to warn on the function calls immediately. The buffer overflow is immediately a security issue, but I was dumb to realize.

> I took a look at first 5 and in all of them the code does in fact ensure that the buffer space is sufficient, but in 5 different ways.

Artem believes the arrays are cool because they cannot overflow. For example we obtain an IP address, which has enough space as `char[64]` our mental model suggests. I believe any overflow is not cool in a security point of view: https://twitter.com/TwitterSupport/status/1283526400146837511

Now I have accepted Artem's suggestion and I do not care about arrays because the Analyzer cannot model the constraints of the array size yet / taintness. If someone working with a plain char array that could be easy to detect and measure. For the security problems we want to catch allocations.

> The known size does not mean that the string is going to be null-terminated.
>  The problem is not the size, but the missing '\0' (which you can have multiple of at any point).

I still wanted to care about null-termination, however the entire checker should be about overflow.

> - "The checker warns every time it finds an execution path on which unintended behavior occurs"
> - "The checker enforces a certain coding guideline on the user": (say, "don't pass the string by pointer anywhere before you null-terminate it")
> - If you rely on the existing taint analysis infrastructure and make a good check, that'll be wonderful and would further encourage us to move taint analysis out of alpha.

Artem tried to create a direction for the checker, it was pretty complex. I have refused again, because all the people null terminate by hand.

> This text may be hard to understand.

Balazs get confused by the second behavior, like Artem did for a while. I have tried to rephrase the documentation. With Balazs' help we made the documentation cool (Thanks again!).

> Get rid of the secondary behavior for now.

That was the time when I have learnt complexity is a huge issue and people will get confused. I have removed the entire secondary behavior of null termination checking and moved some logic to STR32-C.

-------------

The conclusion: Now this checker is very strict and hopefully enough simple. I hope we could design the future of checker-writing.

================
Comment at: clang/include/clang/StaticAnalyzer/Checkers/Checkers.td:859
+
+def Str31cChecker : Checker<"31c">,
+  HelpText<"SEI CERT checker of rules defined in STR31-C">,
----------------
baloghadamsoftware wrote:
> Maybe we could have more descriptive names for these checkers and mention the number of the rule in the `HelpText` only.
The naming is designed with Aaron and Artem, so that it is consistent with the Clang-Tidy naming. We really need to support aliasing names here.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/CMakeLists.txt:21
   CastValueChecker.cpp
+  cert/StrChecker.cpp
   CheckObjCDealloc.cpp
----------------
baloghadamsoftware wrote:
> Are there so many things common among checkers for SEI CERT string rules that the best option is to implemenet them in a single huge class/file? Would not it be more appropriate to create a library instead for common functions and then implement each rule in a separate checker class? Huge classes are difficult to understand.
> things common among checkers for SEI CERT string rules
Well, yes.

> best option is to implemenet them in a single huge file
Sadly that is our standard now and there is no API to make it well-engineered by default.

This is my first real checker and I could not really be the pioneer of creating the new standard of checker writing as I am not that experienced on my own. I believe if all the Ericsson people would gather and brainstorm for a long period of time and create the initial better way of doing checkers as a design document we could improve such document as the whole community. If some extent of consensus met by most of the developers, we could start to create a better environment, which already would create the design how should we split up the files. I do not see the immediate gains of splitting up stuff as long as we do not create such design to make sure we do not need to split up or move code anymore. 

I wanted to follow the latest trends with that checker and eventually come up with the design how to write checkers. After that I have realized it is way too much work for an individual and it is a very long run where such change is necessary: in function summaries (http://lists.llvm.org/pipermail/cfe-dev/2015-October/045730.html)

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:66
+  void checkStrcpy(const CallEvent &Call, const CallContext &CallC,
+                   CheckerContext &C) const;
+  /// \}
----------------
baloghadamsoftware wrote:
> I would avoid reusing the prefix `check` for functions that are not inherited from `Checker`. Alternatives could be `handle`, `process` etc. for modeling and `verify`, `validate` etc. for checking for error.
The core checkers which I have seen use the prefix `check` which I became in love with. I have not split the modeling/checking logic up enough well so there is no need to use prefix `verify/validate` for now, but I prefer to use the prefix `eval` for the same reason.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:100
+/// Returns the interesting region of \p V.
+static const MemRegion *getRegion(SVal V) {
+  const MemRegion *MR = V.getAsRegion();
----------------
baloghadamsoftware wrote:
> Do we really need a separate utility function for that?
Yes, it is handy.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:136
+  llvm::raw_svector_ostream Out(Msg);
+  Out << '\'' << CallName << "' could write outside of '" << printPretty(Arg, C)
+      << '\'';
----------------
baloghadamsoftware wrote:
> Create `SmallString`, use a stream to print into it, create another one in `printPretty`, convert it to `std::string` and then copy it again to the original `SmallString`? Why?
I like the style of the stream chained together and the easy usage of `printPretty()`. Basically that is why.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:165
+  if (EnableStr31cChecker)
+    createCallOverflowReport(Call, CallC, C);
+}
----------------
baloghadamsoftware wrote:
> `sprintf` not non-compliant generally. You should calculate the possible maximal length of the string and compare it to the size of the target region.
I really wanted to check for a single `%s`, but somehow I have removed it, facepalm. Thanks!

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:180
+  // FIXME: Handle multiple buffers.
+  if (FormatExpr->getString() != "%s")
+    return;
----------------
baloghadamsoftware wrote:
> Checking for `%s` is not enough. `%2s` may also be a problem if the buffer is `1` character long.
Yes, there are other problems to check. I wanted to catch the given examples at first to make sure the patch is easy to understand.

Thanks for the idea, we definitely need to use more of the dynamic size support. I have added this extra feature for both `fscanf()` and `sprintf()`.

================
Comment at: clang/lib/StaticAnalyzer/Checkers/cert/StrChecker.cpp:55
+  // they can cause a not null-terminated string. In this case we store the
+  // string is being not null-terminated in the 'NullTerminationMap'.
+  //
----------------
baloghadamsoftware wrote:
> Charusso wrote:
> > balazske wrote:
> > > The functions `gets`, `strcpy` return a null-terminated string according to standard, probably `sprintf` too, and the `fscanf` `%s` output is null-terminated too even if no length is specified (maybe it writes outside of the specified buffer but null terminated).
> > From where do you observe such information? I have not seen anything `strcpy()`-specific in the standard. My observation clearly says that, if there is not enough space for the data you *could* even write to overlapping memory, or as people says, you could meet nasal demons. It is called *undefined* behavior.
> From `cppreference.com`:
> 
> `strcpy()` -- //Copies the null-terminated byte string pointed to by src, including the null terminator, to the character array whose first element is pointed to by dest.//
> 
> `gets()` -- //https://en.cppreference.com/w/c/io/gets//
> 
> `scanf`, `fscanf`, `sscan` format `%s` -- //If width specifier is used, matches up to width or until the first whitespace character, whichever appears first. Always stores a null character in addition to the characters matched (so the argument array must have room for at least width+1 characters)//
> 
> Thus @balazske is completely right. All these functions add the null-terminator. Of course, if the string does not fit, it overwrites the memory after the buffer which is undefined behavior. But the null-terminator is definitely added.
Well, let assume that we rely on the cppreference page from now on as our best approximation of the required behavior. Thanks for the notes.

-------------

`strcpy()`:

> Copies the null-terminated byte string pointed to by src, including the null terminator
> All these functions add the null-terminator.
*Copies* does not mean it adds the null-terminator to end of the `dest` explicitly. It is possible the `src` is already not null-terminated so there is no way to copy the `'\0'`, and so that the STR32-C rule made:
https://wiki.sei.cmu.edu/confluence/display/c/STR32-C.+Do+not+pass+a+non-null-terminated+character+sequence+to+a+library+function+that+expects+a+string

-------------

I had not checked the other functions in the standard. `gets()` and `fscanf()` is going to be null-terminated according to cppreference and the IEEE standard, cool. I will release my fix on that topic in D71033.

================
Comment at: clang/test/Analysis/cert/str31-c.cpp:27
+namespace test_gets_good {
+enum { BUFFERSIZE = 32 };
+
----------------
baloghadamsoftware wrote:
> Why do we need a different buffer size here? And why `enum`?
I have copied the examples of the rule STR31-C as-is. I like the variety of buffer sizes.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70411/new/

https://reviews.llvm.org/D70411