[PATCH] D70411: [analyzer] CERT: StrChecker: 31.c

Csaba Dabis via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Dec 6 16:16:36 PST 2019


Charusso added a comment.

Please let us measure your examples at first to have a consensus what we would like to see from this checker.

In D70411#1769803 <https://reviews.llvm.org/D70411#1769803>, @NoQ wrote:

> In D70411#1769628 <https://reviews.llvm.org/D70411#1769628>, @Charusso wrote:
>
> > F10966837: str31-c.tar.gz <https://reviews.llvm.org/F10966837>
>
>
> These reports seem to confirm my point. I took a look at first 5 and in all of them the code does in fact ensure that the buffer space is sufficient, but in 5 different ways. You most likely won't be able to enumerate all possible ways in which the user can ensure that the destination buffer size is sufficient.


I believe I can enhance the checker with the help of `NoteTags` and give more information for the user why the buffer space is Not sufficient. The size property's upper-bound is not that difficult to measure, I think we can implement this. For now, let me analyze each report.

---

> F10967274: content_encoding.c_5ae4f30ce29f14441139e7bbe20eeaaa.plist.html <https://reviews.llvm.org/F10967274>
>  The source string is taken from a global table and therefore has known maximum size.
>  F10967280: imap.c_fd80e0804acd9e7ecb9c2483151625a9.plist.html <https://reviews.llvm.org/F10967280>
>  The source string is a literal `'*'`.

When you encounter a member that is when the hand-waving begins. I have made a dump to see what we have. I have not seen any immediate way to connect the dynamic size and the members. For now we have that:
The source region is:
`SymRegion{reg_$36<const char * SymRegion{derived_$35{conj_$32{int, LC7, S161233, #1},Element{encodings,0 S64b,const struct content_encoding_s *}}}.name>}`
and the allocation's dynamic size is:
`extent_$41{SymRegion{conj_$34{void *, LC7, S161233, #1}}}`

My assumption is that, the size of a member is very arbitrary and we cannot track the member's allocation, so we even could drop every of these reports as being useless. Because I make this checker alpha I am fine with that assumption of members, and we definitely need better ideas.

---

> F10967277: ftp.c_8cb07866de61ef56be82135a4d3a5b7e.plist.html <https://reviews.llvm.org/F10967277>
>  The source string is an IP address and port, which has a known limit on the number of digits it can have.

The known size does not mean that the string is going to be null-terminated. For example the VIM project has one single global macro to serve every allocation's size, and that is why the CERT rule made. If the attacker passes a 10k characters long string and you allocate 10k+1 characters for the `'\0'` you can work with the data safely. Of course you would need to check some upper bound, but that is another story. The key is, any arbitrary buffer size could be a problem. Here:

  954 | size_t addrlen = INET6_ADDRSTRLEN > strlen(string_ftpport) // Assuming the condition is true
                         ? INET6_ADDRSTRLEN
                         : strlen(string_ftpport)

`addrlen` is an arbitrary macro size `INET6_ADDRSTRLEN`, which not necessarily could hold `string_ftpport` in `strcpy(addr, string_ftpport);`. As a human-being you know the most appropriate size, yes. But the program could be ill-formed very easily so we warn on the arbitrary-length path. Here you do not see something like `addr[INET6_ADDRSTRLEN - 1] = '\0'` so when the string is read you cannot be sure whether it is null-terminated. If line 954 would be evaluated to false, allocating `calloc(addrlen + 1, 1)` so when `addrlen` is `strlen(string_ftpport)` is fine of course, that is the appropriate size to store `string_ftpport`.

---

> F10967295: tftp.c_eee38be8d783d25f2e733fa3740d13fc.plist.html <https://reviews.llvm.org/F10967295>
>  There's an if-statement that checks that the storage size is sufficient.

Well, this is the most suspicious report of all the time. Let me reconstruct what is going on:

  513 | sbytes += tftp_option_add(state, sbytes,
                                  (char *)state->spacket.data + sbytes, // warning
                                  TFTP_OPTION_TSIZE);

My guess was that to change the `check::PostCall` to `check::PreCall`, but if I remember right the same report occurred. Here we return from the function call `tftp_option_add`, then we warn on its argument which already has been passed to the call. We should not warn when the array is passed to a function which we already have finished reading from. I have not seen any immediate solution, so I count this as an internal ~~bug~~ feature.

---

> F10967300: tool_dirhie.c_87dd00a845a927c9b8ed587c6b314af1.plist.html <https://reviews.llvm.org/F10967300>
>  The source string is a token (obtained via `strtok`) of a string that has the same size as the destination buffer.

Let me simplify the report as there is no such feature:

  111 | outlen = strlen(outfile);
  112 | outdup = strdup(outfile);
  116 | dirbuildup = malloc(outlen + 1);
  
  125 | tempdir = strtok(outdup, PATH_DELIMITERS);
  136 | if(outdup == tempdir)
  138 |   strcpy(dirbuildup, tempdir);

Here we see both `tempdir`'s and `dirbuildup`'s size is based on `strlen(outfile)` but `strtok()` strikes in and we conjure a new symbol for `dirbuildup` so the connection is lost unfortunately.

---

---

Please let us brainstorming here.

In D70411#1769786 <https://reviews.llvm.org/D70411#1769786>, @NoQ wrote:

> A while back I had a look into implementing some of the CERT checkers and my vague recollection from these attempts was that it's a bad idea to take CERT examples literally. Most of the CERT rules are explained in an informal natural language and it makes CERT great for demonstrating common mistakes that programmers make when they write in C/C++. Humans can easily understand the problem by reading CERT pages. But it doesn't necessarily mean that static analysis tools can be built by generalizing over the examples from the CERT wiki.


Well, if the member's allocation and the expression-value-tracking would not be hand-waving, such a simple checker could serve a lot. I see your point and you are right the rules are not made for tools. Our tool already knows most of the functionality what is needed for the STR rules. It is started as an experiment and it is working enough well, in my point of view. The member-value-tracking and the lack of standard function modeling of course affects this checker, but that issue affects every other checker as well. If we call them non-alpha, it could be even non-alpha.

> So i suggest that we make one more step back and agree upon what exactly are we checking here. I.e., basically, agree on the tests. Because right now it looks like you're trying to blindly generalize over the examples: //"The CERT example has a `strcpy()` into a fixed-size buffer, let's warn on all `strcpy()`s into fixed-size buffers!"//.

Just for clarification, I made the entire `MemoryBlockRegion` stuff because I do not care how do you allocate a memory block (even a `string` is counted as that).

> This way your check does indeed pass the tests, but it doesn't make sense both from the rule's perspective (the rule never said that it's wrong to `strcpy()` into a fixed-size buffer, it only said that you should anyhow guarantee that the storage is sufficient) and from the user's perspective (you won't be able to reduce the gap between false positives and false negatives enough for the check to be useful, even with the subsequent whack-a-mole of adding more heuristics, because what the check is checking is relatively orthogonal to the actual problem you're trying to solve).

There is not that much heuristic:

- If you use the unsafe calls and before reading the string you manage to null-terminate it (inject `'\0'`), it would be a false positive.
- If you have a blob of memory without considering the length of the stuff it will store, it is a true positive.

Please note that, truncation is fine:

> To prevent such errors, either limit copies through truncation or, preferably, ensure that the destination is of sufficient size

~ from the rule's page <https://wiki.sei.cmu.edu/confluence/display/c/STR31-C.+Guarantee+that+storage+for+strings+has+sufficient+space+for+character+data+and+the+null+terminator>

> For example, in the example titled "Noncompliant Code Example (argv)", it is explicitly stated that the problem is not only that the buffer is fixed-size and `strcpy()` is made, but also that the original string //is controlled by the attacker//. The right analysis to catch such issues is //taint analysis//. It's a typical taint problem. I honestly believe you should try to solve it by combining the taint checker with the string checker: "if string length metadata is tainted (in particular, if the string itself is tainted) and is potentially larger than the destination buffer size (eg., unconstrained), warn". With the recent advancements in the development of the taint checker, i think you can get pretty far this way without constantly struggling with false positives.

This checker indeed would require an additional taint analysis, but as we do not have a non-alpha variant, I am fine to call every non-null-terminated string reading an issue (as it is), whatever is the source. That is a generalization, yes, in the best possible manner what we have got today. Also I am not a big fan of relying on non-alpha stuff and I am sure I have defined my checker enough well to catch real issues.

---

---

Let me summarize:

- There is a lot more to do, we have seen too many issues to call this a finished checker.
- The problem is not the size, but the missing `'\0'` (which you can have multiple of at any point).
- We heavily need to swap the value-tracking with `NoteTags` to make this useful.



================
Comment at: clang/test/Analysis/cert/str31-c-fp-suppression.cpp:33-34
+  // string, so that we could see whether the string is null-terminated on
+  // every path or not. Because the 'src' could be not null-terminated this
+  // report is fine, but if 'src' is null-terminated this report is a false
+  // positive because the 'src' fits into 'dest' with the null terminator
----------------
NoQ wrote:
> Mmm, no, it's not fine. The warning is saying something that's not correct. Even if `src` is not null-terminated, under the assumption that no out-of-bounds read UB occurs in the `strcpy()` call, `dest` is going to always be null-terminated and have a `strlen` of at most 127. And by continuing our analysis after `strcpy()`, we inform the user that we assume that no UB has happened on the current execution path yet.
> 
> Traditionally checkers either warn immediately when they can detect an error, or assume that the error has not happened. For example, when we dereference a pointer, if the pointer is not known to be null, we actively assume that it's non-null. Similarly, if the `src` string is not null-terminated, we should warn at the invocation of `strcpy`; if it's not known whether it's null-terminated or not, we should actively assume that it's null-terminated at `strcpy`.
> 
> Also, I usually don't agree with the statements like "This report helped us find a real bug, therefore it's a true positive". You can find a bug in literally any code if you stare at it long enough! I'm glad you found the bug anyway, but if the report is not pointing at the problem directly, we're not doing a great job.
Well, I think we could check the range info and catch if we have an infeasible case. For now on one of the paths it looks like the string is not null-terminated, where we warn. Of course we need better modeling, that is why I have stated out with a FIXME.

I totally agree with hand-waving "true positives", but this is very unlikely to happen, I just came out with that example to state out the `CStringModelingChecker` is not enough in its current shape.


================
Comment at: clang/test/Analysis/cert/str31-c-fp-suppression.cpp:51-53
+  strcpy(dest, src);
+  do_something(dest);
+  // expected-warning at -1 {{'dest' is not null-terminated}}
----------------
NoQ wrote:
> Looks like a false positive. Consider the following perfectly correct (even if a bit inefficient) code that only differs from the current code only in names and context:
> ```lang=c++
> void copy_and_test_if_short(const char *src) {
>   char dest[128];
>   strcpy(dest, src);
>   warn_if_starts_with_foo(dest);
> }
> 
> void copy_and_test(const char *src) {
>   if (strlen(src) < 64)
>     copy_and_test_if_short(src);
>   else
>     copy_and_test_if_long(src);
> }
> ```
That is why I have asked whether we have a way to say "when the string is read". I like that heuristic for now, because any kind of not null-terminated string is the root of the evil for [[ https://wiki.sei.cmu.edu/confluence/display/c/STR31-C.+Guarantee+that+storage+for+strings+has+sufficient+space+for+character+data+and+the+null+terminator | STR31-C ]].

In this case the attacker is the programmer who sends a possibly not null-terminated string around and the function which receives an arbitrary input would carry the issue.

Thanks for the example, I wanted to cover ranges, but I have forgotten it, because in the wild peoples do not check for bounds. Copy-pasted.


================
Comment at: clang/test/Analysis/cert/str31-c-notes.cpp:22
+  if (gets(buf)) {}
+  // expected-note at -1 {{'gets' could write outside of 'buf'}}
+  // expected-note at -2 {{Assuming the condition is false}}
----------------
NoQ wrote:
> I still believe that this should be *the* warning in this code. This is already broken code.
No, it is not broken, sadly. I would say it is broken, but this checker tries to be smarter than `grep`. Truncation is fine for [[ https://wiki.sei.cmu.edu/confluence/display/c/STR31-C.+Guarantee+that+storage+for+strings+has+sufficient+space+for+character+data+and+the+null+terminator | STR31-C ]].

That change in my mindset made this checker possible when I have dropped the idea of warn on function calls.


================
Comment at: clang/test/Analysis/cert/str31-c-notes.cpp:26-27
+
+  free(buf);
+  // expected-warning at -1 {{'buf' is not null-terminated}}
+  // expected-note at -2 {{'buf' is not null-terminated}}
----------------
NoQ wrote:
> It is not wrong to free a buffer that isn't null-terminated. We shouldn't warn here.
That is when the string is read. We could have a list of function calls which are non-problematic. For now I am fine with that note, because you do not allocate something without reading it before freeing it. It is just an arbitrary test for now.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70411/new/

https://reviews.llvm.org/D70411





More information about the cfe-commits mailing list