[PATCH] D120236: Add more sources to Taint analysis

Balázs Benics via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Feb 21 05:46:31 PST 2022


steakhal added a comment.

Sorry for the wall-of-nits. Overall, looks great.



================
Comment at: clang/docs/analyzer/checkers.rst:2358
 Default sources defined by ``GenericTaintChecker``:
-``fdopen``, ``fopen``, ``freopen``, ``getch``, ``getchar``, ``getchar_unlocked``, ``gets``, ``scanf``, ``socket``, ``wgetch``
+ ``_IO_getc``, ``fdopen``, ``fopen``, ``freopen``, ``get_current_dir_name``, ``getch``, ``getchar``, ``getchar_unlocked``, ``getcw``, ``getcwd``, ``getgroups``, ``gethostname``, ``getlogin``, ``getlogin_r``, ``getnameinfo``, ``getopt``, ``getopt_long``, ``getopt_only``, ``gets``, ``getseuserbyname``, ``readlink``, ``scanf``, ``scanf_s``, ``socket``, ``wgetch``
 
----------------
typo/dup?
I cannot recognize the `getcw()` call. Could you please refer to the specification or an instance where it was defined?


================
Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:546-548
       {{"gets"}, TR::Source({{0}, ReturnValueIndex})},
       {{"scanf"}, TR::Source({{}, 1})},
+      {{"scanf_s"}, TR::Source({{}, {1}})},
----------------
If we handle `gets`, `scanf`, we should also model the `*_s` versions as well.
```lang=C
char *gets_s(char *s, rsize_t n);
int scanf_s(const char *restrict format, ...);
int fscanf_s(FILE *restrict stream, const char *restrict format, ...);
int sscanf_s(const char *restrict s, const char *restrict format, ...);
int vscanf_s(const char *restrict format, va_list arg);
int vfscanf_s(FILE *restrict stream, const char *restrict format, va_list arg);
int vsscanf_s(const char *restrict s, const char *restrict format, va_list arg);
```


================
Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:550-552
+      {{"getopt"}, TR::Source({{ReturnValueIndex}})},
+      {{"getopt_long"}, TR::Source({{ReturnValueIndex}})},
+      {{"getopt_long_only"}, TR::Source({{ReturnValueIndex}})},
----------------
IMO these functions are highly domain-specific.
On errors, they return specific/well-defined values e.g. `-1`, `'?'` or `':'`.
That being said, the analyzer does not have this knowledge, thus it will model these as `conjured` symbols.
If these values were modeled as tainted, we would likely introduce the number of false-positives regarding our limited capabilities of modeling the function accurately.

tldr; I'm against these three rules; or alternatively prove that my concerns are not issues on real code bases.


================
Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:556
+      {{"getwd"}, TR::Source({{0, ReturnValueIndex}})},
+      {{"readlink"}, TR::Source({{1, ReturnValueIndex}})},
+      {{"get_current_dir_name"}, TR::Source({{ReturnValueIndex}})},
----------------
We should check `readlinkat` as well.


================
Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:559
+      {{"gethostname"}, TR::Source({{0}})},
+      {{"getnameinfo"}, TR::Source({{2, 4}})},
+      {{"getseuserbyname"}, TR::Source({{1, 2}})},
----------------
In what cases can this function introduce taint?


================
Comment at: clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp:561
+      {{"getseuserbyname"}, TR::Source({{1, 2}})},
+      {{"getgroups"}, TR::Source({{1}})},
+      {{"getlogin"}, TR::Source({{ReturnValueIndex}})},
----------------
>On success, `getgroups()` returns the number of supplementary group IDs. On error, -1 is returned, and `errno` is set appropriately.

According to this, the return value index should be also tainted.


================
Comment at: clang/test/Analysis/taint-generic.c:385
+  struct option long_opts[] = {{0, 0, 0, 0}};
+  int opt = getopt_long_only(argc, argv, "a:b:02", long_opts, &option_index);
+  return 1 / opt; // expected-warning {{Division by a tainted value, possibly zero}}
----------------
Well, this can never return zero.
What we should do is to do a state-split for the failure case; since the application should definitely handle a failure in this part; thus the split would be justified.


================
Comment at: clang/test/Analysis/taint-generic.c:392
+int underscore_IO_getc_is_source(_IO_FILE *fp) {
+  char c = _IO_getc(fp);
+  return 1 / c; // expected-warning {{Division by a tainted value, possibly zero}}
----------------
Sometimes we taint the `fd` and propagate based on that, and othertimes, we simply just return taint.
However, I think it still looks like a better tradeoff this way.
I just wanted to highlight this. Maybe a comment on the CallDescription would be beneficial in describing this discrepancy.


================
Comment at: clang/test/Analysis/taint-generic.c:413
+  system(buf);  // expected-warning {{Untrusted data is passed to a system call}}
+  return 1 / s; // expected-warning {{Division by a tainted value, possibly zero}}
+}
----------------
Well, it can never be zero AFAIK. It might be `-1` on error, or the number of bytes stored into `buf`.
So, I would rather modify the example to `1 / (s + 1)`.


================
Comment at: clang/test/Analysis/taint-generic.c:417
+char *get_current_dir_name(void);
+int get_current_dir_name_is_source() {
+  char *d = get_current_dir_name();
----------------
Please avoid spelling the given function in the name of the test directly in a verbatim manner.
If we would use the `CDF_MaybeBuiltin` matching mode, the `get_current_dir_name_is_source` would match for the `CallDescription {"get_current_dir_name"}` due to the way we fuzzy match for builtins.
Prefixing with `Test` like `testGet_current_dir_name` would be fine though, only the underscores are handled differently.
This applies to the rest of the test cases as well.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120236/new/

https://reviews.llvm.org/D120236



More information about the cfe-commits mailing list