[PATCH] D159105: [analyzer] ArrayBoundCheckerV2 should check the region for taint as well

Wed Aug 30 08:52:41 PDT 2023

steakhal planned changes to this revision.
steakhal added a comment.

In D159105#4627883 <https://reviews.llvm.org/D159105#4627883>, @donat.nagy wrote:

> In D159105#4627785 <https://reviews.llvm.org/D159105#4627785>, @steakhal wrote:
>
>> In D159105#4627724 <https://reviews.llvm.org/D159105#4627724>, @donat.nagy wrote:
>>
>>> I fear that the overzealous handling of tainted data would produce too many false positives in situations when code indexes strings that contain user input and the SA engine cannot understand the logic that calculates the indices. For example if I understand it correctly a function like
>>>
>>>   char f(int n) {
>>>     char *var = getenv("FOO");
>>>     return var[n];
>>>   }
>>>
>>> would be reported as an out of bounds memory access, because the region is tainted and the index value is not known. This is especially problematic on the underflow side (which is introduced by your commit), because I'd assume that it's common to have numeric values (e.g. function arguments) where the programmer knows that they're nonnegative, but the analyzer cannot deduce this.
>>
>> To me, this is a TP. The envvar might not be present, and even if it's present, the relation of the string length of `var` to `n` is not expressed or checked; thus this deserves a report.
>> But I see your concern, however, I cannot think of valid counterexamples, aka. FPs off the top of my head. I'll come back once I see the real results.
>
> Hmm, you're right that my example is a TP as written; however I could imagine similar FPs where we e.g. check that `n < strlen(var)` but don't check that `n` is nonnegative (because that's the responsibility of the caller).
>
> Anyway, let's wait for the test results, they'll be enlightening. (You mostly convinced me that your commit will work, but Murphy's laws are still standing...)

I have the results now, I just need to evaluate it. It turns out to require some scripting, as we sooner detect out-of-bounds accesses (by not allowing to form an lvalue to such) the locations change.
Because of this changed location, I have to filter out the cases when the "same line" has a report "disappearing" and "appearing".
The sooner detection also means that I have more paths killed, thus "in the close proximity below the appearing issue some others disappear - if they are dominated".
I have to manually check these because that diff is sort of intentional; to see what's left and look for unintentional cases.

I can already confirm an issue though, similar to the one you showed, and similar variants:

  char *str = getenv(name);
  if (str && str[0])
    return atoi(str);
  return 0;

This sort of code looks good, and I'll think about how to suppress them.
Probably, because of this, we have a huge relative increase for `Out of bound memory access (index is tainted)` diagnostics (+72%, +447).
The second biggest relative change was for `Out of bound memory access (access exceeds upper limit of memory block)` (+18%, +1862).
These are preliminary numbers, and I'll continue the validation.

>>> Also note that the error message "Out of bound memory access (index is tainted)" becomes misleading after this patch -- but don't spend too much time on resolving this, because right now I'm working on a complete rewrite the message generation to replace the current spartan messages with useful error reporting.
>>
>> I agree, but I'm not sure how much more readable something more accurate like "the location where this pointer points to potentially depends on untrusted tainted data". (Note that this would also work for tainted indexes as well).
>>
>> I must admit, that reporting and diagnostics were low priorities for this patch stack, so I'd focus on addressing logic concerns first for the whole stack and then come back to reporting to make it consistent across the stack.
>
> Improving reporting and diagnostics is also on my TODO list, so we should coordinate progress in that area to avoid redundant work. If you wish to work on it, then I'm happy to review; otherwise I'll do something with it soon (perhaps after you merged these commits to avoid merge conflicts).

I don't plan to work on the checker.

I'll post the result of the evaluation, once I'm happy with the findings. This might need some iterations.

Thanks for the review!
And sorry that I haven't done this evaluation before I posted these for review. Usually, it takes more time for the community to review patches - but this was a pleasant surprise!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D159105/new/

https://reviews.llvm.org/D159105