[clang] [analyzer] CStringChecker: do not branch assuming the arguments of `str(n)cmp` are equal (PR #161644)

Thu Oct 2 06:33:51 PDT 2025

https://github.com/NagyDonat commented:

I see that the old behavior had problems, but the new code also has a problematic aspect: in the case when the pointers may or may not be same (that is, when `StSameBuf && StNotSameBuf` is true) it always assumes that the pointers are not equal (because it assigns `state = StNotSameBuf;` in the line just below the diff context displayed by github).

To test this, you should replace the `// ...` placeholders in your new tests with `clang_analyzer_eval(a == b); // expected-warning{{UNKNOWN}}`. Ideally these checks should pass (if the strings are identical, the pointers may or may not be equal) but with your code they will produce FALSE (because the `StSameBuf` branch is silently discarded).

To avoid losing the "pointers may be equal" case you should remove the statement `state = StNotSameBuf;`. Unfortunately, deleting this statement also has a serious side effect:  after it `clang_analyzer_eval(a == b);` would also be `UNKNOWN` on the branch where `strcmp` is nonzero. (However, this side effect is just "not recording knowledge" which is IMO less problematic than "arbitrarily discarding a case that would be possible".)

This side effect can be easily mitigated in the case when the analyzer can evaluate the comparison of known strings (just re-add `state = StNotSameBuf;` on the branch where the checker records the assumption that the result is GT or LT compared to zero).

Mitigating the side effect is also possible in the case when the analyzer cannot evaluate the comparison (which includes e.g. the tests added in this PR): to do this we should:
- call `auto [StResultEqual, StResultNonEqual] = state->assume(resultVal)` (where `resultVal` is the freshly conjured otherwise unconstrained result value)
  - these states will be both non-null because the value was unconstrained
- add a transition to `StResultEqual` (i.e. the string values equal, pointers may or may not be equal)
- starting from `StResultNotEqual` (when the string values differ) assume that the two pointers are different, then add a second transition to the resulting state (if it is non-null ... in theory it should be always non-null, but there might be some rare corner case were it isn't).

This second mitigation introduces a state split, but IMO this is a much more "benign" state split than the one that was introduced by the old code. Separating "`strcmp` returns zero" and "`strcmp` returns nonzero" is a reasonable after a call to `strcmp` (in fact, this is very similar to the `eagerly-assume` heuristic, which works correctly and is enabled by default) -- while the old "pointers are equal" and "pointers are different" state split was indeed a surprising and unjustified consequence of calling `strcmp`.

_I admit that these suggestions are a bit complex, feel free to ask for details if you don't understand something..._

https://github.com/llvm/llvm-project/pull/161644