[cfe-dev] [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org
Wed Aug 19 07:59:48 PDT 2020


Hi!

Since the metadata symbol describes a property of the region I think in
this example it might make sense to duplicate the metadata symbol for dst.
But I can also imagine code using the length of a dead string which would
be nice to support.

I also wonder what is the advantage of not using a conjured symbol.

Regards,
Gabor


On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <benicsbalazs at gmail.com> wrote:

> It turns out that if the region of `src` is alive, then the required
> metadata symbol will be kept alive as well.
> Here is the modified example:
> void strcat_symbolic_src_length(char *src) {
>   char dst[8] = "1234";
>   strcat(dst, src);
>   clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
>   (void)*dst;
>   *(void)*src;* // Now we keep the `src` alive, thus any metadata symbols
> to that region will be alive as well at the eval call.
> }
> ```
>
> It seems slightly confusing to me that depending on that `src` is used
> later or not, the `clang_analyzer_eval(strlen(dst) >= 4)` will show either
> `TRUE` or `UNKNOWN`.
> I think it should always give `TRUE` as an answer.
>
> Note that a metadata symbol is alive only if marked in use AND its region
> is also alive.
> Without the `(void)*src;` the region of `src` is dead, thus the symbol
> ($meta{src} + 4) representing the cstring length of the region `dst` will
> be dead too.
>
> Since this problem was caused by the handling of metadata symbols,
> shouldn't we use conjured ones instead?
> In that way, we would decouple the liveness of the cstring length of a
> region and the region itself.
>
> What is the use-case for using a metadata symbol instead of a conjured one?
>
> Artem Dergachev <noqnoqneo at gmail.com> ezt írta (időpont: 2020. aug. 7.,
> P, 0:05):
>
>>
>>
>> On 8/6/20 1:38 PM, Balázs Benics wrote:
>>
>> Gábor
>> > How much work would it be to prototype keeping these expressions alive
>> and measuring performance and memory implications?
>> I'm not sure, since I'm not really experienced in this liveness stuff. I
>> will try it.
>>
>> Artem
>> > It sounds to me as if putting metadata symbols into the live set
>> directly would have worked just fine.
>> Ehm, I' not sure about this.
>> If you look my example closely, you can note that the CStringChecker
>> maps directly the SymRegion{reg_$0<char * src>} to the meta_$2{SymRegion{reg_$0<char
>> * src>},unsigned long} symbol.
>> So the SymbolReaper::markInUse will in fact place that meta_$2 symbol in
>> the Live set.
>> However later, when you query the SymbolReaper if the mentioned meta_$2
>> symbol is dead or not, it will answer you that it is *dead*.
>>
>>
>> That's not how that works.
>>
>> markInUse() doesn't put anything into the live set, it puts things into
>> an auxiliary "metadata-in-use" set:
>> https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396
>>
>> On the other hand, isDead()/isLive() has to return true if the symbol is
>> present in the live set. In fact, that's the first thing it checks for:
>> https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437
>>
>>
>> I'm quoting the related trace log:
>>
>>> [...]
>>> *CStringChecker::**checkLiveSymbols marks
>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' in use*
>>>
>>> # Even we marked the given symbols in use, we still removes them for
>>> some reason...
>>> CStringChecker::checkDeadSymbols finds the
>>> '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so
>>> removes the mapping from 'dst'
>>> *CStringChecker::**checkDeadSymbols finds the
>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes
>>> the mapping from 'SymRegion{reg_$0<char * src>}'*
>>>
>>
>> Why does the ExprEngine conjure a return symbol, if an evalCall already
>> evaluated the call?
>>
>> Artem
>> > See how RegionStore does that within `ScanReachableSymbols::scan(const
>> SymExpr *)`
>> I managed to implement a metadata collector visitor using the new
>> SymExprVisitor infrastructure, in just a couple of lines of code. I was
>> amazed by that :)
>>
>> Artem Dergachev <noqnoqneo at gmail.com> ezt írta (időpont: 2020. aug. 6.,
>> Cs, 19:04):
>>
>>> I- i- i was about to reply to that!
>>>
>>> I don't know why metadata-in-use is a thing at all. It sounds to me as
>>> if putting metadata symbols into the live set directly would have worked
>>> just fine. If you find any interesting counterexamples please let me know.
>>>
>>> Apart from that, indeed, the correct way to implement checkLiveSymbols
>>> when you're tracking arbitrary symbols is to iterate over these arbitrary
>>> symbols and mark all sub-symbols as live. See how RegionStore does that
>>> within `ScanReachableSymbols::scan(const SymExpr *)`. I.e., the following
>>> example works correctly and i expect CStringChecker to work similarly:
>>>
>>> ```
>>> int conjure();
>>>
>>> int foo() {
>>>   int x = conjure();
>>>   clang_analyzer_warnOnDeadSymbol(x);
>>>   return x + 1;
>>> }
>>>
>>> void bar() {
>>>   int y = foo(); // At this point `conj_$2` is no longer directly
>>> present in the state; only `conj_$2 + 1` is.
>>>   (void)y;
>>> } // But despite that, `conj_$2` only dies here.
>>> ```
>>>
>>>
>>> On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
>>>
>>> +Artem
>>>
>>> It would be great if the analyzer could reason about code like that. I
>>> think Artem is the most competent in these liveness related problems.
>>> Aside from performance, I do not see any downside for keeping the whole
>>> symbolic expression alive after markInUse was called on it (hopefully Artem
>>> corrects me if I'm wrong).
>>> But mainly due to constraint solver limitations it might not make sense
>>> to keep arbitrarily complex expressions alive.
>>>
>>> @Balázs Benics <benicsbalazs at gmail.com>
>>> How much work would it be to prototype keeping these expressions alive
>>> and measuring performance and memory implications?
>>>
>>>
>>> On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>> If you are really curious, here is some context.
>>>>
>>>> Imagine the following code (test.c):
>>>> ```
>>>> typedef typeof(sizeof(int)) size_t;
>>>>
>>>> void clang_analyzer_eval(int);
>>>> char  *strcat(char *restrict s1, const char *restrict s2);
>>>> size_t strlen(const char *s);
>>>>
>>>> void strcat_symbolic_src_length(char *src) {
>>>>   char dst[8] = "1234";
>>>>   strcat(dst, src);
>>>>   clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
>>>>   (void)*dst;
>>>> }
>>>> ```
>>>>
>>>> One would expect that the 'strlen(dst) >= 4' is TRUE, but it returns
>>>> UNKOWN.
>>>>
>>>> After doing a little bit of investigation - and debug prints - I came
>>>> up with the following trace.
>>>>
>>>> ---
>>>>
>>>> # In the CStringChecker::evalStrcat creates a metadata symbol
>>>> representing the cstring length of the region pointed by 'src'.
>>>> created metadata symbol 'meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>> long}'
>>>>
>>>> # After the evalStrcat evaluated the call, the state contains the
>>>> necessary mapping that the 'dst' points to a cstring which is 4 + meta$4
>>>> long.
>>>> # Note that meta$4 represents the cstring length of the region pointed
>>>> by 'src'.
>>>> # So far so good. We know that resulting cstring length precisely.
>>>> strcpy common END State.dump: "program_state": {
>>>>   [...]
>>>>   "checker_messages": [
>>>>     { "checker": "unix.cstring.CStringModeling", "messages": [
>>>>       "CString lengths:",
>>>>       "dst: (meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) +
>>>> 4U",
>>>>       "SymRegion{reg_$0<char * src>}: meta_$2{SymRegion{reg_$0<char *
>>>> src>},unsigned long}",
>>>>       ""
>>>>     ]}
>>>>   ]
>>>> }
>>>>
>>>> # We mark all symbols as live in the cstring length map.
>>>> # At least we think so...
>>>> CStringChecker::checkLiveSymbols marks '(meta_$2{SymRegion{reg_$0<char
>>>> * src>},unsigned long}) + 4U' in use
>>>> CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char *
>>>> src>},unsigned long}' in use
>>>> CStringChecker::checkLiveSymbols marks 'meta_$2{SymRegion{reg_$0<char *
>>>> src>},unsigned long}' in use
>>>>
>>>> # Even we marked the given symbols in use, we still removes them for
>>>> some reason...
>>>> CStringChecker::checkDeadSymbols finds the
>>>> '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' as dead; so
>>>> removes the mapping from 'dst'
>>>> CStringChecker::checkDeadSymbols finds the
>>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}' as dead; so removes
>>>> the mapping from 'SymRegion{reg_$0<char * src>}'
>>>>
>>>> # Now that state does not contain the cstring length of the region
>>>> pointed by 'dst'.
>>>>
>>>> ---
>>>>
>>>> Further investigation showed that even if I would visit all the sub
>>>> SymExprs looking for SymbolMetadata would not help.
>>>> SymbolReaper::isDead would still show that the
>>>> '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long}) + 4U' SymExpr is
>>>> dead.
>>>>
>>>> ---
>>>>
>>>> How can we preserve such metadata information?
>>>>
>>>> ---
>>>>
>>>> You can also reproduce this following these steps:
>>>>
>>>> Apply the add-debug-prints.patch on top of 7f1556f292ccfd80c4ffa986d5b849f915e5cd82
>>>> "Fix typo: s/epomymous/eponymous/ NFC".
>>>> Analyze the 'test.c' file using this command:
>>>> ./bin/clang -cc1 -internal-isystem lib/clang/12.0.0/include
>>>> -nostdsysteminc -analyze -analyzer-constraints=range -setup-static-analyzer
>>>> -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection
>>>> -analyzer-config eagerly-assume=false test.c
>>>>
>>>> Balázs Benics <benicsbalazs at gmail.com> ezt írta (időpont: 2020. aug.
>>>> 5., Sze, 10:29):
>>>>
>>>>> Why does the SymbolReaper::markInUse only work on SymbolMetadata
>>>>> symbols?
>>>>>
>>>>> void SymbolReaper::markInUse(SymbolRef sym) {
>>>>>   if (isa<SymbolMetadata>(sym))
>>>>>     MetadataInUse.insert(sym);}
>>>>>
>>>>> I think it is flawed if the Symbol is a SymIntExpr holding an
>>>>> expression tree referring to SymbolMetadata symbols. In such case, those
>>>>> symbols would not be kept alive - causing some confusion on the checker
>>>>> developers' side and potentially losing some information about the analysis.
>>>>>
>>>>> Should we walk the expression tree instead of the current
>>>>> implementation?
>>>>> What performance impact should we expect by doing so?
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Balazs.
>>>>>
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>
>>>
>>> _______________________________________________
>>> cfe-dev mailing listcfe-dev at lists.llvm.orghttps://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>>
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20200819/b6dc39fb/attachment-0001.html>


More information about the cfe-dev mailing list