[cfe-dev] [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Wed Aug 19 15:02:39 PDT 2020
On 8/19/20 2:18 PM, Artem Dergachev wrote:
> Gabor, we've talked about this two days ago >_> +Nithin because we're
> facing the same decision with the smart pointer checker to which raw
> pointer value would make a lot of sense as SymbolMetadata.
>
> The way SymbolMetadata used in CStringChecker always bothered me. It
> *is* used *as if* it's SymbolConjured: it represents an unknown value
> of an expression of a specific type. As such it doesn't do much more
> than preserve its identity for debugging purposes. We indeed don't
> need another SymbolConjured.
>
> I believe we should remove SymbolConjured's identity elements from
> SymbolMetadata and instead make it work more like SymbolRegionValue
> works in RegionStore. Namely:
>
> 1. Remove statement, location context, block count from
> SymbolMetadata's identity. Let it solely depend on the region.
>
> 2. Get rid of the metadata-in-use set. From now on SymbolMetadata,
> like SymbolRegionValue, is live whenever its region is live.
>
> 3. When strlen(R) is used for the first time on a region R, produce
> $meta<size_t R> as the answer, but *do not store* it in the string
> length map. We don't need to change the state because the state didn't
> in fact change. On subsequent strlen(R) calls we'll simply return the
> same symbol (*because* the map will remain empty), until the string is
> changed.
>
> 4. If the string is mutated, record its length as usual. But if the
> string is invalidated (or the new length is completely unknown), *do
> not remove* the length from the state; instead, set it to
> SymbolConjured. Only remove it from the map when the region dies.
>
> 5. Keep string length symbols alive as long as they're in the map.
>
> This model is obviously correct because it mimics RegionStore which is
> obviously correct (i mean, it's obviously broken beyond repair, but
> *this* part seems to be correct). In particular, this model doesn't
> have any obvious flaws that you point out in this thread. It also
> makes sure that the identity of the state correctly reflects mutations
> in the state. I think it's very much superior to the existing model.
> This also more or less matches my belief of "everything is a Store"
> that i wanted to push on a few years ago (though that one was more
> radical; back then i wanted to make a "shadow" region for a string and
> use the actual SymbolRegionValue of that shadow region instead of the
> SymbolMetadata-as-suggested-in-this-mail).
>
> Balasz, are you willing to implement something like that?
*Balázs.
Sry!!
>
>
> On 8/19/20 7:59 AM, Gábor Horváth wrote:
>> Hi!
>>
>> Since the metadata symbol describes a property of the region I think
>> in this example it might make sense to duplicate the metadata symbol
>> for dst. But I can also imagine code using the length of a dead
>> string which would be nice to support.
>>
>> I also wonder what is the advantage of not using a conjured symbol.
>>
>> Regards,
>> Gabor
>>
>>
>> On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <benicsbalazs at gmail.com
>> <mailto:benicsbalazs at gmail.com>> wrote:
>>
>> It turns out that if the region of `src` is alive, then the
>> required metadata symbol will be kept alive as well.
>> Here is the modified example:
>> void strcat_symbolic_src_length(char *src) {
>> char dst[8] = "1234";
>> strcat(dst, src);
>> clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
>> (void)*dst;
>> *(void)*src;* // Now we keep the `src` alive, thus any metadata
>> symbols to that region will be alive as well at the eval call.
>> }
>> ```
>>
>> It seems slightly confusing to me that depending on that `src` is
>> used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)`
>> will show either `TRUE` or `UNKNOWN`.
>> I think it should always give `TRUE` as an answer.
>>
>> Note that a metadata symbol is alive only if marked in use AND
>> its region is also alive.
>> Without the `(void)*src;` the region of `src` is dead, thus the
>> symbol ($meta{src} + 4) representing the cstring length of the
>> region `dst` will be dead too.
>>
>> Since this problem was caused by the handling of metadata
>> symbols, shouldn't we use conjured ones instead?
>> In that way, we would decouple the liveness of the cstring length
>> of a region and the region itself.
>>
>> What is the use-case for using a metadata symbol instead of a
>> conjured one?
>>
>> Artem Dergachev <noqnoqneo at gmail.com
>> <mailto:noqnoqneo at gmail.com>> ezt írta (időpont: 2020. aug. 7.,
>> P, 0:05):
>>
>>
>>
>> On 8/6/20 1:38 PM, Balázs Benics wrote:
>>> Gábor
>>> > How much work would it be to prototype keeping these
>>> expressions alive and measuring performance and memory
>>> implications?
>>> I'm not sure, since I'm not really experienced in this
>>> liveness stuff. I will try it.
>>>
>>> Artem
>>> > It sounds to me as if putting metadata symbols into the
>>> live set directly would have worked just fine.
>>> Ehm, I' not sure about this.
>>> If you look my example closely, you can note that the
>>> CStringChecker maps directly the SymRegion{reg_$0<char *
>>> src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>> long} symbol.
>>> So the SymbolReaper::markInUse will in fact place that
>>> meta_$2 symbol in the Live set.
>>> However later, when you query the SymbolReaper if the
>>> mentioned meta_$2 symbol is dead or not, it will answer you
>>> that it is *dead*.
>>
>> That's not how that works.
>>
>> markInUse() doesn't put anything into the live set, it puts
>> things into an auxiliary "metadata-in-use" set:
>> https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396
>>
>> On the other hand, isDead()/isLive() has to return true if
>> the symbol is present in the live set. In fact, that's the
>> first thing it checks for:
>> https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437
>>
>>>
>>> I'm quoting the related trace log:
>>>
>>> [...]
>>> *CStringChecker::**checkLiveSymbols marks
>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'
>>> in use*
>>>
>>> # Even we marked the given symbols in use, we still
>>> removes them for some reason...
>>> CStringChecker::checkDeadSymbols finds the
>>> '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long})
>>> + 4U' as dead; so removes the mapping from 'dst'
>>> *CStringChecker::**checkDeadSymbols finds the
>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'
>>> as dead; so removes the mapping from
>>> 'SymRegion{reg_$0<char * src>}'*
>>>
>>>
>>> Why does the ExprEngine conjure a return symbol, if an
>>> evalCall already evaluated the call?
>>>
>>> Artem
>>> > See how RegionStore does that within
>>> `ScanReachableSymbols::scan(const SymExpr *)`
>>> I managed to implement a metadata collector visitor using
>>> the new SymExprVisitor infrastructure, in just a couple of
>>> lines of code. I was amazed by that :)
>>>
>>> Artem Dergachev <noqnoqneo at gmail.com
>>> <mailto:noqnoqneo at gmail.com>> ezt írta (időpont: 2020. aug.
>>> 6., Cs, 19:04):
>>>
>>> I- i- i was about to reply to that!
>>>
>>> I don't know why metadata-in-use is a thing at all. It
>>> sounds to me as if putting metadata symbols into the
>>> live set directly would have worked just fine. If you
>>> find any interesting counterexamples please let me know.
>>>
>>> Apart from that, indeed, the correct way to implement
>>> checkLiveSymbols when you're tracking arbitrary symbols
>>> is to iterate over these arbitrary symbols and mark all
>>> sub-symbols as live. See how RegionStore does that
>>> within `ScanReachableSymbols::scan(const SymExpr *)`.
>>> I.e., the following example works correctly and i expect
>>> CStringChecker to work similarly:
>>>
>>> ```
>>> int conjure();
>>>
>>> int foo() {
>>> int x = conjure();
>>> clang_analyzer_warnOnDeadSymbol(x);
>>> return x + 1;
>>> }
>>>
>>> void bar() {
>>> int y = foo(); // At this point `conj_$2` is no longer
>>> directly present in the state; only `conj_$2 + 1` is.
>>> (void)y;
>>> } // But despite that, `conj_$2` only dies here.
>>> ```
>>>
>>>
>>> On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
>>>> +Artem
>>>>
>>>> It would be great if the analyzer could reason about
>>>> code like that. I think Artem is the most competent in
>>>> these liveness related problems.
>>>> Aside from performance, I do not see any downside for
>>>> keeping the whole symbolic expression alive after
>>>> markInUse was called on it (hopefully Artem corrects me
>>>> if I'm wrong).
>>>> But mainly due to constraint solver limitations it
>>>> might not make sense to keep arbitrarily complex
>>>> expressions alive.
>>>>
>>>> @Balázs Benics <mailto:benicsbalazs at gmail.com>
>>>> How much work would it be to prototype keeping these
>>>> expressions alive and measuring performance and memory
>>>> implications?
>>>>
>>>>
>>>> On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev
>>>> <cfe-dev at lists.llvm.org
>>>> <mailto:cfe-dev at lists.llvm.org>> wrote:
>>>>
>>>> If you are really curious, here is some context.
>>>>
>>>> Imagine the following code (test.c):
>>>> ```
>>>> typedef typeof(sizeof(int)) size_t;
>>>>
>>>> void clang_analyzer_eval(int);
>>>> char *strcat(char *restrict s1, const char
>>>> *restrict s2);
>>>> size_t strlen(const char *s);
>>>>
>>>> void strcat_symbolic_src_length(char *src) {
>>>> char dst[8] = "1234";
>>>> strcat(dst, src);
>>>> clang_analyzer_eval(strlen(dst) >= 4); //
>>>> expected-warning{{TRUE}}
>>>> (void)*dst;
>>>> }
>>>> ```
>>>>
>>>> One would expect that the 'strlen(dst) >= 4' is
>>>> TRUE, but it returns UNKOWN.
>>>>
>>>> After doing a little bit of investigation - and
>>>> debug prints - I came up with the following trace.
>>>>
>>>> ---
>>>>
>>>> # In the CStringChecker::evalStrcat creates a
>>>> metadata symbol representing the cstring length of
>>>> the region pointed by 'src'.
>>>> created metadata symbol
>>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'
>>>>
>>>> # After the evalStrcat evaluated the call, the
>>>> state contains the necessary mapping that the 'dst'
>>>> points to a cstring which is 4 + meta$4 long.
>>>> # Note that meta$4 represents the cstring length of
>>>> the region pointed by 'src'.
>>>> # So far so good. We know that resulting cstring
>>>> length precisely.
>>>> strcpy common END State.dump: "program_state": {
>>>> [...]
>>>> "checker_messages": [
>>>> { "checker": "unix.cstring.CStringModeling",
>>>> "messages": [
>>>> "CString lengths:",
>>>> "dst: (meta_$2{SymRegion{reg_$0<char *
>>>> src>},unsigned long}) + 4U",
>>>> "SymRegion{reg_$0<char * src>}:
>>>> meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
>>>> ""
>>>> ]}
>>>> ]
>>>> }
>>>>
>>>> # We mark all symbols as live in the cstring length
>>>> map.
>>>> # At least we think so...
>>>> CStringChecker::checkLiveSymbols marks
>>>> '(meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>> long}) + 4U' in use
>>>> CStringChecker::checkLiveSymbols marks
>>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>> long}' in use
>>>> CStringChecker::checkLiveSymbols marks
>>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>> long}' in use
>>>>
>>>> # Even we marked the given symbols in use, we still
>>>> removes them for some reason...
>>>> CStringChecker::checkDeadSymbols finds the
>>>> '(meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>> long}) + 4U' as dead; so removes the mapping from 'dst'
>>>> CStringChecker::checkDeadSymbols finds the
>>>> 'meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>> long}' as dead; so removes the mapping from
>>>> 'SymRegion{reg_$0<char * src>}'
>>>>
>>>> # Now that state does not contain the cstring
>>>> length of the region pointed by 'dst'.
>>>>
>>>> ---
>>>>
>>>> Further investigation showed that even if I would
>>>> visit all the sub SymExprs looking for
>>>> SymbolMetadata would not help.
>>>> SymbolReaper::isDead would still show that the
>>>> '(meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>> long}) + 4U' SymExpr is dead.
>>>>
>>>> ---
>>>>
>>>> How can we preserve such metadata information?
>>>>
>>>> ---
>>>>
>>>> You can also reproduce this following these steps:
>>>>
>>>> Apply the add-debug-prints.patch on top of
>>>> 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo:
>>>> s/epomymous/eponymous/ NFC".
>>>> Analyze the 'test.c' file using this command:
>>>> ./bin/clang -cc1 -internal-isystem
>>>> lib/clang/12.0.0/include -nostdsysteminc -analyze
>>>> -analyzer-constraints=range -setup-static-analyzer
>>>> -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection
>>>> -analyzer-config eagerly-assume=false test.c
>>>>
>>>> Balázs Benics <benicsbalazs at gmail.com
>>>> <mailto:benicsbalazs at gmail.com>> ezt írta (időpont:
>>>> 2020. aug. 5., Sze, 10:29):
>>>>
>>>> Why does the SymbolReaper::markInUse only work
>>>> on SymbolMetadata symbols?
>>>>
>>>> void SymbolReaper::markInUse(SymbolRef sym) {
>>>> if (isa<SymbolMetadata>(sym))
>>>> MetadataInUse.insert(sym); }
>>>>
>>>> I think it is flawed if the Symbol is a
>>>> SymIntExpr holding an expression tree referring
>>>> to SymbolMetadata symbols. In such case, those
>>>> symbols would not be kept alive - causing some
>>>> confusion on the checker developers' side and
>>>> potentially losing some information about the
>>>> analysis.
>>>>
>>>> Should we walk the expression tree instead of
>>>> the current implementation?
>>>> What performance impact should we expect by
>>>> doing so?
>>>>
>>>> Any ideas?
>>>>
>>>> Balazs.
>>>>
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>
>>>>
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20200819/67a4441a/attachment-0001.html>
More information about the cfe-dev
mailing list