[cfe-dev] [analyzer] Calling SymbolReaper::markInUse on BinarySymExprs are ignored

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Wed Aug 19 15:02:39 PDT 2020


On 8/19/20 2:18 PM, Artem Dergachev wrote:
> Gabor, we've talked about this two days ago >_> +Nithin because we're 
> facing the same decision with the smart pointer checker to which raw 
> pointer value would make a lot of sense as SymbolMetadata.
>
> The way SymbolMetadata used in CStringChecker always bothered me. It 
> *is* used *as if* it's SymbolConjured: it represents an unknown value 
> of an expression of a specific type. As such it doesn't do much more 
> than preserve its identity for debugging purposes. We indeed don't 
> need another SymbolConjured.
>
> I believe we should remove SymbolConjured's identity elements from 
> SymbolMetadata and instead make it work more like SymbolRegionValue 
> works in RegionStore. Namely:
>
> 1. Remove statement, location context, block count from 
> SymbolMetadata's identity. Let it solely depend on the region.
>
> 2. Get rid of the metadata-in-use set. From now on SymbolMetadata, 
> like SymbolRegionValue, is live whenever its region is live.
>
> 3. When strlen(R) is used for the first time on a region R, produce 
> $meta<size_t R> as the answer, but *do not store* it in the string 
> length map. We don't need to change the state because the state didn't 
> in fact change. On subsequent strlen(R) calls we'll simply return the 
> same symbol (*because* the map will remain empty), until the string is 
> changed.
>
> 4. If the string is mutated, record its length as usual. But if the 
> string is invalidated (or the new length is completely unknown), *do 
> not remove* the length from the state; instead, set it to 
> SymbolConjured. Only remove it from the map when the region dies.
>
> 5. Keep string length symbols alive as long as they're in the map.
>
> This model is obviously correct because it mimics RegionStore which is 
> obviously correct (i mean, it's obviously broken beyond repair, but 
> *this* part seems to be correct). In particular, this model doesn't 
> have any obvious flaws that you point out in this thread. It also 
> makes sure that the identity of the state correctly reflects mutations 
> in the state. I think it's very much superior to the existing model. 
> This also more or less matches my belief of "everything is a Store" 
> that i wanted to push on a few years ago (though that one was more 
> radical; back then i wanted to make a "shadow" region for a string and 
> use the actual SymbolRegionValue of that shadow region instead of the 
> SymbolMetadata-as-suggested-in-this-mail).
>
> Balasz, are you willing to implement something like that?

*Balázs.
Sry!!

>
>
> On 8/19/20 7:59 AM, Gábor Horváth wrote:
>> Hi!
>>
>> Since the metadata symbol describes a property of the region I think 
>> in this example it might make sense to duplicate the metadata symbol 
>> for dst. But I can also imagine code using the length of a dead 
>> string which would be nice to support.
>>
>> I also wonder what is the advantage of not using a conjured symbol.
>>
>> Regards,
>> Gabor
>>
>>
>> On Wed, Aug 19, 2020, 3:08 PM Balázs Benics <benicsbalazs at gmail.com 
>> <mailto:benicsbalazs at gmail.com>> wrote:
>>
>>     It turns out that if the region of `src` is alive, then the
>>     required metadata symbol will be kept alive as well.
>>     Here is the modified example:
>>     void strcat_symbolic_src_length(char *src) {
>>       char dst[8] = "1234";
>>       strcat(dst, src);
>>     clang_analyzer_eval(strlen(dst) >= 4); // expected-warning{{TRUE}}
>>       (void)*dst;
>>     *(void)*src;* // Now we keep the `src` alive, thus any metadata
>>     symbols to that region will be alive as well at the eval call.
>>     }
>>     ```
>>
>>     It seems slightly confusing to me that depending on that `src` is
>>     used later or not, the `clang_analyzer_eval(strlen(dst) >= 4)`
>>     will show either `TRUE` or `UNKNOWN`.
>>     I think it should always give `TRUE` as an answer.
>>
>>     Note that a metadata symbol is alive only if marked in use AND
>>     its region is also alive.
>>     Without the `(void)*src;` the region of `src` is dead, thus the
>>     symbol ($meta{src} + 4) representing the cstring length of the
>>     region `dst` will be dead too.
>>
>>     Since this problem was caused by the handling of metadata
>>     symbols, shouldn't we use conjured ones instead?
>>     In that way, we would decouple the liveness of the cstring length
>>     of a region and the region itself.
>>
>>     What is the use-case for using a metadata symbol instead of a
>>     conjured one?
>>
>>     Artem Dergachev <noqnoqneo at gmail.com
>>     <mailto:noqnoqneo at gmail.com>> ezt írta (időpont: 2020. aug. 7.,
>>     P, 0:05):
>>
>>
>>
>>         On 8/6/20 1:38 PM, Balázs Benics wrote:
>>>         Gábor
>>>         > How much work would it be to prototype keeping these
>>>         expressions alive and measuring performance and memory
>>>         implications?
>>>         I'm not sure, since I'm not really experienced in this
>>>         liveness stuff. I will try it.
>>>
>>>         Artem
>>>         > It sounds to me as if putting metadata symbols into the
>>>         live set directly would have worked just fine.
>>>         Ehm, I' not sure about this.
>>>         If you look my example closely, you can note that the
>>>         CStringChecker maps directly the SymRegion{reg_$0<char *
>>>         src>} to the meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>         long} symbol.
>>>         So the SymbolReaper::markInUse will in fact place that
>>>         meta_$2 symbol in the Live set.
>>>         However later, when you query the SymbolReaper if the
>>>         mentioned meta_$2 symbol is dead or not, it will answer you
>>>         that it is *dead*.
>>
>>         That's not how that works.
>>
>>         markInUse() doesn't put anything into the live set, it puts
>>         things into an auxiliary "metadata-in-use" set:
>>         https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L396
>>
>>         On the other hand, isDead()/isLive() has to return true if
>>         the symbol is present in the live set. In fact, that's the
>>         first thing it checks for:
>>         https://github.com/llvm/llvm-project/blob/llvmorg-11.0.0-rc1/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp#L437
>>
>>>
>>>         I'm quoting the related trace log:
>>>
>>>             [...]
>>>             *CStringChecker::**checkLiveSymbols marks
>>>             'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'
>>>             in use*
>>>
>>>             # Even we marked the given symbols in use, we still
>>>             removes them for some reason...
>>>             CStringChecker::checkDeadSymbols finds the
>>>             '(meta_$2{SymRegion{reg_$0<char * src>},unsigned long})
>>>             + 4U' as dead; so removes the mapping from 'dst'
>>>             *CStringChecker::**checkDeadSymbols finds the
>>>             'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'
>>>             as dead; so removes the mapping from
>>>             'SymRegion{reg_$0<char * src>}'*
>>>
>>>
>>>         Why does the ExprEngine conjure a return symbol, if an
>>>         evalCall already evaluated the call?
>>>
>>>         Artem
>>>         > See how RegionStore does that within
>>>         `ScanReachableSymbols::scan(const SymExpr *)`
>>>         I managed to implement a metadata collector visitor using
>>>         the new SymExprVisitor infrastructure, in just a couple of
>>>         lines of code. I was amazed by that :)
>>>
>>>         Artem Dergachev <noqnoqneo at gmail.com
>>>         <mailto:noqnoqneo at gmail.com>> ezt írta (időpont: 2020. aug.
>>>         6., Cs, 19:04):
>>>
>>>             I- i- i was about to reply to that!
>>>
>>>             I don't know why metadata-in-use is a thing at all. It
>>>             sounds to me as if putting metadata symbols into the
>>>             live set directly would have worked just fine. If you
>>>             find any interesting counterexamples please let me know.
>>>
>>>             Apart from that, indeed, the correct way to implement
>>>             checkLiveSymbols when you're tracking arbitrary symbols
>>>             is to iterate over these arbitrary symbols and mark all
>>>             sub-symbols as live. See how RegionStore does that
>>>             within `ScanReachableSymbols::scan(const SymExpr *)`.
>>>             I.e., the following example works correctly and i expect
>>>             CStringChecker to work similarly:
>>>
>>>             ```
>>>             int conjure();
>>>
>>>             int foo() {
>>>               int x = conjure();
>>>               clang_analyzer_warnOnDeadSymbol(x);
>>>               return x + 1;
>>>             }
>>>
>>>             void bar() {
>>>               int y = foo(); // At this point `conj_$2` is no longer
>>>             directly present in the state; only `conj_$2 + 1` is.
>>>               (void)y;
>>>             } // But despite that, `conj_$2` only dies here.
>>>             ```
>>>
>>>
>>>             On 8/6/20 11:35 AM, Gábor Horváth via cfe-dev wrote:
>>>>             +Artem
>>>>
>>>>             It would be great if the analyzer could reason about
>>>>             code like that. I think Artem is the most competent in
>>>>             these liveness related problems.
>>>>             Aside from performance, I do not see any downside for
>>>>             keeping the whole symbolic expression alive after
>>>>             markInUse was called on it (hopefully Artem corrects me
>>>>             if I'm wrong).
>>>>             But mainly due to constraint solver limitations it
>>>>             might not make sense to keep arbitrarily complex
>>>>             expressions alive.
>>>>
>>>>             @Balázs Benics <mailto:benicsbalazs at gmail.com>
>>>>             How much work would it be to prototype keeping these
>>>>             expressions alive and measuring performance and memory
>>>>             implications?
>>>>
>>>>
>>>>             On Wed, 5 Aug 2020 at 16:17, Balázs Benics via cfe-dev
>>>>             <cfe-dev at lists.llvm.org
>>>>             <mailto:cfe-dev at lists.llvm.org>> wrote:
>>>>
>>>>                 If you are really curious, here is some context.
>>>>
>>>>                 Imagine the following code (test.c):
>>>>                 ```
>>>>                 typedef typeof(sizeof(int)) size_t;
>>>>
>>>>                 void clang_analyzer_eval(int);
>>>>                 char  *strcat(char *restrict s1, const char
>>>>                 *restrict s2);
>>>>                 size_t strlen(const char *s);
>>>>
>>>>                 void strcat_symbolic_src_length(char *src) {
>>>>                   char dst[8] = "1234";
>>>>                   strcat(dst, src);
>>>>                 clang_analyzer_eval(strlen(dst) >= 4); //
>>>>                 expected-warning{{TRUE}}
>>>>                   (void)*dst;
>>>>                 }
>>>>                 ```
>>>>
>>>>                 One would expect that the 'strlen(dst) >= 4' is
>>>>                 TRUE, but it returns UNKOWN.
>>>>
>>>>                 After doing a little bit of investigation - and
>>>>                 debug prints - I came up with the following trace.
>>>>
>>>>                 ---
>>>>
>>>>                 # In the CStringChecker::evalStrcat creates a
>>>>                 metadata symbol representing the cstring length of
>>>>                 the region pointed by 'src'.
>>>>                 created metadata symbol
>>>>                 'meta_$2{SymRegion{reg_$0<char * src>},unsigned long}'
>>>>
>>>>                 # After the evalStrcat evaluated the call, the
>>>>                 state contains the necessary mapping that the 'dst'
>>>>                 points to a cstring which is 4 + meta$4 long.
>>>>                 # Note that meta$4 represents the cstring length of
>>>>                 the region pointed by 'src'.
>>>>                 # So far so good. We know that resulting cstring
>>>>                 length precisely.
>>>>                 strcpy common END State.dump: "program_state": {
>>>>                   [...]
>>>>                   "checker_messages": [
>>>>                     { "checker": "unix.cstring.CStringModeling",
>>>>                 "messages": [
>>>>                       "CString lengths:",
>>>>                       "dst: (meta_$2{SymRegion{reg_$0<char *
>>>>                 src>},unsigned long}) + 4U",
>>>>                 "SymRegion{reg_$0<char * src>}:
>>>>                 meta_$2{SymRegion{reg_$0<char * src>},unsigned long}",
>>>>                       ""
>>>>                     ]}
>>>>                   ]
>>>>                 }
>>>>
>>>>                 # We mark all symbols as live in the cstring length
>>>>                 map.
>>>>                 # At least we think so...
>>>>                 CStringChecker::checkLiveSymbols marks
>>>>                 '(meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>>                 long}) + 4U' in use
>>>>                 CStringChecker::checkLiveSymbols marks
>>>>                 'meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>>                 long}' in use
>>>>                 CStringChecker::checkLiveSymbols marks
>>>>                 'meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>>                 long}' in use
>>>>
>>>>                 # Even we marked the given symbols in use, we still
>>>>                 removes them for some reason...
>>>>                 CStringChecker::checkDeadSymbols finds the
>>>>                 '(meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>>                 long}) + 4U' as dead; so removes the mapping from 'dst'
>>>>                 CStringChecker::checkDeadSymbols finds the
>>>>                 'meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>>                 long}' as dead; so removes the mapping from
>>>>                 'SymRegion{reg_$0<char * src>}'
>>>>
>>>>                 # Now that state does not contain the cstring
>>>>                 length of the region pointed by 'dst'.
>>>>
>>>>                 ---
>>>>
>>>>                 Further investigation showed that even if I would
>>>>                 visit all the sub SymExprs looking for
>>>>                 SymbolMetadata would not help.
>>>>                 SymbolReaper::isDead would still show that the
>>>>                 '(meta_$2{SymRegion{reg_$0<char * src>},unsigned
>>>>                 long}) + 4U' SymExpr is dead.
>>>>
>>>>                 ---
>>>>
>>>>                 How can we preserve such metadata information?
>>>>
>>>>                 ---
>>>>
>>>>                 You can also reproduce this following these steps:
>>>>
>>>>                 Apply the add-debug-prints.patch on top of
>>>>                 7f1556f292ccfd80c4ffa986d5b849f915e5cd82 "Fix typo:
>>>>                 s/epomymous/eponymous/ NFC".
>>>>                 Analyze the 'test.c' file using this command:
>>>>                 ./bin/clang -cc1 -internal-isystem
>>>>                 lib/clang/12.0.0/include -nostdsysteminc -analyze
>>>>                 -analyzer-constraints=range -setup-static-analyzer
>>>>                 -analyzer-checker=unix.cstring,alpha.unix.cstring,alpha.security.taint,debug.ExprInspection
>>>>                 -analyzer-config eagerly-assume=false test.c
>>>>
>>>>                 Balázs Benics <benicsbalazs at gmail.com
>>>>                 <mailto:benicsbalazs at gmail.com>> ezt írta (időpont:
>>>>                 2020. aug. 5., Sze, 10:29):
>>>>
>>>>                     Why does the SymbolReaper::markInUse only work
>>>>                     on SymbolMetadata symbols?
>>>>
>>>>                     void SymbolReaper::markInUse(SymbolRef sym) {
>>>>                        if (isa<SymbolMetadata>(sym))
>>>>                          MetadataInUse.insert(sym); }
>>>>
>>>>                     I think it is flawed if the Symbol is a
>>>>                     SymIntExpr holding an expression tree referring
>>>>                     to SymbolMetadata symbols. In such case, those
>>>>                     symbols would not be kept alive - causing some
>>>>                     confusion on the checker developers' side and
>>>>                     potentially losing some information about the
>>>>                     analysis.
>>>>
>>>>                     Should we walk the expression tree instead of
>>>>                     the current implementation?
>>>>                     What performance impact should we expect by
>>>>                     doing so?
>>>>
>>>>                     Any ideas?
>>>>
>>>>                     Balazs.
>>>>
>>>>                 _______________________________________________
>>>>                 cfe-dev mailing list
>>>>                 cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>>                 https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>
>>>>
>>>>             _______________________________________________
>>>>             cfe-dev mailing list
>>>>             cfe-dev at lists.llvm.org  <mailto:cfe-dev at lists.llvm.org>
>>>>             https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20200819/67a4441a/attachment-0001.html>


More information about the cfe-dev mailing list