<div dir="ltr">Hi!<br><div class="gmail_extra"><br><div class="gmail_quote">On 10 November 2015 at 17:50, Artem Dergachev via cfe-dev <span dir="ltr"><<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div bgcolor="#FFFFFF" text="#000000">

    <div style="font-size:16px" lang="x-unicode">Hello Anna,<br>

      <br>

      Thanks a lot for another detailed reply!<br>

      <br>

      It took a while for us to think through the idea of "common-law"

      summaries, as opposed to our current "civil-law" summaries, and it

      still sounds as an approach probably promising but clearly very

      complicated, for which we do not currently have enough thought put

      into.

      <br>

      <br>

      Yeah, the problem of understanding what is essential in the caller

      state, and what parts of it can be simplified out as irrelevant,

      seems the most complicated so far. It is true that using the whole

      state as a pre-condition would result in too many summaries. The

      very nature of symbolic execution assumes that there are only few

      classes of interesting program states.

      <br>

      <br>

      It seems complicated to use a branch of inlined code as a summary

      for future reference, because alpha-renaming becomes much more

      complex (it would need to understand the context-less meaning of

      an in-context symbolic value, and only then transfer it into

      another context).

      <br>

      <br>

      Instead, we'd rather consider creating an auxiliary ExprEngine for

      summary collection, as we do now, and put the current caller state

      into it "for the reference": before taking a branch, ensure that

      we do not contradict the "reference state" after renaming.

      <br>

      <br>

      Then, yeah, we try to figure out how to handle the differences in

      the store, eg. between

      <br>

          (x,0,direct): 0 S32b

      <br>

          ...

      <br>

          (x,0,direct): 2015 S32b

      <br>

          ...

      <br>

      without creating separate summary branches for each case. Even if

      the execution path seems to be determined only by the fact that,

      say, reg_$0<x> is in range [0, 10], we cannot be sure that

      it is absolutely irrelevant for the checker whether

      reg_$0<x> is 2 S32b or 3 S32b, until we ask the checker (for

      instance, reg_$0<x> may be some sort of enum or bitfield

      passed into an API call modelled by this checker inside the

      summary graph, and the checker leaves its internal mark in the GDM

      regarding its value, so that to take a completely different path

      later, but the engine may accidentally consider the difference

      between 2 S32b and 3 S32b irrelevant, because all the if's took

      the same branch so far, and hence ignore important information and

      cause false positives.

      <br></div></div></blockquote><div><br></div><div>I think there are ways to mitigate or solve this problem. Some of the checkers can add assumptions. E.g.: when you divide by a value which is unknown an assumption such that this value is not zero is added to the state. Each time a checker does something like this, there is a potential error case there, so that decision might need to be revised in each new context. It is also possible to tag states by the checkers. Maybe those tags could be used to help to decide what to do in these scenarios?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      Finally, i do not instantly see how the "problem of unknowns" is

      solved with this approach (as far as i understand, the point was

      about non-concrete values rather than about UnknownVal

      specifically). In a particular "precedent" context, for which the

      summary is constructed, even if it is not the "True" state, some

      values may still be unknown. And we'd still need to record and

      replay our decisions based on those values when we apply the

      summary branch. Alternatively, we need make the "not knowing" some

      things a part of the pre-condition, but then the checkers still

      need to explain which particular things they should "not know".

      That is, contexts may not only be contradictory, but also they may

      be more specific or less specific, and this also needs to be taken

      into account, as in these cases they are significantly different

      even if comparable (nested).<br>

      <br>

      =========

      <br>

      <br>

      Note that we are actually talking about two kinds of support from

      checkers: First is about handling alpha-renaming inside their

      internal structures in the GDM, second is saving-replaying the

      unknown-value-based delayed checks.

      <br>

      <br>

      However the support for alpha-renaming in checkers doesn't have

      anything to do with the summary. In general, we cannot even

      support the following IPA-free test without making checker API

      more complicated:

      <br>

      <br>

        void foo(FILE *f1, FILE *f2) {

      <br>

          if (f1 == f2) {

      <br>

            fclose(f1);

      <br>

            fclose(f2); // expected-warning{{double close}}

      <br>

          }

      <br>

        }

      <br>

      <br>

      We can think of alpha-renaming as a purely constraint-solver-side

      task, and in this case the checker would need to consult the

      constraint manager every time it sees a symbol, to see if it is

      equal to one of the symbols it already possesses (hey

      reg_$0<f2>, are you sure you are not equal to

      reg_$1<f1>?), which results in a manual O(N) lookup instead

      of regular O(log N) lookup in the GDM maps.

      <br>

      <br>

      Instead, we can also think of alpha-renaming as a one-time process

      of completely eliminating the symbol that was renamed into another

      symbol from all parts of the program state, including GDM, and

      making sure it doesn't ever reincarnate. Eg., ensure that

      SValBuilder would not construct the old symbolic value again, even

      if asked directly to do so, but return the new symbolic value

      instead. This approach seems preferable to us, because it doesn't

      require O(N) lookups described above, and all the necessary code

      is put in the single callback.

      <br>

      <br>

      Note that while an "evalRenaming()" callback is necessary to

      support alpha-renaming, it is not really scary if this callback is

      not instantly implemented in the checker; the checker would just

      fail in these situations as it always did. For summary callbacks,

      this is not the case.

      <br></div></div></blockquote><div><br></div><div>I think this might start to get very hard to track. I would prefer a new kind of symbol that is going to be the subject of actualization. So we could assert when a symbol that is not supposed to be actualized is being renamed, or a such symbol is used after summary application without actualization.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div>

      <br>

      =========

      <br>

      <br>

      Regarding the heap and invalidation problem. Our current

      placeholder approach is that whenever the function does at least

      any global invalidation, when applying the summary we run

      invalidation as if the whole function was evaluated

      conservatively, and then write over the store contents that remain

      in the summary. This approach is overly conservative for many

      reasons. Additionally, it affects alpha-renaming of SymbolDerived.

      In our current approach, SymbolDerived always gets renamed into a

      SymbolDerived with renamed parent symbol and parent region, which

      is also conservative.

      <br>

      <br>

      I can come up with a much less conservative approach, which

      assumes recording and replaying invalidations. For instance, if in

      a summary we invalidate the whole heap through an unknown pointer,

      in the caller context we may know that this pointer points to a

      concrete region, and then SymbolDerived, which appeared during

      this invalidation, would resolve into a non-derived symbol. It is

      still more lightweight than replaying all stores:

      <br>

          - First, we take the store before the first invalidation, and

      merge it;

      <br>

          - Then we take the first invalidation and perform it according

      to the context;

      <br>

          - Then we take the store before the second invalidation, and

      merge it;

      <br>

          - Then we perfor the second invalidation;

      <br>

          - Etc, until we perform all invalidations, and finally take

      the final store and merge it.

      <br> </div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      Then this whole "cascading invalidation" is only good for two

      things: (1) Modelling the final store after the call and (2)

      Properly renaming SymbolDerived - all other symbols are completely

      unaffected! We need to construct the whole invalidation cascade

      before anything, even before ranges, so that ranges on

      SymbolDerived were proper.

      <br>

      <br>

      The easiest way i see to explain why SymbolDerived is special is

      to say that symbols are constant in time, yet often tied to

      particular moments of time. Regions are completely "space-like":

      meaning of a region doesn't change in time, though we may know

      more about the region during the analysis (eg. rename a symbolic

      region into a concrete region or into a heap symbolic region).

      Symbols are, instead, born in a certain moment of time, and:

      <br>

      <br>

      --= unlike the regular alpha-renaming, the context switch for

      summary (which we earlier referred to as "actualization") can be

      described as a time shift, in term of symbol birth times,

      affecting initially unpredictable sets of atomic symbols =--

      <br>

      <br>

      For example, SymbolRegionValue is the value in the store at the

      "beginning of analysis" time moment, and when we shift this time

      moment to moment of applying the summary, it gets renamed into the

      value of the store at this moment, but it is irrelevant if some

      sort of invalidation had or had not happened after the call-enter

      event to which it is now tied. SymbolConjured is tied to the

      moment of conjuring, which is recorded in the form of block count.

      Actualization assumes time-shifting this block count.

      <br>

      <br>

      Similarly, SymbolDerived is tied to the value of the region at the

      moment of time in which invalidation occured, and this moment also

      needs to be shifted to the relevant moment in the invalidation

      cascade.

      <br>

      <br>

      We cannot yet predict how resource-heavy the cascading

      invalidation approach would be, and it more or less works with our

      current dummy approach, so we can implement the cascading

      invalidation, or delay it for later patches.<br></div></div></blockquote><div><br></div><div>Basically the invalidation depends on ordering of events, so it requires timestamps I guess. Also, when a summary is applied, basically the checker get a list of exploded nodes and it can replay some of the state changes to based on the context and the summary. But what happens when the checker needs to do some invalidation or add some assumption on aliasing (that affects invalidation)? I have the feeling that each time we want to solve some of these problems we are getting close and closer to a full replay. <br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      =========

      <br>

      <br>

      Regarding interaction between inter-dependent checkers. While we

      didn't code this yet, it might be partially naively solved by

      introducing dependency graphs for checkers. Eg. a checker that

      relies on taint propagation may pull the GenericTaintChecker as

      its dependency, and then our checker gets summary-applied after

      GenericTaintChecker, when the taint information is already

      available. Another minor use case for this feature

      is that if you explicitly run only one path-sensitive checker that

      only registers itself on C++ files, then you can skip pulling core

      checkers on C and ObjC files and run only AST checkers for them,

      saving analysis time.

      <br></div></div></blockquote><div><br></div><div>Dependency graph for checkers would be useful regardless of summary based IPA. I had to do workarounds several times because the lack of this. But there are more to these dependencies than checkers depending on each other. Checkers might also add constraints that the core symbolic execution depends on.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      While we may think of a single pipeline for all summary events

      (this checker marks the file descriptor as open, then that checker

      marks malloc as leaked, then this checker marks this symbol as

      tainted, then the first checker marks the file descriptor as

      closed, ...), i think it is unnecessary for the very same reason:

      meaning of symbols does not change through time, so sounds to me

      as if it's against the very idea of the symbol to code it this

      way. Though i may be missing something, it's an open choice for

      us.

      <br>

      <br>

      =========

      <br>

      <br>

      Below is a roadmap for us now, as we see it. In some cases we need

      some help with choosing between a quick and a full-featured

      solution.<br>

      <br>

      =1= We're done with a full-featured SValVisitor, similar to

      StmtVisitor, which would be helpful for alpha-renaming, and

      probably for many other things, maybe some checkers would be

      interested. In order to make sure it compiles, i made a simple

      "SVal explainer":

      <br>

      <br>

        int glob;

      <br>

        void test(int param, void *ptr) {

      <br>

          clang_analyzer_explain(&glob); //

      expected-warning{{pointer to variable 'glob'}}

      <br>

          clang_analyzer_explain(param); // expected-warning{{initial

      value of variable 'param'}}

      <br>

          clang_analyzer_explain(ptr); // expected-warning{{pointee of

      initial value of variable 'ptr'}}

      <br>

          if (param == 42)

      <br>

            clang_analyzer_explain(param); // expected-warning{{signed

      32-bit integer '42'}}

      <br>

        }

      <br>

      <br>

      This also may be useful if the checker wants to refer to a value

      in the warning message in a user-friendly manner. If anybody likes

      something like this as a separate commit, i can open a review.<br></div></div></blockquote><div><br></div><div>Just an update, as far as I remember this is already committed.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      =2= Then we start with alpha-renaming, and we've got the following

      problems to deal with:

      <br>

          - Decide if we want to keep renaming inside the constrant

      manager or let it work through destroying symbols, as described in

      this letter;

      <br>

          - Decide how to work with renaming of concrete offsets in

      store keys into symbolic+concrete offsets, as described in the

      previous letter <a href="http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html" target="_blank">http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html</a>

      (the last part of the miscellaneous section). Possible solutions

      include making store keys heavier or reconstructing original

      regions by offsets (tried this, very ugly and unreliable, imho;

      the former solution would also solve the issue pointed out in <a href="http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html" target="_blank"></a><a href="http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html" target="_blank">http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html</a>).

Solving

      this by recording-and-replaying all stores is something we'd try

      to avoid as best as we can, i guess.<br></div></div></blockquote><div><br></div><div>What is the status of this one? Do you have a patch under review? <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      =3= Decide if we want context-sensitive or context-insensitive

      summaries, as discussed above in this letter.

      <br></div></div></blockquote><div><br></div><div>Is there a consensus on this one?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      =4= If context-insensitive approach is chosen, implement summaries

      for checkers, solve interdependency problems, as discussed above

      in this letter.

      <br></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

      =5= Implement store actualization and invalidation, either through

      simple-conservative approach, or via "cascading" invalidation, as

      described above in this letter.

      <br></div></div></blockquote><div><br><div>Lets try to summarize what are the tasks:<br></div><div>- Summary apply callback for checkers<br></div><div>- Solve the interdependency problem<br></div><div>- Update all of the checkers?<br></div><div>- Timestamps to solve some aliasing related problems<br></div><div>- Filter confusing assuming.. stuff from path reconstruction<br></div><div>- Figure out how to do leak related checks properly<br></div>- Figure out how to do invalidation properly<br></div><div>- Anything I missed?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000"><div style="font-size:16px" lang="x-unicode">

      <br>

    </div>

  </div>

<br>_______________________________________________<br>

cfe-dev mailing list<br>

<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a><br>

<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>

<br></blockquote></div><br></div></div>