<div dir="ltr"><div>Hi!</div><div><br></div><div>I wanted to share some of my points as well but had little time to do so.</div><div><br></div><div>Artem identified two kinds of issues. One of which has all the information available in the bug path (in his terminology: must-have-happened) and the other which has not ( should-not-have-happened). The goal is the same in both of the cases, identify all dependencies including control and data dependencies. The main difference is that in one case we could use all the information available in the bug path while in the other we cannot. <br></div><div><br></div><div>1. must-have-happened:</div><div>Since the bug-path contains all the information we need, I do agree with Artem using bug path visitors might be sufficient. I do not know how reliable `trackExpressionValue` is in identifying data dependencies, but revising it could be one step in GSoC. For example, after quickly skimming through the implementation I am not sure if it is able to track the source of a constant value that was constant folded during the symbolic execution from different values. Only if we had a way to test this functionality in isolation quickly :) I think making that possible would also be a valuable addition. <br></div><div><br></div><div>For control dependencies, it is great that LLVM already has some algorithms to calculate dominator sets and it might be possible to make that work for the Clang CFG. <br></div><div><br></div><div>In my opinion, if you only tackle these problems by the end of the summer your project already brought great value to the community.</div><div><br></div><div>2. should-not-have-happened <br></div><div>This category cannot be solved only using visitors for sure. So we might end up with an implementation for this problem that is completely separate from the previous one (maybe except for getting control dependencies?). As Artem pointed out, we might query some information from the analyzer e.g. which functions were inlined. One option is to use a slicing algorithm here. However, we should make sure that the way a function is conservatively evaluated by the slicing algorithm and the analyzer is compatible. E.g.: if the analyzer does not invalidate a memory region during evaluating the call, the statement should not be considered as relevant for the variable in question. We have several challenges here, the analyzer reasons about memory regions while the slicing algorithm will probably not. Also, we need to do something with pointers, since we do not want to implement a full-fledged pointer analysis. We also need to decide how to represent arrays, structures etc.</div><div><br></div><div>I skimmed through the algorithm in the google docs, and I think there are some other details we need to work out. For example, you never remove anything from the set of relevant variables. Consider the following example:</div><div><span style="font-family:courier new,monospace">x = 5;</span></div><div><span style="font-family:courier new,monospace">y = x + 2;</span></div><div><span style="font-family:courier new,monospace">y = 2;</span></div><div><span style="font-family:courier new,monospace">z = y + 3;</span><br></div><div><br></div><div>Whit the slicing criterion (<span style="font-family:courier new,monospace">z = y + 3</span>). Here, once <span style="font-family:courier new,monospace">y</span> is overwritten with <span style="font-family:courier new,monospace">2</span>, it is no longer relevant, so<span style="font-family:courier new,monospace"> x</span> should not end up in the slice. The algorithm you proposed will not handle this correctly (yet). Also I am not sure if we actually need the<span style="font-family:courier new,monospace"> relevant_variable_map</span> or having a set will be sufficient.<br></div><div><br></div><div>Regards,<br></div><div>Gabor</div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 27 May 2019 at 12:52, Kristóf Umann <<a href="mailto:dkszelethus@gmail.com">dkszelethus@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">+ Ádám Balogh</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 24 May 2019 at 23:31, Artem Dergachev <<a href="mailto:noqnoqneo@gmail.com" target="_blank">noqnoqneo@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div bgcolor="#FFFFFF">

    Ok, i think i have a little bit of clarity.<br>

    <br>

    Let's first talk about must-have-happened problems. For instance,<br>

    <br>

      int *foo() {<br>

          return coin() ? new int : nullptr; // needs note<br>

      }<br>

      void bar() {<br>

        int *x = foo();<br>

        *x = 1; // warning: null dereference<br>

      }<br>

    <br>

    In this case trackExpressionValue solves the "must-have-happened"

    problem of "x must-have been assigned a null pointer value" by

    displaying a note when foo() returns nullptr. This currently more or

    less works correctly - there are a lot of bugs but the overall idea

    behind trackExpressionValue is correct.<br>

    <br>

    Example 1 in your document is another example of a

    must-have-happened problem: in order for the report to be valid, we

    need to show that 'flag' must-have-changed between line 17 and line

    13. That's something that the Static Analyzer currently cannot

    handle and you plan to improve upon it.<br>

    <br>

    Hʏᴘᴏᴛʜᴇsɪs<span class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457value_inner gmail-m_-3323113253945746305gmail-m_-2632453692712164457times"></span>

    1. All "must-have-happened" problems should be solved with a bug

    visitor. We don't need any new analysis algorithm for that.<br>

    <br>

    In particular, i believe that Example 1 can be solved by extending

    trackExpressionValue() with a bug visitor that detects

    control-flow-dependencies via this approach of yours:<br>

    <br>

    > "Check whether the statement is a control dependency of

    Statement 18 with dominator sets: 18 doesn't post dominate 17, but

    every statement in between them does, meaning that 17 is a control

    dependency of 18."<br>

    <br>

    I.e., while ascending through ExplodedGraph, we pick interesting

    statements and perform this domination check on their predecessor

    nodes until we reach the control dependency, then we track the

    control dependency recursively by adding more visitors.<br>

    <br>

    I think this is the first thing we should try on our GSoC.<br>

    <br>

    The reason why i believe that Hypothesis 1 is true is that all the

    information that we need is fully contained in the bug path. If

    something must have happened on the path, then it's probably in the

    path and we can figure it out by looking at the path. If something

    must have happened on the path and it's not in the path, then why

    did we emit a warning in the first place?<br>

    <br>

    ==============<br>

    <br>

    Now, to should-not-have-happened problems. The main motivating

    example is:<br>

    <br>

      void foo(int *y) {<br>

          if (coin()) // needs note<br>

            *y = 1;<br>

      }<br>

      void bar() {<br>

        int x;<br>

        foo(&x);<br>

        use(x); // warning: use of uninitialized value<br>

      }<br>

    <br>

    In order to make the warning comprehensible, we have to explain to

    the user why do we think that 'x' is uninitialized, as it clearly

    "should-not-have-been" initialized for the warning to be correct.

    For that purpose it is absolutely essential to display the execution

    path within the inlined call to foo().<br>

    <br>

    One way to achieve that would be to display *all* execution path and

    leave it up to the user to see that the event that

    should-not-have-happened has indeed never happened. <br>

    <br>

    That, however, would be overwhelming, so we use "path pruning" that

    cuts away execution paths in inlined calls that are known to have no

    interesting events happening. Which, in turn, makes us incapable of

    solving should-not-have-happened problems, as it's the whole point

    of the problem to display execution path in inlined functions in

    which nothing has happened.<br>

    <br>

    In order to figure this out, we need to learn how to discriminate

    between inlined calls in which nothing interesting *could* have

    happened in the first place and inlined calls in which something

    interesting could have happened but *didn't*.<br>

    <br>

    NoStoreFuncVisitor solves this problem for the example above. The

    visitor notices that the local variable is passed by non-const

    pointer into the inlined call and the call syntactically contains

    assignments through this pointer, therefore this call *could* have

    initialized the value. The visitor then suppresses pruning of this

    call by emitting an interesting note within the call ("returning

    without initializing '*y'"...).<br>

    <br>

    However, AST-based analysis in NoStoreFuncVisitor is very primitive

    and easy to trick. A more sophisticated analysis is clearly

    necessary.<br>

    <br>

    The second thing we can try in our GSoC is to come up with a better

    analysis specifically for the uninitialized value checker, because

    we already have a place to stick it into and we know how exactly do

    we want to consume the result of such analysis and evaluate it.<br>

    <br>

    The third thing we can try in our GSoC is to come up with more such

    analysis for other checkers and other should-have-happened problems

    and see if a reusable framework for creating such analyses can be

    implemented.<br>

    <br>

    <br>

    <div class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457moz-cite-prefix">On 5/23/19 1:25 PM, Kristóf Umann

      wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div dir="ltr"><br>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Thu, 23 May 2019 at

            21:58, Kristóf Umann <<a href="mailto:dkszelethus@gmail.com" target="_blank">dkszelethus@gmail.com</a>>

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div dir="ltr">

              <div>Please let me know if I didn't address something you

                mentioned!</div>

              <br>

              <div class="gmail_quote">

                <div dir="ltr" class="gmail_attr">On Thu, 23 May 2019 at

                  02:29, Artem Dergachev <<a href="mailto:noqnoqneo@gmail.com" target="_blank">noqnoqneo@gmail.com</a>>

                  wrote:<br>

                </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div bgcolor="#FFFFFF"> My primary question is, how do

                    you plan to use the data that you obtain via your

                    analysis?</div>

                </blockquote>

                <div><br>

                </div>

                <div> Gather relevant statements. Keep ExplodedNodes in

                  the bugpath for which

                  PathDiagnosticLocation::getStmt() is in the set (or,

                  something like that at least). But honestly, I didn't

                  think too much about that yet :^) I guess we could

                  just gather ExplodedNodes rather than statements,

                  we'll see.</div>

                <div> </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div bgcolor="#FFFFFF"> Like, do you plan to backtrack

                    the value of all relevant variables and/or

                    expressions in the final bug report that were also

                    encountered along the bug path? If yes, then is it

                    possible that the slicing criterion gets updated in

                    the process and you'll have to re-run the CFG-based

                    analysis to take it into account?<br>

                  </div>

                </blockquote>

                <div><br>

                </div>

                <div>No, the slicing criterion is what it is, and will

                  not change. The set of relevant statements to the

                  slicing criterion define the actual program slice,

                  which is basically the thing we're going for in this

                  project. When it turns out that a value of a variable

                  directly affects a variable in the slicing criterion

                  (either through data or control dependency), we just

                  add it to the set of relevant variables, and then if

                  something in the set of relevant variables is affected

                  (some statements or variables turn out to be a

                  control/data dependencies to them), then it's added to

                  the respective relevancy sets. Should we only traverse

                  the basic blocks the analyzer visited, I think we can

                  pull off the slicing with a single pass.<br>

                </div>

              </div>

            </div>

          </blockquote>

          <div><br>

          </div>

          <div>We'll see about that single pass thing though. I'm

            gathering some tricky examples in the document I shared and

            working on a neat algorithm.</div>

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div dir="ltr">

              <div class="gmail_quote">

                <div> </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div bgcolor="#FFFFFF"> > What would also be really

                    great is to assist this traversal with the

                    information the analyzer already has, essentially

                    only inspecting basic blocks the analyzer actually

                    visits.<br>

                    <br>

                    You mean visits on the current bug path or visits on

                    the equivalence class of bugs or visits in general

                    during analysis?<br>

                    <br>

                    Regardless, for many problems ideally we should

                    traverse basic blocks that the analyzer *didn't*

                    visit (eg., if the function didn't initialize the

                    variable on the current path, we're interested in

                    parts of the code in which it *did* initialize the

                    variable, even though we didn't necessarily have

                    time to visit them).

                  </div>

                </blockquote>

                <div><br>

                </div>

                <div>Well I mean slicing by definition isn't intended to

                  do that. The entire point of it is to gather the

                  smallest slice related to the slicing criterion, and

                  my project aims to fill in the gaps where we don't

                  provide enough information.</div>

                <div><br>

                </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div bgcolor="#FFFFFF"> It actually sounds as if all

                    problems that we're trying to solve here can be

                    classified into "must-have-happened" problems (eg.,

                    "variable must-have been declared without

                    initialization", "variable must-have been set to

                    nullptr", "memory must-have been allocated") and

                    "should-not-have-happened" problems (eg., "variable

                    should-not-have been initialized", "null value

                    should-not-have been overwritten", "pointer

                    should-not-have escaped").<br>

                    <br>

                    For must-have-happened problems, i'm recently under

                    an impression that we should suppress the bug report

                    entirely when we fail to solve them (i.e., if fail

                    to explain to the user where exactly does this

                    happen, then the report is impossible to understand

                    anyway). This is currently a very big problem for

                    our null dereference checker on some codebases,

                    especially because it uses this tracking for

                    suppressions as well (aka inlined defensive checks),

                    so when it fails to track the variable it's likely a

                    legit false positive as well, not simply a

                    hard-to-understand report.<br>

                  </div>

                </blockquote>

                <div><br>

                </div>

                <div>I think this set calculating approach inherently

                  gathers far more information, allowing us to make

                  better judgement on whether we should suppress the

                  report.</div>

                <div> </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div bgcolor="#FFFFFF"> For should-not-have-happened

                    problems i'm much more confused. We're talking about

                    looking for places where it *could* have happened

                    and then trying to explain to the user why none of

                    them have actually happened. I'm not sure what are

                    the consequences of failing to explain to the user

                    why didn't a particular piece of code do something,

                    because i've no idea how do users intuitively figure

                    out which code *could* have done these things and

                    which clearly couldn't.<br>

                  </div>

                </blockquote>

                <div><br>

                </div>

                <div>Im really at a loss here :) Can you provide some

                  example as to what a problematic "must-have-happened"

                  bug report would look like, and how would you like to

                  see it improved? Same for "should-not-have-happened".

                  Because as I understand it, and I might be wrong, you

                  have problems with this:</div>

                <div><br>

                  <font face="courier new, monospace">int a; //

                    declaring a without initializing<br>

                    <br>

                  </font></div>

                <div><font face="courier new, monospace">if (rand()) //

                    assuming that the condition is false</font></div>

                <div><font face="courier new, monospace">  a = 5;</font></div>

                <div><font face="courier new, monospace"><br>

                  </font></div>

                <div><font face="courier new, monospace">print(a); // a

                    is undef</font></div>

                <div><br>

                  And prefer to see this:<br>

                  <br>

                  <font face="courier new, monospace">int a; //

                    declaring a without initializing<br>

                    <br>

                  </font></div>

                <div><font face="courier new, monospace">if (rand()) //

                    should this condition be false, a's value will be

                    indeterministic</font></div>

                <div><font face="courier new, monospace">  a = 5; //

                    writing to a skipped</font></div>

                <div><font face="courier new, monospace"><br>

                  </font></div>

                <div><font face="courier new, monospace">print(a); // a

                    is undef</font></div>

                <div><br>

                </div>

                <div>and I just don't see how this would be possible

                  with slicing at all. Also, I can't see how this would

                  scale to real-world production code.</div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div bgcolor="#FFFFFF"> <br>

                    <div class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487moz-cite-prefix">On

                      5/22/19 4:11 PM, Kristóf Umann wrote:<br>

                    </div>

                    <blockquote type="cite">

                      <div dir="ltr">

                        <div dir="ltr">

                          <div dir="ltr"><span id="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487gmail-docs-internal-guid-3c74b999-7fff-e99a-77f6-4bd5385fb78f">

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Hi!</span></p>

                              <br>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">I'd like to share some thoughts about my GSoC project, "Enhancing bug reporting with static backward program slicing"[1].</span></p>

                              <br>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">My proposal is very specific about </span><span style="font-family:Arial;background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">what</span><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> I'm aiming to improve on, but vague on the </span><span style="font-family:Arial;background-color:transparent;font-style:italic;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">how</span><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> part of it. This mail aims to add clarify some of this.</span></p>

                              <br>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Program slicing is essentially a technique of discovering data and control dependencies to the slicing criterion, which is a (statement, {set of variables}) pair. Fortunately, tools for control dependencies, namely, dominator set calculations are already implemented, but seems to be unstable with clang's CFG. It would be a great tool if I were able to fix it.</span></p>

                              <br>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">While my proposal states that I plan to implement an AST-based solution, I'm actually not sure that this is the optimal approach. We could instead inspect CFG blocks in a backwards manner (coupled with the analyzer's call stack), and gather relevant variable in the meanwhile.</span></p>

                              <br>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">What would also be really great is to assist this traversal with the information the analyzer already has, essentially only inspecting basic blocks the analyzer actually visits.</span></p>

                              <br>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Here’s my idea for an algorithm (from the point of the slicing criterion already being constructed):</span></p>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">

</span></p>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing" target="_blank">https://docs.google.com/document/d/1Lx867o3meyQsj0WKOSWMdosSBdw2MUq1aIxyPJM6iLU/edit?usp=sharing</a>

</span></p>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><br>

                              </p>

                              <p style="line-height:1.38;margin-top:0pt;margin-bottom:0pt">Please

                                note that this is a veeery rough sketch,

                                I didn't think about every edge case

                                that exists, but I think it's an okay

                                start.</p>

                              <p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt"><br>

                              </p>

                            </span><span id="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487gmail-docs-internal-guid-5954bf5e-7fff-362e-5b73-2faf8cd77d3f"><span style="font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">[1] <a href="https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit" target="_blank">https://docs.google.com/document/d/1ci1BCAKojPlqIxIw1J_K2dnATA3z01PuwR_vHJS55TI/edit</a></span></span>  <br class="gmail-m_-3323113253945746305gmail-m_-2632453692712164457gmail-m_-2008566417967606901m_-7889642444734106153gmail-m_-8947792374282452487gmail-Apple-interchange-newline">

                          </div>

                        </div>

                      </div>

                    </blockquote>

                    <br>

                  </div>

                </blockquote>

              </div>

            </div>

          </blockquote>

        </div>

      </div>

    </blockquote>

    <br>

  </div>

</blockquote></div>

</blockquote></div></div>