[cfe-dev] [Analyzer] Equality of pointers

Mon Jun 10 00:10:27 PDT 2019

Great points. See my comments inline.

On Mon, 10 Jun 2019 at 05:31, Artem Dergachev <noqnoqneo at gmail.com> wrote:

> I believe in our current situation we should simply generate a sink
> whenever we discover symbolic pointer aliasing, because RegionStore always
> implicitly assumes that there's no aliasing. Eg., consider:
>
>   void example1(int *x, int *y) {
>     *x = 1; // (SymRegion{reg_$1<x>}, 0, direct): 1 S32b
>     *y = 2; // (SymRegion{reg_$2<y>}, 0, direct): 2 S32b
>     if (x == y) {
>       // ???
>     }
>   }
>
> The correct way to model the then-branch of the if-statement is to figure
> out which of the bindings is *newer* and keep it; in this case it'll mean
> dropping the binding of 'x' and keeping the binding of 'y'. This would mean
> that the value of the common pointee of 'x' and 'y' would be equal to 2.
> However, in its current shape RegionStore simply does not track history of
> bindings. We can make it track the history, but it'll be a ridiculously
> complex data structure - and even the current simple structure is so
> complicated and messy that nobody understands how it works.
>

In case aliasing is rare enough we could just traverse the path backwards
until we get the first store to the memory region in question. But I do
agree that this would be a suboptimal solution. What do you mean by
tracking the history? Wouldn't for the bindings be sufficient? (Though I do
understand that making bindings more fat is undesired.)

>
> All right, suppose we make a new Store implementation that'll handle
> aliasing perfectly and is otherwise perfect in every way. Suppose we also
> have a perfect constraint solver that knows that in the following code
>
>   void example2(int *x, int *y) {
>     if (*x > 5)
>       if (*y < 10)
>         if (x == y) {
>           // ???
>         }
>   }
>
> the valid constraint range for the common pointee of 'x' and 'y' would be
> [6, 9] (this won't be immediately solved by the Z3 solver because heap
> shape information still needs to be stuffed into it). Now, how about these
> examples?:
>
>   void example3(int *x, int *y) {
>     delete x;
>     if (x == y) {
>       // ???
>     }
>   }
>

I can see three possible ways to deal with this example. One is what you
described in this mail, rerunning the analysis with a new set of
constraints. The second one is what you proposed earlier and called smart
GDM if I recall correctly. This way the analyzer engine could update the
GDM whenever it learns about new aliasing information. The third one is
something I encountered in some other tools: keeping track of alias
equivalence classes and use one element of the class to identify the whole
class (lets call this representative for now). Unfortunately, this would
require us add new callbacks to the checker API e.g.:  in case the
representative is dead and we need a new one for the class.  The checkers
could update their GDM in this callback.

>
>   void example4(int *x, int *y) {
>     delete x;
>     delete y;
>     if (x == y) {
>       // ???
>     }
>   }
>
> In example 3 the common pointee would be deleted. It'd cause a warning if
> 'y' is used later. In example 4 it's an outright retroactive double-free.
> We need to modify MallocChecker to handle both of these cases correctly.
>

I think this one is very similar to division by zero in a sense that
whenever we see a division operation we either report a warning or assume
that the denominator is not zero. So if we want to be consistent with that
behavior we should always either report double free, or (in this particular
code snippet) assume that x and y do not alias. Of course, this assumption
might increase the number of constraints significantly, so I can really see
why is this not done at the moment.

>
> If the same happens to RetainCountChecker, we'll need to *add* reference
> counts of the two pointers together. That's already a large variety of very
> non-trivial operations that we need to perform. How about aliasing file
> descriptors (which aren't even necessarily pointers)? How about aliasing
> two mutexes one of which is locked and the other is unlocked? How much
> completely non-trivial and non-reusable code do we need to write in dozens
> of checkers to solve this problem, given that we can barely solve it for
> RegionStore?
>

While I see your point, I would argue all these checkers need to do
different things when a value become dead. This is why we have callbacks as
customization points. The question is, whether we could define a callback
that makes sense, not too complicated to use, and we can retain the old
behavior if a checker does not implement it. Of course, I do agree that if
we can solve the problem without complicating the checker API that would be
great.

>
> ===========
>
> What we *can* do, however, is to not only generate the sink, but *restart*
> the analysis from the beginning, assuming that the two pointers point to
> the same memory from the start. Say, force the SymbolManager that's owned
> by this analysis's ExprEngine to produce reg_$1<x> whenever it's asked for
> reg_$2<y>. This would make sure that we don't make wrong assumptions in the
> first place, so that we didn't have to undo them in the middle of the
> analysis. This wouldn't require much checker-side support, so it'll be much
> more in the spirit of the Analyzer. And it's also much easier to implement.
> As usual, we'll need some heuristics to make sure we don't exponentially
> explode on the number of different combinations of pointers that can
> potentially alias.
>

We might discover multiple aliasing pointers during the initial pass. Do
you plan to do two runs one with all the pointers unaliased and one with
all the discovered aliasing assumed? Or do you think it is worth to do
reruns with other configurations? While it is definitely easier from the
checker maintainers point of view, the performance might be worse compared
providing new callbacks. My other concern is slightly different. With new
assumptions the shape of the exploded graph could be very different, so the
analysis budget could be spent on very different parts of the program. I
wonder if this would make it harder to understand coverage patterns for us
(what if a code is only covered in one of the runs? Is it desired or bad in
that particular case?). And finally, what would be the implications from
the user's point of view? Do we need to explain the initial assumptions,
i.e. why we assumed that two pointers might alias? If one of the
assumptions are wrong how could the user communicate that to the analyzer?
Where should she put an assert?

To summarize, while I do agree that keeping the checker API simple is good
I am not convinced yet whether one of the solutions are clearly superior to
the others.

>
>
> On 05.06.2019 05:23, Gábor Horváth via cfe-dev wrote:
>
> Hi!
>
> This is an excellent point and it is not easy to solve. One of the reason
> why your original idea is hard to implement is the following. Let's look at
> the code snippet below:
> void f(int *p, int *q) {
>   int *p2 = p + 2;
>   if (p == q) {
>     // Here we learned something new about p2 as well.
>     // So we might need to update all the symbolic expressions and
> bindings, not just p.
>   }
> }
>
> I think your second idea should work well, but I have no idea about the
> performance implications. As a starting point we could implement your
> second solution and do some benchmarks both performance wise and looking at
> the results from some open source projects to see what happens to the false
> positive rate.
>
> Out of curiosity, do we have the same problems with invalidations?
>
> I am sure Artem will have some more to add but as far as I know he is busy
> with WWDC this week.
>
> Regards,
> Gabor
>
> On Tue, 4 Jun 2019 at 14:53, Ádám Balogh via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Hello,
>>
>>
>>
>> This is not the first time that we are facing the following problem.
>>
>>
>>
>> Given a code that clears all non-sentinel elements from a circular
>> double-linked list with sentinel element:
>>
>>
>>
>> typedef struct ListNode {
>>
>>   struct ListNode *next;
>>
>>   struct ListNode *prev;
>>
>>   void *data;
>>
>> } ListNode;
>>
>>
>>
>> void ListClear(ListNode *list) {
>>
>>   while (list->next != list) {
>>
>>     ListNode *node = list->next;
>>
>>
>>
>>     node->next->prev = node->prev;
>>
>>     node->prev->next = node->next;
>>
>>
>>
>>     node->next = NULL;
>>
>>     node->prev = NULL;
>>
>>
>>
>>     free(node);
>>
>>   }
>>
>> }
>>
>>
>>
>> unix.Malloc Use after free is reported in the second iteration of the
>> loop for `node`.
>>
>>
>>
>> This is not necessarily a false positive if analyzed top level. The
>> function that clears the nodes of the list implicitly assumes that it is a
>> consistent circular double-linked list with a sentinel node. However this
>> invariant is not ensured anywhere inside the function. It is thus not known
>> that the `prev` pointer of the first non-sentinel node points back to the
>> sentinel node, neither is it known that applying a `prev` and a `next`
>> after each other in any order leads back to the original node. Without
>> these invariants the analyzer cannot know that after unlinking and freeing
>> the first non-sentinel element and then taking again the first non-sentinel
>> element is not equal to the original sentinel element. So it assumes that
>> they are equal and used again after freeing its memory space. Thus from the
>> perspective of the standalone function it can be a use-after-free situation.
>>
>>
>>
>> So I fixed the function by inserting assertions to ensure the invariant:
>>
>>
>>
>> void ListClear(ListNode *list) {
>>
>>   while (list->next != list) {
>>
>>     ListNode *node = list->next;
>>
>>
>>
>>     assert(node->prev == list);
>>
>>     assert(node->next->prev == node);
>>
>>     assert(node->prev->next == node);
>>
>>
>>
>>     node->next->prev = node->prev;
>>
>>     node->prev->next = node->next;
>>
>>
>>
>>     node->next = NULL;
>>
>>     node->prev = NULL;
>>
>>
>>
>>     free(node);
>>
>>   }
>>
>> }
>>
>>
>>
>> However, it did not help because the analyzer cannot handle equality of
>> pointers correctly. If we have an assumption that two pointers are equal
>> the analyzer still creates two symbolic regions and records a constraint
>> that the symbols of the symbolic regions are equal. Later this constraint
>> can be used for deciding whether they are equal or not but it does not
>> handle them as the same region. So if we change the pointer
>> `node->prev->next` to `node->next` then `node->prev->next` is changed but
>> `list->next` not even if we assumed that they are the same. That is the
>> reason the analyzer takes the freed region again in the next iteration of
>> the loop and reports an error.
>>
>>
>>
>> A strange thing is that the constraint manager rearranges the comparison
>> of the two symbols (of the symbolic regions) to a difference without
>> `aggressive-binary-operation-simplification` disabled. (It should not do it
>> even if this option is enabled since the ranges of symbolic values are not
>> constrained to `MAX/4` of the type.)
>>
>>
>>
>> The question is how to fix this. I think that if two pointers are the
>> same, then they must point to exactly the same memory region which means
>> exactly the same symbol if it is a symbolic region. Thus instead of
>> assigning `$reg2` to `node->prev` and recording whether the range of
>> `$reg2-$reg0` is `[0..0]` the analyzer should assign `$reg0` to
>> `node->prev` which is the region of `list`. Similarly, `node->next->prev`
>> should not become `$reg7` where the range of `$reg7-$reg1` is `[0..0]` and
>> `node->prev->next ` should not become `$reg11` where the range of
>> `$reg11-$reg1` is `[0..0]` but both should become `$reg1`. This way when we
>> unlink `node` `node->next->prev` becomes `reg0` (thus the region of `list`)
>> and `node->prev->next` e.g. `$reg2`. This way in the next iteration
>> `node->next` is `$reg2` instead of the freed `$reg1` so the false warning
>> goes away. Of course I know this is not easy to implement. If both pointers
>> already have regions assigned then which one to chose and how to replace
>> the other one?
>>
>>
>>
>> Another solution could be iterating all the existing regions upon
>> bindings and do the same binding to the regions equal to the affected
>> region. (I hope I managed to express myself clearly enough here.)
>>
>>
>>
>> This is an old weakness of the Analyzer which should be solved eventually
>> because it causes false positives. A new constraint manager does not help
>> because the problem is not there.
>>
>>
>>
>> Any ideas?
>>
>>
>>
>> Regards,
>>
>>
>>
>> Ádám
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
> _______________________________________________
> cfe-dev mailing listcfe-dev at lists.llvm.orghttps://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190610/f29d48e7/attachment.html>