[PATCH] D153059: [-Wunsafe-buffer-usage] Group parameter fix-its

Thu Jun 22 12:22:50 PDT 2023

NoQ added inline comments.

================
Comment at: clang/lib/Analysis/UnsafeBufferUsage.cpp:2293
+    //      search of connected components.
+    if (!ParmsNeedFix.empty()) {
+      auto First = ParmsNeedFix.begin(), Last = First;
----------------
ziqingluo-90 wrote:
> NoQ wrote:
> > What about the situation where params aren't seen as unsafe yet, but they're discovered to be unsafe later? Eg.
> > ```
> > void foo(int *p, int *q) {
> >   int *p2 = p;
> >   p2[5] = 7;
> >   int *q2 = q;
> >   q2[5] = 7;
> > }
> > ```
> > Will we notice that `p` and `q` need to be fixed simultaneously?
> > 
> > ---
> > 
> > I suspect that the problem of parameter grouping can't be solved by pre-populating strategy implications between all parameters. It'll either cause you to implicate all parameters (even the ones that will never need fixing), or cause you to miss connections between parameters that do need fixing (as the connection is discovered later).
> > 
> > Connection between parameters needs to be added *after* the implications algorithm has finished. And once it's there (might be already there if I missed something), then this part of code probably becomes unnecessary.
> Yes.  They will be noticed.  Here is why:
> 
> The code here first forms a ring over `p` and  `q` in the assignment (directed) graph:  
> ```
> p <--> q
> ```
> Then the two initialization statements (implemented in [[ https://reviews.llvm.org/D150489 | D150489 ]]) add two more edges to the graph:
> ```
> p2 --> p <--> q <-- q2
> ```
> The algorithm later searches the graph starting from unsafe variables `p2` and `q2`, respectively,  Starting from `p2`, the algorithm discovers that `p2` and `p`, as well as `p` and `q` depend on each other.  Starting from `q2`,  the algorithm discovers that `q2` and `q`, as well as `p` and `q` depend on each other.   The dependency graph is sort of undirected. So eventually, the four variables `p2`, `p`, `q`, `q2` are in the same group.
> 
> The code here first forms a ring over `p` and  `q` in the assignment (directed) graph

I don't think it does. The ring is formed over the `ParamsNeedFix` list, which is empty because none of these parameters are listed in `UnsafeOps.byVar`.

================
Comment at: clang/test/SemaCXX/warn-unsafe-buffer-usage-fixits-multi-parm-span.cpp:13-15
+   expected-note{{change type of 'p' to 'std::span' to preserve bounds information, and change 'q' and 'a' to 'std::span' to propagate bounds information between them}} \
+   expected-note{{change type of 'q' to 'std::span' to preserve bounds information, and change 'p' and 'a' to 'std::span' to propagate bounds information between them}} \
+   expected-note{{change type of 'a' to 'std::span' to preserve bounds information, and change 'p' and 'q' to 'std::span' to propagate bounds information between them}}
----------------
ziqingluo-90 wrote:
> ziqingluo-90 wrote:
> > NoQ wrote:
> > > QoL thing: //"to propagate bounds information between them"// isn't the correct reason in this case. Bounds information doesn't propagate between `p` and `q` and `a`. Each of them carries its own, completely unrelated bounds information.
> > > 
> > > We need to come up with a different reason. Or it might be better to combine the warnings into one warning here, so that we didn't need a reason:
> > > ```
> > > warning: 'p', 'q' and 'a' are unsafe pointers used for buffer access
> > > note: change type of 'p', 'q' and 'a' to 'std::span' to preserve bounds information
> > > ```
> > > 
> > > This gets a bit worse when parameters are implicated indirectly, but it still mostly works fine, for example:
> > > ```
> > > void foo(int *p, int *q) {
> > >   int *p2 = p;
> > >   p2[5] = 7;
> > >   int *q2 = q;
> > >   q2[5] = 7;
> > > }
> > > ```
> > > would produce:
> > > ```
> > > warning: 'p2' and 'q2' are unsafe pointers used for buffer access
> > > note: change type of 'p2' and 'q2' to 'std::span' to preserve bounds information, and
> > >       change 'p' and 'q' to 'std::span' to propagate bounds information between them
> > > ```
> > > 
> > > ---
> > > 
> > > In any case, this shows that the relationship between parameters (grouped together simply for being parameters) is very different from the relationship within groups of variables implicated by assignments (even if two or more of them are parameters). All the way down to the warning/note printer, we probably need to continue giving the parameter relationship a special treatment.
> > This is a good argument!  An edge from `p` to `q` in an assignment graph stands for not only that fixing `p` implicates fixing `q` but also that the bounds information is propagated from `q` to `p`.   In this sense, the way I connect parameters is inappropriate.
> > 
> > But my concern is that although treating parameters specially could improve the quality of the diagnostic messages, the code may be more complex and less re-used.
> So for such an example
> ```
> int f(int *p, int *q) {
> int * a = p;
> int * b = q;
> 
> a[5] = 0;
> b[5] = 0;
> }
> ```
> We will fix all four variables `a, b, p, q` together.  But bounds information propagation only happens between `a` and `p`,  and `b` and `q`, respectively. 
> Then what should the diagnostic message be?  How about something like
> ```
> change type of 'a' to 'std::span' to preserve bounds information, and change 'p' to propagate bounds information between them. 
> To ensure parameters are changed together, also
> change type of 'b' to 'std::span' to preserve bounds information, and change 'q' to propagate bounds information between them;
> ...
> ```
> for variable `a`.
I think the ideal narrative is to emit a single warning against the entire function, which covers all parameters at once, together with all related variables.

```lang=c++
int f(int *p, int *q) {  // warning: function 'f' performs unsafe buffer access over parameters 'p' and 'q'
                         // note: change type of 'p' and 'q' to 'std::span' to receive bounds information from the call site,
                         //       and change local variables 'a' and 'b' to 'std::span' to propagate it to the access site

  int * a = p;           // note: bounds information needs to propagate from 'p' to 'a' here
  int * b = q;           // note: bounds information needs to propagate from 'q' to 'b' here

  a[5] = 0;              // note: used in buffer access here
  b[5] = 0;              // note: used in buffer access here
}
```

This makes a lot of sense given that we're fixing the entire function. This is quite a major change though, we need to think this through.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153059/new/

https://reviews.llvm.org/D153059