[PATCH] D136811: -Wunsafe-buffer-usage: WIP: RFC: NFC: User documentation.

Mon Nov 7 08:17:33 PST 2022

aaron.ballman added inline comments.

================
Comment at: clang/docs/SafeBuffers.rst:31
+    convert large amounts of old code to conform to the warning;
+  - Attribute ``[[unsafe_buffer_usage]]`` lets you annotate custom functions as
+    unsafe, while providing a safe alternative that can often be suggested by
----------------
xazax.hun wrote:
> NoQ wrote:
> > aaron.ballman wrote:
> > > xazax.hun wrote:
> > > > aaron.ballman wrote:
> > > > > Hmmmm that attribute name looks problematic (more below in that section).
> > > > With C++ modules gaining some momentum, I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened. 
> > > I think it makes sense on module export (the module author knows whether they are buffer-safe or not), but what do you have in mind for import where the user may not be in a position to make that assertion validly?
> > > I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened.
> > 
> > It's usually not that valuable for us to know that outside the module. If the module wasn't hardened, there's nothing the "caller module" can do about it, the recommended approach is to harden the module first, and then use the attribute to indicate which functions were found to be unsafe in that process (especially if safe alternatives were implemented).
> > 
> > So it's only valuable when the module *will never be hardened*, and the only thing the caller module can do in such case is isolate calls to that module, maybe cover it with a safe abstraction. In this case, yeah, I think it makes sense. Maybe we should consider this eventually.
> Wouldn't `only_safe_buffers` make sense on the import side to ensure that no unhardened code is being imported from a specific module?
> Wouldn't only_safe_buffers make sense on the import side to ensure that no unhardened code is being imported from a specific module?

But the import side has no way to know that the module being imported has no unhardened code, only the module implementer knows that.

That said, I could imagine wanting to mark a module import as "will never be hardened" as a blanket way to say "don't complain about interfaces in this module I have no control over".

But now I'm questioning something from the docs:

> Pragmas ``unsafe_buffer_usage`` and ``only_safe_buffers`` and allow you to
> annotate sections of code as either already-hardened or not-to-be-hardened,

It now occurs to me there's two ways to read this.

`unsafe_buffer_usage` = will diagnose any unsafe buffer usage, it's already hardened.
`only_safe_buffers` == should only be used with known-safe buffers, it's not going to be hardened

or

`unsafe_buffer_usage` == this uses unsafe buffers, it's not to be hardened
`only_safe_buffers` == this uses only safe buffers, it's already been hardened

I originally read it the first way from the docs in the patch but the comment from @xazax.hun reads to me like he read it the second way.

================
Comment at: clang/docs/SafeBuffers.rst:40-41
+  - Finally, in order to avoid bugs in newly converted code, the
+    Clang static analyzer provides a checker to find misconstructed
+    ``std::span`` objects.
+
----------------
xazax.hun wrote:
> jkorous wrote:
> > NoQ wrote:
> > > xazax.hun wrote:
> > > > Isn't this overly specific? What about `string_view`s? 
> > > Yeah I think we can cover string views as well!
> > This is based on our current plans for near future.
> > While we are positive we will use `std::span` in the FixIts it is very unlikely we'd use `string_view`. That's why having a checker for `std::span` is more important for us now.
> Does this mean you'd recommend using a `span<char>` over `string_view` in all cases? I think that might surprise some users, so I wonder if it is worth documenting. 
Oh, I was sort of assuming we'd be suggesting `string_view` over `span<char>` when appropriate. So agreed that we should clearly document this as it may surprise folks.

Out of curiosity, why the strong preference for `span`?

================
Comment at: clang/docs/SafeBuffers.rst:132
+    attribute ``[[unsafe_buffer_usage]]``,
+      - unless the pointer is a compile time constant ``0`` or ``nullptr``;
+      - a number of C/C++ standard library buffer manipulation functions
----------------
NoQ wrote:
> aaron.ballman wrote:
> > `NULL` as well, I presume? (This matters for C where `NULL` often expands to `(void *)0`.)
> Hmm, I was thinking that `((void *)0)` is a compile-time constant `0`. Like, I didn't say it's a *literal*. Is there a better term?
Are you okay with constant expressions? e.g.,
```
consteval int foo() { return 0; }

void func(char *ptr) {
  char *offset_ptr = ptr + foo(); // Is this fine?
}
```

================
Comment at: clang/docs/SafeBuffers.rst:173
+  #pragma unsafe_buffer_usage begin
+
+Such pragmas not only enable incremental adoption with much smaller granularity,
----------------
jkorous wrote:
> aaron.ballman wrote:
> > jkorous wrote:
> > > aaron.ballman wrote:
> > > > Have you considered allowing a short-hand form of the pragmas that are scoped to the current block scope (or file scope if applied at global scope)? e.g.,
> > > > ```
> > > > void func() {
> > > >   {
> > > >     #pragma only_safe_buffers
> > > >     int stuff[10] = {};
> > > >     int x = stuff[1]; // Warning: you're a bad person who should feel bad
> > > >   }
> > > >   int array[10] = {};
> > > >   int y = array[1]; // this_is_fine.png
> > > > }
> > > > ```
> > > We discussed other options but generally prefer simple pragmas for their flexibility.
> > > Our current thinking is that for readability purposes we won't allow nesting of pragmas and would have end of scope to be explicit.
> > > While this might be more verbose it would be dead simple to reason about.
> > > We discussed other options but generally prefer simple pragmas for their flexibility.
> > 
> > A single pragma with lexical scoping behaves like RAII, which is often a replacement for more complex acquire/release paired operations and is a pretty well-understood pattern in C++.
> > 
> > > Our current thinking is that for readability purposes we won't allow nesting of pragmas and would have end of scope to be explicit.
> > > While this might be more verbose it would be dead simple to reason about.
> > 
> > True, it is easy to reason about. But not allowing nesting for readability purposes removes some utility, especially for folks in C where macros are much more common. e.g.,
> > ```
> > #define ONLY_SAFE_BUFFERS _Pragma("only_safe_buffers")
> > 
> > #define AWESOME_BUFFER_THING(ptr, size) ({ ONLY_SAFE_BUFFERS; (interesting work goes here) })
> > 
> > void func(int *ptr, size_t size) {
> >   ONLY_SAFE_BUFFERS;
> >   ... interesting code lives here ...
> >   AWESOME_BUFFER_THING(ptr, size); // Oops, this is nested.
> >  .. more interesting code lives here ...
> > }
> > ```
> > (You can hit the same situation with paired pragmas.) Because C doesn't have constexpr or consteval support for functions or templates, function-like macros are a pretty common interface folks use in practice and macro expansion makes it somewhat hard to avoid potential surprising conflicts.
> It is desirable to minimize the extent of code labeled as unsafe - if a single line of code is unsafe then it'd be suboptimal do label a whole scope as unsafe.
> Since scopes affect objects lifetime in C++ we won't be able to generally recommend adding a scope around the unsafe line either as that might change behavior of the code.
> In this sense associating the pragmas' semantics with scopes reduces their flexibility. Is there any benefit of doing that which would outweigh this?
> 
> Separately from that - in theory the users could just end the scope before the macro:
> ```
>   ONLY_SAFE_BUFFERS_BEGIN;
>   ... interesting code lives here ...
>   ONLY_SAFE_BUFFERS_END;
>   AWESOME_BUFFER_THING(ptr, size); // Oops, this is nested.
> ```
> 
> But point taken - there might be options for better ergonomics.
> In this sense associating the pragmas' semantics with scopes reduces their flexibility. Is there any benefit of doing that which would outweigh this?

FWIW, I wasn't suggesting we support *only* the scoped form, but that we *also* supported a scoped form. The benefit really boils down to more complex functions (generally ones in need of refactoring anyway) where you already have an appropriate scope handy and pairing the pragmas would be clutter. e.g.,
```
if (ptr) {
  #pragma clang unsafe_buffer_usage
  ptr += 12;
}
```
vs
```
if (ptr) {
  #pragma clang unsafe_buffer_usage begin
  ptr += 12;
  #pragma clang unsafe_buffer_usage end
}
```
I don't see a need to write two pragmas in these kinds of cases where the compound scope is already a clear delimiter. WDYT?

================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
jkorous wrote:
> NoQ wrote:
> > aaron.ballman wrote:
> > > jkorous wrote:
> > > > aaron.ballman wrote:
> > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".  But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > 
> > > > > I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give.
> > > > > 
> > > > > Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?)
> > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > 
> > > > Our model depends on Standard Library used implementing the bounds checks.
> > > > With that the "container was used" distinction is absolutely crucial - that is what adds the bounds checks and already mitigates certain classes of bugs (even if the `span` passed in still has the same wrong size).
> > > > 
> > > > Let's assume we're starting with this buggy code:
> > > > 
> > > > ```
> > > > void caller() {
> > > >   int buffer[10];
> > > >   callee(buffer, 20);
> > > > }
> > > > 
> > > > void callee(int* ptr, unsigned size) {
> > > >   ptr[size-1] = 42;
> > > >   ptr[size] = 42;
> > > >   ptr[100] = 42;
> > > > }
> > > > ```
> > > > and transform it to (this is just a part of what we actually plan but hopefully illustrates the point):
> > > > ```
> > > > void caller() {
> > > >   int buffer[10];
> > > >   callee(std::span<int>(buffer, 20), 20);
> > > > }
> > > > 
> > > > void callee(std::span<int> ptr, unsigned size) {
> > > >   ptr[size-1] = 42; // still unmitigated
> > > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > }
> > > > ```
> > > > Our model depends on Standard Library used implementing the bounds checks.
> > > 
> > > So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?
> > > 
> > > > ```
> > > > void caller() {
> > > >   int buffer[10];
> > > >   callee(std::span<int>(buffer, 20), 20);
> > > > }
> > > > 
> > > > void callee(std::span<int> ptr, unsigned size) {
> > > >   ptr[size-1] = 42; // still unmitigated
> > > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > }
> > > > ```
> > > 
> > > Understood; my point was that the comments in `callee()` are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source.
> > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".
> > 
> > Yeah that's an important thing for us to properly formulate, let me try to elaborate on what I meant.
> > 
> > So we trust containers and spans, like we trust smart pointers: if you simply use them "normally", they're correct by construction. You have to go out of your way to get them wrong. Like, you take an array, it knows the size, you implicitly convert it to span, it's now well-formed, you pass it from function to function, it's still well-formed as it still carries the original bounds information, you subspan it... well this needs a runtime bounds check yeah, but other than that it's hard to get it wrong. And our tool is supposed to transform the entire program to do exactly this. And once we're there, none of these operations require manual intervention from the user, so there's very little room for error.
> > 
> > On the contrary, if bounds information is passed separately, there's a lot of room for error every single time it's passed from function to function. There's literally nothing that prevents such information from going out of sync. There's nothing that prevents the user from passing a size of data he *intends to be written* into the buffer, instead of the size of the buffer. So we don't trust that.
> > 
> > Does that make sense?
> The warnings about unsafe buffer usage would be valid no matter what Standard Library implementation is used and I can imagine the FixIts to exist in different flavors.
> We don't plan to target any other Standard Library implementation but the machinery we'll bring up should allow that to be implemented in the future.
> Does that make sense?

It does, but the situation I'm thinking of is where the user is migrating their unsafe code to this new programming mode and aren't able to use libc++ to get the extra hardening checks. Consider this contrived example:
```
extern "C" void func(int *array, size_t count) {
  // Do buffer things with array and count
}
```
The user can't (easily) modify this code to accept a span directly because that could potentially break ABI. So they wrap the data in a span:
```
extern "C" void func(int *array, size_t count) {
  std::span<int> sp(array, count);
  // Do buffer things with sp
}
```
In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses `span` -- but more trustworthy based on what? There's no hardened checks in the STL they're using and the potentially-wrong information is the same flavor of potentially-wrong in either case.

My fear is that we push users to make changes like that by telling them it's safer when it's actually functionally equivalent instead, but when they migrates their code they accidentally *introduce* a new bug where there wasn't one before.

I'm used to thinking about this sort of stuff as a kind of taint analysis. We need to know where that data comes from in order to know whether we can trust it or not. Did the "count" come from user input on the command line... or did it come from a call to `.size()` on some other container? These have different levels of trust (to me).

Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like `extern "C"` or the function linkage to try to reduce the amount of ABI breaks suggested?

================
Comment at: clang/docs/SafeBuffers.rst:257-259
+* Use of a function annotated by such attribute causes ``-Wunsafe-buffer-usage``
+  warning to appear, instead of ``-Wdeprecated``, so they can be
+  enabled/disabled independently as all four combinations make sense;
----------------
NoQ wrote:
> aaron.ballman wrote:
> > I don't think I understand what you're trying to say here.
> I was trying to say that `[[deprecated]] is probably appropriate if the vendor of the function expects *all* users of the function to move away from it.
> 
> But if the vendor thinks that the function is generally fine, and it's only important to move away from it when the `-Wunsafe-buffer-usage` analysis is requested by the user, then `[[unsafe_buffer_usage]]` is more appropriate as it only causes a warning to be emitted under that condition, not every time.
> 
> I'll wordsmith.
> I was trying to say that `[[deprecated]] is probably appropriate if the vendor of the function expects *all* users of the function to move away from it.

Ahhhh that makes a lot more sense to me, thanks!

================
Comment at: clang/docs/SafeBuffers.rst:352
+
+  - The compatibility overload contains a ``__placeholder__`` which needs
+    to be populated manually. In this case ``size`` is a good candidate.
----------------
jkorous wrote:
> aaron.ballman wrote:
> > xazax.hun wrote:
> > > Just a small concern. While users are not allowed to use identifiers like this, there is always a chance someone has a global variable of this name and the code ends up compiling. So, if we want to be absolutely safe, we should either have logic to pick a unique name, or not generate fixits in that case. Although, this is probably rare enough that we could just hand-wave.
> > Agreed; we should not have a fixit that suggests use of a reserved identifier.
> Our thinking is to actually use syntactically incorrect placeholder to make sure that users can't accidentally miss it.
> This is related to:
> https://reviews.llvm.org/D136811#inline-1324542
> Our thinking is to actually use syntactically incorrect placeholder to make sure that users can't accidentally miss it.

So long as that happens on a note, that would be fine depending on how you name the identifier. e.g., remember that we have a feature flag to support dollar signs in identifiers, we need to protect against macros, that sort of thing.

Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136811/new/

https://reviews.llvm.org/D136811