[PATCH] D136811: -Wunsafe-buffer-usage: WIP: RFC: NFC: User documentation.

Thu Nov 3 18:54:18 PDT 2022

jkorous added inline comments.

================
Comment at: clang/docs/SafeBuffers.rst:173
+  #pragma unsafe_buffer_usage begin
+
+Such pragmas not only enable incremental adoption with much smaller granularity,
----------------
aaron.ballman wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > Have you considered allowing a short-hand form of the pragmas that are scoped to the current block scope (or file scope if applied at global scope)? e.g.,
> > > ```
> > > void func() {
> > >   {
> > >     #pragma only_safe_buffers
> > >     int stuff[10] = {};
> > >     int x = stuff[1]; // Warning: you're a bad person who should feel bad
> > >   }
> > >   int array[10] = {};
> > >   int y = array[1]; // this_is_fine.png
> > > }
> > > ```
> > We discussed other options but generally prefer simple pragmas for their flexibility.
> > Our current thinking is that for readability purposes we won't allow nesting of pragmas and would have end of scope to be explicit.
> > While this might be more verbose it would be dead simple to reason about.
> > We discussed other options but generally prefer simple pragmas for their flexibility.
> 
> A single pragma with lexical scoping behaves like RAII, which is often a replacement for more complex acquire/release paired operations and is a pretty well-understood pattern in C++.
> 
> > Our current thinking is that for readability purposes we won't allow nesting of pragmas and would have end of scope to be explicit.
> > While this might be more verbose it would be dead simple to reason about.
> 
> True, it is easy to reason about. But not allowing nesting for readability purposes removes some utility, especially for folks in C where macros are much more common. e.g.,
> ```
> #define ONLY_SAFE_BUFFERS _Pragma("only_safe_buffers")
> 
> #define AWESOME_BUFFER_THING(ptr, size) ({ ONLY_SAFE_BUFFERS; (interesting work goes here) })
> 
> void func(int *ptr, size_t size) {
>   ONLY_SAFE_BUFFERS;
>   ... interesting code lives here ...
>   AWESOME_BUFFER_THING(ptr, size); // Oops, this is nested.
>  .. more interesting code lives here ...
> }
> ```
> (You can hit the same situation with paired pragmas.) Because C doesn't have constexpr or consteval support for functions or templates, function-like macros are a pretty common interface folks use in practice and macro expansion makes it somewhat hard to avoid potential surprising conflicts.
It is desirable to minimize the extent of code labeled as unsafe - if a single line of code is unsafe then it'd be suboptimal do label a whole scope as unsafe.
Since scopes affect objects lifetime in C++ we won't be able to generally recommend adding a scope around the unsafe line either as that might change behavior of the code.
In this sense associating the pragmas' semantics with scopes reduces their flexibility. Is there any benefit of doing that which would outweigh this?

Separately from that - in theory the users could just end the scope before the macro:
```
  ONLY_SAFE_BUFFERS_BEGIN;
  ... interesting code lives here ...
  ONLY_SAFE_BUFFERS_END;
  AWESOME_BUFFER_THING(ptr, size); // Oops, this is nested.
```

But point taken - there might be options for better ergonomics.

================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
NoQ wrote:
> aaron.ballman wrote:
> > jkorous wrote:
> > > aaron.ballman wrote:
> > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".  But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > 
> > > > I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give.
> > > > 
> > > > Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?)
> > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > 
> > > Our model depends on Standard Library used implementing the bounds checks.
> > > With that the "container was used" distinction is absolutely crucial - that is what adds the bounds checks and already mitigates certain classes of bugs (even if the `span` passed in still has the same wrong size).
> > > 
> > > Let's assume we're starting with this buggy code:
> > > 
> > > ```
> > > void caller() {
> > >   int buffer[10];
> > >   callee(buffer, 20);
> > > }
> > > 
> > > void callee(int* ptr, unsigned size) {
> > >   ptr[size-1] = 42;
> > >   ptr[size] = 42;
> > >   ptr[100] = 42;
> > > }
> > > ```
> > > and transform it to (this is just a part of what we actually plan but hopefully illustrates the point):
> > > ```
> > > void caller() {
> > >   int buffer[10];
> > >   callee(std::span<int>(buffer, 20), 20);
> > > }
> > > 
> > > void callee(std::span<int> ptr, unsigned size) {
> > >   ptr[size-1] = 42; // still unmitigated
> > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > }
> > > ```
> > > Our model depends on Standard Library used implementing the bounds checks.
> > 
> > So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?
> > 
> > > ```
> > > void caller() {
> > >   int buffer[10];
> > >   callee(std::span<int>(buffer, 20), 20);
> > > }
> > > 
> > > void callee(std::span<int> ptr, unsigned size) {
> > >   ptr[size-1] = 42; // still unmitigated
> > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > }
> > > ```
> > 
> > Understood; my point was that the comments in `callee()` are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source.
> > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".
> 
> Yeah that's an important thing for us to properly formulate, let me try to elaborate on what I meant.
> 
> So we trust containers and spans, like we trust smart pointers: if you simply use them "normally", they're correct by construction. You have to go out of your way to get them wrong. Like, you take an array, it knows the size, you implicitly convert it to span, it's now well-formed, you pass it from function to function, it's still well-formed as it still carries the original bounds information, you subspan it... well this needs a runtime bounds check yeah, but other than that it's hard to get it wrong. And our tool is supposed to transform the entire program to do exactly this. And once we're there, none of these operations require manual intervention from the user, so there's very little room for error.
> 
> On the contrary, if bounds information is passed separately, there's a lot of room for error every single time it's passed from function to function. There's literally nothing that prevents such information from going out of sync. There's nothing that prevents the user from passing a size of data he *intends to be written* into the buffer, instead of the size of the buffer. So we don't trust that.
> 
> Does that make sense?
The warnings about unsafe buffer usage would be valid no matter what Standard Library implementation is used and I can imagine the FixIts to exist in different flavors.
We don't plan to target any other Standard Library implementation but the machinery we'll bring up should allow that to be implemented in the future.

================
Comment at: clang/docs/SafeBuffers.rst:365-370
+The fixits emitted by the warning are correct modulo placeholders. Placeholders
+are the only reason why fixed code is allowed to fail to compile.
+Incorrectly resolving the placeholder is the only reason why fixed code
+will demonstrate incorrect runtime behavior compared to the original code.
+In an otherwise well-formed program it is always possible (and usually easy)
+to resolve the placeholder correctly.
----------------
NoQ wrote:
> aaron.ballman wrote:
> > jkorous wrote:
> > > aaron.ballman wrote:
> > > > Because fixits can be applied automatically to an entire source file, we don't typically allow fix-its that can break code (or fixits where there are multiple choices for the user to pick from) unless they're attached to a *note* instead of a warning (or error).
> > > > 
> > > > Are you planning to follow that same pattern when placeholders are involved?
> > > > 
> > > > (For example, one potential use for automatic fix-its is to apply them automatically to source on commit before running precommit CI; similar to how folks sometimes run clang-format to automate "the source is different before/after the clang-format, so you should look into that".)
> > > That's an extremely interesting question! Let me add our current thinking as a context for the discussion.
> > > 
> > > Our analysis will have its limits and will not be able to provide 100% complete and guaranteed correct code transformation in every case.
> > > 
> > > At the same time it sounds suboptimal both from our and our users perspective to not emit anything for such situations.
> > > Imagine we have a code transformation that affects 20 lines of code and all we need is for the user to address a single function call argument. 
> > > The thought of bailing, producing just a warning and having user to manually change all 20 lines of code sounds very unsatisfactory to me.
> > > That's why we are thinking to use placeholders.
> > > 
> > > The next question is - can we not break the code when using placeholders?
> > > By definition whenever we'd emit a placeholder our analysis can't provide a correct solution. That means that whatever way we'd make the code to compile can introduce a logic bug in the program. That is simply unacceptable given the overall goal of our effort.
> > > That pretty much leaves us with only using placeholders that would break the build.
> > > 
> > > WDYT?
> > > 
> > > I wasn't aware of the precedent related to fixits attached to notes not having to lead to compilable code and we should consider that path.
> > > I wasn't aware of the precedent related to fixits attached to notes not having to lead to compilable code and we should consider that path.
> > 
> > We document the expectations here: https://clang.llvm.org/docs/InternalsManual.html#fix-it-hints and the interesting bit is:
> > 
> > "Fix-it hints on errors and warnings need to obey these rules:
> > 
> > * Since they are automatically applied if -Xclang -fixit is passed to the driver, they should only be used when it’s very likely they match the user’s intent.
> > * Clang must recover from errors as if the fix-it had been applied.
> > * Fix-it hints on a warning must not change the meaning of the code. However, a hint may clarify the meaning as intentional, for example by adding parentheses when the precedence of operators isn’t obvious.
> > 
> > If a fix-it can’t obey these rules, put the fix-it on a note. Fix-its on notes are not applied automatically."
> > 
> > I think using placeholders on warnings rather than notes will be an issue for both the first and second bullets.
> I like that! This way we can even emit multiple alternative fixes, eg.
> ```
> note: change type of variable 'ptr' to 'std::span' to add bound checks
> note: change type of variable 'ptr' to 'std::span::iterator' to add bound checks
> ```
> (I'm still not sure we'll ever going to recommend iterators as a good solution, but they're generally a better drop-in replacement for a raw pointer, so despite their increased verbosity, we're considering them at least in some cases)
> 
> Another thing this can take care of is support for custom containers:
> ```
> note: change type of variable 'vla' to 'std::vector' to add bound checks
> note: change type of variable 'vla' to 'llvm::SmallVector' to add bound checks
> ```
>  Now we can simply present all available containers to the user, and the user can select the one they like the most.
Yes, we should look into attaching FixIts that contain placeholders to notes instead.

Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136811/new/

https://reviews.llvm.org/D136811