[PATCH] D136811: [-Wunsafe-buffer-usage] WIP: RFC: NFC: User documentation.

Mon Nov 28 20:54:51 PST 2022

NoQ added inline comments.

================
Comment at: clang/docs/SafeBuffers.rst:31
+    convert large amounts of old code to conform to the warning;
+  - Attribute ``[[unsafe_buffer_usage]]`` lets you annotate custom functions as
+    unsafe, while providing a safe alternative that can often be suggested by
----------------
aaron.ballman wrote:
> xazax.hun wrote:
> > NoQ wrote:
> > > aaron.ballman wrote:
> > > > xazax.hun wrote:
> > > > > aaron.ballman wrote:
> > > > > > Hmmmm that attribute name looks problematic (more below in that section).
> > > > > With C++ modules gaining some momentum, I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened. 
> > > > I think it makes sense on module export (the module author knows whether they are buffer-safe or not), but what do you have in mind for import where the user may not be in a position to make that assertion validly?
> > > > I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened.
> > > 
> > > It's usually not that valuable for us to know that outside the module. If the module wasn't hardened, there's nothing the "caller module" can do about it, the recommended approach is to harden the module first, and then use the attribute to indicate which functions were found to be unsafe in that process (especially if safe alternatives were implemented).
> > > 
> > > So it's only valuable when the module *will never be hardened*, and the only thing the caller module can do in such case is isolate calls to that module, maybe cover it with a safe abstraction. In this case, yeah, I think it makes sense. Maybe we should consider this eventually.
> > Wouldn't `only_safe_buffers` make sense on the import side to ensure that no unhardened code is being imported from a specific module?
> > Wouldn't only_safe_buffers make sense on the import side to ensure that no unhardened code is being imported from a specific module?
> 
> But the import side has no way to know that the module being imported has no unhardened code, only the module implementer knows that.
> 
> That said, I could imagine wanting to mark a module import as "will never be hardened" as a blanket way to say "don't complain about interfaces in this module I have no control over".
> 
> But now I'm questioning something from the docs:
> 
> > Pragmas ``unsafe_buffer_usage`` and ``only_safe_buffers`` and allow you to
> > annotate sections of code as either already-hardened or not-to-be-hardened,
> 
> It now occurs to me there's two ways to read this.
> 
> `unsafe_buffer_usage` = will diagnose any unsafe buffer usage, it's already hardened.
> `only_safe_buffers` == should only be used with known-safe buffers, it's not going to be hardened
> 
> or
> 
> `unsafe_buffer_usage` == this uses unsafe buffers, it's not to be hardened
> `only_safe_buffers` == this uses only safe buffers, it's already been hardened
> 
> I originally read it the first way from the docs in the patch but the comment from @xazax.hun reads to me like he read it the second way.
Yes our intent was to have
```
#pragma unsafe_buffer_usage begin

#pragma unsafe_buffer_usage end
```
annotate a section of code that intentionally uses unsafe buffers (so, warnings are suppressed). It makes sense to have `#pragma unsafe...` mean something's unsafe.

I wonder if it wouldn't have been confusing without my clumsy explanation.

@jkorous WDYT about this wording problem, do you see ambiguity here?

================
Comment at: clang/docs/SafeBuffers.rst:40-41
+  - Finally, in order to avoid bugs in newly converted code, the
+    Clang static analyzer provides a checker to find misconstructed
+    ``std::span`` objects.
+
----------------
aaron.ballman wrote:
> xazax.hun wrote:
> > jkorous wrote:
> > > NoQ wrote:
> > > > xazax.hun wrote:
> > > > > Isn't this overly specific? What about `string_view`s? 
> > > > Yeah I think we can cover string views as well!
> > > This is based on our current plans for near future.
> > > While we are positive we will use `std::span` in the FixIts it is very unlikely we'd use `string_view`. That's why having a checker for `std::span` is more important for us now.
> > Does this mean you'd recommend using a `span<char>` over `string_view` in all cases? I think that might surprise some users, so I wonder if it is worth documenting. 
> Oh, I was sort of assuming we'd be suggesting `string_view` over `span<char>` when appropriate. So agreed that we should clearly document this as it may surprise folks.
> 
> Out of curiosity, why the strong preference for `span`?
`string_view` is a bit harder to recommend automatically because whether a certain sequence of bytes constitutes a "string" is often a matter of opinion / user's point of view. `string_view`'s methods such as `find` will not necessarily make sense over original data, or may turn out to have counter-intuitive behavior when users try to modify the code further.

Additionally, `string_view` doesn't offer write access, so at best we'll be able to recommend it for `const` pointers.

So I think we should gather more data before considering it.

================
Comment at: clang/docs/SafeBuffers.rst:69-70
+containers such as ``std::span``, you can achieve bounds safety by
+*simply writing good modern C++ code*. This is what this solution is designed
+to help your codebase with.
+
----------------
aaron.ballman wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > This makes it sound like the proposal doesn't cover C because there is no version of `std::span` there.
> > I read this as a question as: "Could an imaginary codebase benefit from the warnings if there were no fixits?".
> > 
> > I can imagine projects where it would be very useful to have 90% of the code guaranteed to be free of "potentially dangerous pointer arithmetic" and have such operations contained to the remaining 10% of the code. I can imagine a principled solution to bounds-safety in C but even just splitting the code into safe and 9x smaller "unsafe" part could be extremely useful when it comes to debugging, audits, automated and manual testing and similar. Having clang guarantee this property reduces the scope of code that needs to be processed and a deliberate use of this model by projects can unlock new possibilities.
> > 
> > That is why I feel that the proposal can benefit also C projects.
> > I read this as a question as: "Could an imaginary codebase benefit from the warnings if there were no fixits?".
> 
> Not quite. It's whether an imaginary codebase can benefit from the warnings when there are no available standardized alternatives that are a better replacement and no C standard library hardening done. Basically, it feels like enabling this warning in a C code base is not the same as a C++ code base. In C++, it's "enable the warning, start using these awesome replacements, and get to a good state" and in C, it's "enable the warning, start implementing exemptions everywhere, and hope they're correct or that your test coverage saves you from mistakes."
> Basically, it feels like enabling this warning in a C code base is not the same as a C++ code base. In C++, it's "enable the warning, start using these awesome replacements, and get to a good state" and in C, it's "enable the warning, start implementing exemptions everywhere, and hope they're correct or that your test coverage saves you from mistakes."

Yes that's definitely true.

> This makes it sound like the proposal doesn't cover C because there is no version of `std::span` there.

Hmm, I didn't intend to consider this possibility *here*. This section is mostly motivational, I wanted to explain why the default recommended setup makes sense. That's why I start this section with a "Whelp for C it'd probably be better to build something completely different, even if this proposal is better than nothing".

I added a section "What About Plain C?" to discuss this once in detail, and let the rest of the proposal focus on C++.

================
Comment at: clang/docs/SafeBuffers.rst:90
+   manually ``delete[]`` it when appropriate.
+   (TODO: What about things like ``std::shared_ptr<T []>``)?
+2. When you deal with function or class interfaces that accept or return
----------------
aaron.ballman wrote:
> We should mention as part of the model description that pointer-chasing structures like linked lists, trees, etc are not covered by the model.
> 
> (That said, I suspect the *implementation* of such things might be covered. e.g., a tree with `Node *Children[]`)
Added a couple paragraphs above the list to cover that.

================
Comment at: clang/docs/SafeBuffers.rst:124
+  - Array subscript expression on raw arrays or raw pointers,
+      - unless the index is a compile-time constant ``0``,
+  - Increment and decrement of a raw pointer with operators ``++`` and ``--``;
----------------
NoQ wrote:
> xazax.hun wrote:
> > xazax.hun wrote:
> > > Isn't this too restrictive? How about arrays where both the index and the size of the array is known at compile time?
> > > 
> > > Also, what about subscripts in `consteval` code where the compiler should diagnose OOB accesses at compile time?
> > > 
> > > I believe this model can be made more ergonomic without losing any of the guarantees.
> > Small ping on this point. I think there are many code patterns that are completely safe (i.e., the compiler can diagnose OOB accesses), but the current model would ban. One example is converting an enum value to string using an array of string_views. In those cases, both enum consts' value and the array's size are known at compile time. I think those easy to diagnose special cases should be permitted to make programming more ergonomic. The more ergonomic the experience, the faster the adoption will be. 
> Yes, we'll probably add occasional suppressions for known safe cases like this, as long as we're sure we'll diagnose if safety is lost in later code changes. And then document them here.
> 
> It's probably even reasonable to suppress the warning when the index is a loop counter and the loop bound is known at compile time to be sufficiently small. Though it'll take some work to implement (re-use static analyzer's loop unrolling prerequisites code?)
> 
> (uh-oh, I thought I sent this reply before the conference)
Added a clause about known sizes. I'll remove it if we decide it's not feasible.

================
Comment at: clang/docs/SafeBuffers.rst:127
+  - Addition or subtraction of a number to/from a raw pointer with operators
+  ``+``, ``-``, ``+=``, ``-=``,
+      - unless that number is a compile time constant ``0``;
----------------
steakhal wrote:
> Sphinx is picky about indentation, and it also complains about the missing new-line prior to the sub-list.
> The same also applies to the very next sub-list.
Uh-oh, great catch!

================
Comment at: clang/docs/SafeBuffers.rst:128
+  ``+``, ``-``, ``+=``, ``-=``,
+      - unless that number is a compile time constant ``0``;
+      - subtraction between two pointers is also fine;
----------------
jkorous wrote:
> aaron.ballman wrote:
> > One case I think this is going to trip up is: `this + 1` (which is not uncommon in practice: https://sourcegraph.com/search?q=context:global+-file:.*test.*+file:.*%5C.cpp+this+%2B+1&patternType=standard). How do users fix up that pointer arithmetic and do they ever get any safety or security benefits from it? (Note, this doesn't apply to `+=`, just `+`.)
> That's a fair point and if we don't find a good solution we should ignore arithmetic with `this` and a constant.
I'll leave this unaddressed for now. It looks like these constructs typically *do* indicate that the remaining part of the object is a buffer. C++ doesn't seem to provide a good "safe" way to represent such objects. So, yeah, we need more data.

================
Comment at: clang/docs/SafeBuffers.rst:132
+    attribute ``[[unsafe_buffer_usage]]``,
+      - unless the pointer is a compile time constant ``0`` or ``nullptr``;
+      - a number of C/C++ standard library buffer manipulation functions
----------------
aaron.ballman wrote:
> NoQ wrote:
> > aaron.ballman wrote:
> > > `NULL` as well, I presume? (This matters for C where `NULL` often expands to `(void *)0`.)
> > Hmm, I was thinking that `((void *)0)` is a compile-time constant `0`. Like, I didn't say it's a *literal*. Is there a better term?
> Are you okay with constant expressions? e.g.,
> ```
> consteval int foo() { return 0; }
> 
> void func(char *ptr) {
>   char *offset_ptr = ptr + foo(); // Is this fine?
> }
> ```
Yes sure, why not? If it ever becomes unsafe, we'll immediately warn.

================
Comment at: clang/docs/SafeBuffers.rst:148
+but only when the function is annotated with the attribute. The attribute gets
+auto-attached by automatic fixits, which let the warning and the fixes propagate
+across function boundaries, and the users are encouraged to annotate their
----------------
steakhal wrote:
> What does //auto-attached// mean?
Means they show up in fixits as suggested code. Clarified, thanks!

================
Comment at: clang/docs/SafeBuffers.rst:181-188
+Even though similar functionality can be achieved with generic
+``#pragma clang diagnostic``, a custom pragma is preferable because
+the generic solution’s push/pop mechanism may be confusing. In particular,
+it’s not immediately obvious from the code which mode gets enabled after
+a “pop”, which makes it harder to reason about safety both for you and for
+the purposes of security audit. It is also generally valuable for such
+annotation to describe the property of the annotated code,
----------------
aaron.ballman wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > > a custom pragma is preferable because the generic solution’s push/pop mechanism may be confusing.
> > > 
> > > Er, I'm not certain how much I agree with that. push/pop semantics on this have been around for a *long* time. I do agree that the user can write code such that another user may struggle to understand what the state of the system is after a pop operation, but that shouldn't be a deciding factor for whether we add a new pragma to do the same thing as another pragma. I think I'd rewrite this as:
> > > ```
> > > Similar functionality can be achieved with ``#pragma clang diagnostic``, but such use is discouraged because it
> > > reduces code readability due to being an imperative action rather than describing a property of the annotated
> > > code. That reduced readability also adds complexity to security code audits. Use of one of the dedicated
> > > pragmas is strongly recommended.
> > > ```
> > > or something along these lines. WDYT?
> > We'd love to arrive at a design that is very difficult to use incorrectly.
> > 
> > You are absolutely right that our preference for not using the diagnostic pragmas stems from the "untyped" pop operation.
> > 
> > Assuming this code is a starting point:
> > ```
> > #pragma clang diagnostic push warning "-Wfoo"
> > // code
> > #pragma clang diagnostic pop //end of -Wfoo
> > // more code
> > ```
> > 
> > if the user adds pragmas like this:
> > ```
> > #pragma clang diagnostic push warning "-Wfoo"
> > #pragma clang diagnostic push warning "-Wsafe-buffers-usage"
> > // code
> > #pragma clang diagnostic pop //end of -Wfoo
> > // more code
> > #pragma clang diagnostic pop //end of -Wsafe-buffers-usage
> > ```
> > 
> > they actually get this behavior:
> > ```
> > #pragma clang diagnostic push warning "-Wfoo"
> > #pragma clang diagnostic push warning "-Wsafe-buffers-usage"
> > // code
> > #pragma clang diagnostic pop //end of -Wsafe-buffers-usage
> > // more code
> > #pragma clang diagnostic pop //end of -Wfoo
> > ```
> > 
> > We should have the patch for pragmas ready to be put for review here in not too distant future. Maybe it'd be easier to discuss there?
> > We should have the patch for pragmas ready to be put for review here in not too distant future. Maybe it'd be easier to discuss there?
> 
> Certainly! I was mostly trying to reword what you had to keep the same intent but not make it sound like the basis for the pragmas is "we think these other pragmas, which do work, were not good enough" because that's not very technically compelling.
Reworded a bit and added the example!

================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
aaron.ballman wrote:
> jkorous wrote:
> > NoQ wrote:
> > > aaron.ballman wrote:
> > > > jkorous wrote:
> > > > > aaron.ballman wrote:
> > > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".  But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > > 
> > > > > > I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give.
> > > > > > 
> > > > > > Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?)
> > > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > 
> > > > > Our model depends on Standard Library used implementing the bounds checks.
> > > > > With that the "container was used" distinction is absolutely crucial - that is what adds the bounds checks and already mitigates certain classes of bugs (even if the `span` passed in still has the same wrong size).
> > > > > 
> > > > > Let's assume we're starting with this buggy code:
> > > > > 
> > > > > ```
> > > > > void caller() {
> > > > >   int buffer[10];
> > > > >   callee(buffer, 20);
> > > > > }
> > > > > 
> > > > > void callee(int* ptr, unsigned size) {
> > > > >   ptr[size-1] = 42;
> > > > >   ptr[size] = 42;
> > > > >   ptr[100] = 42;
> > > > > }
> > > > > ```
> > > > > and transform it to (this is just a part of what we actually plan but hopefully illustrates the point):
> > > > > ```
> > > > > void caller() {
> > > > >   int buffer[10];
> > > > >   callee(std::span<int>(buffer, 20), 20);
> > > > > }
> > > > > 
> > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > >   ptr[size-1] = 42; // still unmitigated
> > > > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > }
> > > > > ```
> > > > > Our model depends on Standard Library used implementing the bounds checks.
> > > > 
> > > > So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?
> > > > 
> > > > > ```
> > > > > void caller() {
> > > > >   int buffer[10];
> > > > >   callee(std::span<int>(buffer, 20), 20);
> > > > > }
> > > > > 
> > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > >   ptr[size-1] = 42; // still unmitigated
> > > > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > }
> > > > > ```
> > > > 
> > > > Understood; my point was that the comments in `callee()` are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source.
> > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".
> > > 
> > > Yeah that's an important thing for us to properly formulate, let me try to elaborate on what I meant.
> > > 
> > > So we trust containers and spans, like we trust smart pointers: if you simply use them "normally", they're correct by construction. You have to go out of your way to get them wrong. Like, you take an array, it knows the size, you implicitly convert it to span, it's now well-formed, you pass it from function to function, it's still well-formed as it still carries the original bounds information, you subspan it... well this needs a runtime bounds check yeah, but other than that it's hard to get it wrong. And our tool is supposed to transform the entire program to do exactly this. And once we're there, none of these operations require manual intervention from the user, so there's very little room for error.
> > > 
> > > On the contrary, if bounds information is passed separately, there's a lot of room for error every single time it's passed from function to function. There's literally nothing that prevents such information from going out of sync. There's nothing that prevents the user from passing a size of data he *intends to be written* into the buffer, instead of the size of the buffer. So we don't trust that.
> > > 
> > > Does that make sense?
> > The warnings about unsafe buffer usage would be valid no matter what Standard Library implementation is used and I can imagine the FixIts to exist in different flavors.
> > We don't plan to target any other Standard Library implementation but the machinery we'll bring up should allow that to be implemented in the future.
> > Does that make sense?
> 
> It does, but the situation I'm thinking of is where the user is migrating their unsafe code to this new programming mode and aren't able to use libc++ to get the extra hardening checks. Consider this contrived example:
> ```
> extern "C" void func(int *array, size_t count) {
>   // Do buffer things with array and count
> }
> ```
> The user can't (easily) modify this code to accept a span directly because that could potentially break ABI. So they wrap the data in a span:
> ```
> extern "C" void func(int *array, size_t count) {
>   std::span<int> sp(array, count);
>   // Do buffer things with sp
> }
> ```
> In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses `span` -- but more trustworthy based on what? There's no hardened checks in the STL they're using and the potentially-wrong information is the same flavor of potentially-wrong in either case.
> 
> My fear is that we push users to make changes like that by telling them it's safer when it's actually functionally equivalent instead, but when they migrates their code they accidentally *introduce* a new bug where there wasn't one before.
> 
> I'm used to thinking about this sort of stuff as a kind of taint analysis. We need to know where that data comes from in order to know whether we can trust it or not. Did the "count" come from user input on the command line... or did it come from a call to `.size()` on some other container? These have different levels of trust (to me).
> 
> Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like `extern "C"` or the function linkage to try to reduce the amount of ABI breaks suggested?
> In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses span

No, that's definitely not our intention. That's why I'm explicitly specifying that this assumption only applies to containers constructed *outside* the function. In your example both functions should wear the attribute. And there's no way to produce a safe alternative (which wouldn't need `[[unsafe_buffer_usage]]`) under `extern "C"` linkage.

But on the C side, there's no expectation of having a safe alternative in the first place. Calls to this function become just another unsafe operation, which we can hope to isolate and reduce, but never to eliminate. So it's fine that a safe alternative isn't provided at all.

> Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like extern "C" or the function linkage to try to reduce the amount of ABI breaks suggested?

So, yes, we want fixes on interface boundaries, but we only want fixes that don't break source or binary compatibility. So our fixes will always preserve the original interface, possibly annotate it as unsafe, and then provide a safe alternative that call sites can, but *don't have to*, transition to. So in this case we're aiming for something like this:

```lang=c++
[[clang::unsafe_buffer_usage]]
extern "C" void func(int *array, size_t count) {
  func(std::span<int>(array, count));
}

void func(std::span<int> sp) {
  // Do buffer things with sp
}
```
where the original function is preserved, the code is not duplicated, and the safe function doesn't have `extern "C"`. In this case the original function is still unsafe (wears the attribute) because the container isn't coming from outside.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136811/new/

https://reviews.llvm.org/D136811