[PATCH] D136811: -Wunsafe-buffer-usage: WIP: RFC: NFC: User documentation.

Thu Nov 3 15:36:21 PDT 2022

NoQ added inline comments.

================
Comment at: clang/docs/SafeBuffers.rst:31
+    convert large amounts of old code to conform to the warning;
+  - Attribute ``[[unsafe_buffer_usage]]`` lets you annotate custom functions as
+    unsafe, while providing a safe alternative that can often be suggested by
----------------
aaron.ballman wrote:
> xazax.hun wrote:
> > aaron.ballman wrote:
> > > Hmmmm that attribute name looks problematic (more below in that section).
> > With C++ modules gaining some momentum, I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened. 
> I think it makes sense on module export (the module author knows whether they are buffer-safe or not), but what do you have in mind for import where the user may not be in a position to make that assertion validly?
> I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened.

It's usually not that valuable for us to know that outside the module. If the module wasn't hardened, there's nothing the "caller module" can do about it, the recommended approach is to harden the module first, and then use the attribute to indicate which functions were found to be unsafe in that process (especially if safe alternatives were implemented).

So it's only valuable when the module *will never be hardened*, and the only thing the caller module can do in such case is isolate calls to that module, maybe cover it with a safe abstraction. In this case, yeah, I think it makes sense. Maybe we should consider this eventually.

================
Comment at: clang/docs/SafeBuffers.rst:36-37
+    hardened mode where C++ classes such as ``std::vector`` and ``std::span``,
+    together with their respective ``iterator`` classes, are protected
+    at runtime against buffer overflows. This changes buffer overflows from
+    extremely dangerous undefined behavior to predictable runtime crash.
----------------
xazax.hun wrote:
> Only buffer overflows, so comparing cases like:
> ```
> auto it1 = v1.begin();
> auto it2 = v2.begin();
> 
> if (it1 == it2) { ... }
> ```
> 
> are out of scope, right? 
> 
> The main reason I'm asking, because I am not sure how safe iterators are implemented and whether the same implementation would give opportunities for other safety checks without significant additional cost. 
I suspect libc++ will have such checks (or even already does, @ldionne pls correct me if I'm wrong).

They're still out of scope for our proposal though. But they're definitely useful and may deserve mentioning.

================
Comment at: clang/docs/SafeBuffers.rst:40-41
+  - Finally, in order to avoid bugs in newly converted code, the
+    Clang static analyzer provides a checker to find misconstructed
+    ``std::span`` objects.
+
----------------
xazax.hun wrote:
> Isn't this overly specific? What about `string_view`s? 
Yeah I think we can cover string views as well!

================
Comment at: clang/docs/SafeBuffers.rst:115
+   (TODO: Will automatic fixits be able to suggest custom containers or views?)
+   (TODO: Explain how to implement such checks in a custom container?)
+
----------------
aaron.ballman wrote:
> xazax.hun wrote:
> > jkorous wrote:
> > > aaron.ballman wrote:
> > > > I think this would be a good idea, especially if you can use an example of a more involved (but still common) container. Like, how does someone harden a ring buffer?
> > > I believe that we should stick to hardening low-level primitives - all that a ring buffer implementation has to do is to store data in `std::array` from hardened libc++ and it's all set.
> > I think it would be great to link to some other resources like how to annotate your container such that address sanitizer can help you catch OOB accesses.
> > I believe that we should stick to hardening low-level primitives - all that a ring buffer implementation has to do is to store data in std::array from hardened libc++ and it's all set.
> 
> That doesn't help real world projects like LLVM where we replace the low-level primitives when they're not appropriate for our domain. I think we should assume that users will implement their own containers and support that situation from the start.
Yes, so I think we want to eventually cover low-level primitives like `SmallVector`, and then ask people to implement their higher-level primitives like ring buffers in terms of the covered low-level primitives.

The situation when people really need to have ultra-optimized high-level primitives that are implemented directly with raw buffers, and then optionally harden that, becomes significantly more rare. In such cases maybe it's ok not to have a clear step-by-step guide. But a ring buffer still makes a good example! I'll think more about that.

================
Comment at: clang/docs/SafeBuffers.rst:132
+    attribute ``[[unsafe_buffer_usage]]``,
+      - unless the pointer is a compile time constant ``0`` or ``nullptr``;
+      - a number of C/C++ standard library buffer manipulation functions
----------------
aaron.ballman wrote:
> `NULL` as well, I presume? (This matters for C where `NULL` often expands to `(void *)0`.)
Hmm, I was thinking that `((void *)0)` is a compile-time constant `0`. Like, I didn't say it's a *literal*. Is there a better term?

================
Comment at: clang/docs/SafeBuffers.rst:194
+
+The attribute ``[[unsafe_buffer_usage]]`` should be placed on functions
+that need to be avoided as they may cause buffer overflows. It is designed to
----------------
aaron.ballman wrote:
> That's stealing a name from the C and C++ committees because it's not behind a vendor attribute namespace. I think it should be `[[clang::unsafe_buffer_usage]]` or `unsafe_buffer_usage` (with no specific syntax) instead.
> 
Yes, absolutely, I'll update that as we get closer to landing the attribute.

================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
aaron.ballman wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".  But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > 
> > > I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give.
> > > 
> > > Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?)
> > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > 
> > Our model depends on Standard Library used implementing the bounds checks.
> > With that the "container was used" distinction is absolutely crucial - that is what adds the bounds checks and already mitigates certain classes of bugs (even if the `span` passed in still has the same wrong size).
> > 
> > Let's assume we're starting with this buggy code:
> > 
> > ```
> > void caller() {
> >   int buffer[10];
> >   callee(buffer, 20);
> > }
> > 
> > void callee(int* ptr, unsigned size) {
> >   ptr[size-1] = 42;
> >   ptr[size] = 42;
> >   ptr[100] = 42;
> > }
> > ```
> > and transform it to (this is just a part of what we actually plan but hopefully illustrates the point):
> > ```
> > void caller() {
> >   int buffer[10];
> >   callee(std::span<int>(buffer, 20), 20);
> > }
> > 
> > void callee(std::span<int> ptr, unsigned size) {
> >   ptr[size-1] = 42; // still unmitigated
> >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > }
> > ```
> > Our model depends on Standard Library used implementing the bounds checks.
> 
> So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?
> 
> > ```
> > void caller() {
> >   int buffer[10];
> >   callee(std::span<int>(buffer, 20), 20);
> > }
> > 
> > void callee(std::span<int> ptr, unsigned size) {
> >   ptr[size-1] = 42; // still unmitigated
> >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > }
> > ```
> 
> Understood; my point was that the comments in `callee()` are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source.
> One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".

Yeah that's an important thing for us to properly formulate, let me try to elaborate on what I meant.

So we trust containers and spans, like we trust smart pointers: if you simply use them "normally", they're correct by construction. You have to go out of your way to get them wrong. Like, you take an array, it knows the size, you implicitly convert it to span, it's now well-formed, you pass it from function to function, it's still well-formed as it still carries the original bounds information, you subspan it... well this needs a runtime bounds check yeah, but other than that it's hard to get it wrong. And our tool is supposed to transform the entire program to do exactly this. And once we're there, none of these operations require manual intervention from the user, so there's very little room for error.

On the contrary, if bounds information is passed separately, there's a lot of room for error every single time it's passed from function to function. There's literally nothing that prevents such information from going out of sync. There's nothing that prevents the user from passing a size of data he *intends to be written* into the buffer, instead of the size of the buffer. So we don't trust that.

Does that make sense?

================
Comment at: clang/docs/SafeBuffers.rst:257-259
+* Use of a function annotated by such attribute causes ``-Wunsafe-buffer-usage``
+  warning to appear, instead of ``-Wdeprecated``, so they can be
+  enabled/disabled independently as all four combinations make sense;
----------------
aaron.ballman wrote:
> I don't think I understand what you're trying to say here.
I was trying to say that `[[deprecated]] is probably appropriate if the vendor of the function expects *all* users of the function to move away from it.

But if the vendor thinks that the function is generally fine, and it's only important to move away from it when the `-Wunsafe-buffer-usage` analysis is requested by the user, then `[[unsafe_buffer_usage]]` is more appropriate as it only causes a warning to be emitted under that condition, not every time.

I'll wordsmith.

================
Comment at: clang/docs/SafeBuffers.rst:260-263
+* The “replacement” parameter of ``[[deprecated]]``, which allows for automatic
+  fixits when the function has a drop-in replacement, becomes significantly more
+  powerful and flexible in ``[[unsafe_buffer_usage]]`` where it will allow
+  non-trivial automatic fixes.
----------------
aaron.ballman wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > The deprecated attribute does not have a replacement argument, it has an argument for some kind of message to be printed when the deprecated interface is used. So I'm not certain what this is telling me either.
> > For better or worse the deprecated attribute is "overloaded" and this refers to the two-parameter version.
> > 
> > https://clang.llvm.org/docs/AttributeReference.html#deprecated
> > 
> > "When spelled as __attribute__((deprecated)), the deprecated attribute can have two optional string arguments. The first one is the message to display when emitting the warning; the second one enables the compiler to provide a Fix-It to replace the deprecated name with a new name."
> > 
> > Since it seems confusing even to you - maybe we should take this as a signal that our users would get just as confused and remove the reference to deprecated?
> > WDYT?
> Ohhhh! I see now! What confused me is that the docs spell it as `[[deprecated]]` which does *not* have an overload that takes two arguments. If we keep this part of the docs, you should probably change it to `__attribute__((deprecated))` and link to the attribute docs for it.
> 
> > Since it seems confusing even to you - maybe we should take this as a signal that our users would get just as confused and remove the reference to deprecated?
> 
> Maybe we should, but because I'm still not quite clear on how this relates to the deprecated attribute yet, maybe you could reply with some code examples of deprecated and unsafe_buffer_usage and what diagnostics you get? If that helps clarify, maybe I can help reword the docs to improve clarity.
Uh-oh I didn't notice there's a difference. Indeed, will fix.

================
Comment at: clang/docs/SafeBuffers.rst:287-288
+
+The automatic fixit associated with the warning would transform this array
+into an ``std::array``:
+
----------------
xazax.hun wrote:
> I anticipate that doing these fixits for multi-dimensional arrays will be confusing. Is that in scope?
Yeah it's gonna be fun, it's in scope, we'll see how deep we'll be able to get. Ideally we want to transform them in one step. Of course the procedure is drastically different in `int [][]` (array of arrays, one contiguous chunk of memory) vs. `int **` (span of spans, each span in its own heap allocation).

================
Comment at: clang/docs/SafeBuffers.rst:365-370
+The fixits emitted by the warning are correct modulo placeholders. Placeholders
+are the only reason why fixed code is allowed to fail to compile.
+Incorrectly resolving the placeholder is the only reason why fixed code
+will demonstrate incorrect runtime behavior compared to the original code.
+In an otherwise well-formed program it is always possible (and usually easy)
+to resolve the placeholder correctly.
----------------
aaron.ballman wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > Because fixits can be applied automatically to an entire source file, we don't typically allow fix-its that can break code (or fixits where there are multiple choices for the user to pick from) unless they're attached to a *note* instead of a warning (or error).
> > > 
> > > Are you planning to follow that same pattern when placeholders are involved?
> > > 
> > > (For example, one potential use for automatic fix-its is to apply them automatically to source on commit before running precommit CI; similar to how folks sometimes run clang-format to automate "the source is different before/after the clang-format, so you should look into that".)
> > That's an extremely interesting question! Let me add our current thinking as a context for the discussion.
> > 
> > Our analysis will have its limits and will not be able to provide 100% complete and guaranteed correct code transformation in every case.
> > 
> > At the same time it sounds suboptimal both from our and our users perspective to not emit anything for such situations.
> > Imagine we have a code transformation that affects 20 lines of code and all we need is for the user to address a single function call argument. 
> > The thought of bailing, producing just a warning and having user to manually change all 20 lines of code sounds very unsatisfactory to me.
> > That's why we are thinking to use placeholders.
> > 
> > The next question is - can we not break the code when using placeholders?
> > By definition whenever we'd emit a placeholder our analysis can't provide a correct solution. That means that whatever way we'd make the code to compile can introduce a logic bug in the program. That is simply unacceptable given the overall goal of our effort.
> > That pretty much leaves us with only using placeholders that would break the build.
> > 
> > WDYT?
> > 
> > I wasn't aware of the precedent related to fixits attached to notes not having to lead to compilable code and we should consider that path.
> > I wasn't aware of the precedent related to fixits attached to notes not having to lead to compilable code and we should consider that path.
> 
> We document the expectations here: https://clang.llvm.org/docs/InternalsManual.html#fix-it-hints and the interesting bit is:
> 
> "Fix-it hints on errors and warnings need to obey these rules:
> 
> * Since they are automatically applied if -Xclang -fixit is passed to the driver, they should only be used when it’s very likely they match the user’s intent.
> * Clang must recover from errors as if the fix-it had been applied.
> * Fix-it hints on a warning must not change the meaning of the code. However, a hint may clarify the meaning as intentional, for example by adding parentheses when the precedence of operators isn’t obvious.
> 
> If a fix-it can’t obey these rules, put the fix-it on a note. Fix-its on notes are not applied automatically."
> 
> I think using placeholders on warnings rather than notes will be an issue for both the first and second bullets.
I like that! This way we can even emit multiple alternative fixes, eg.
```
note: change type of variable 'ptr' to 'std::span' to add bound checks
note: change type of variable 'ptr' to 'std::span::iterator' to add bound checks
```
(I'm still not sure we'll ever going to recommend iterators as a good solution, but they're generally a better drop-in replacement for a raw pointer, so despite their increased verbosity, we're considering them at least in some cases)

Another thing this can take care of is support for custom containers:
```
note: change type of variable 'vla' to 'std::vector' to add bound checks
note: change type of variable 'vla' to 'llvm::SmallVector' to add bound checks
```
 Now we can simply present all available containers to the user, and the user can select the one they like the most.

Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136811/new/

https://reviews.llvm.org/D136811