[PATCH] D136811: WIP: RFC: NFC: C++ Buffer Hardening user documentation.

Thu Nov 3 10:15:23 PDT 2022

aaron.ballman added inline comments.

================
Comment at: clang/docs/SafeBuffers.rst:31
+    convert large amounts of old code to conform to the warning;
+  - Attribute ``[[unsafe_buffer_usage]]`` lets you annotate custom functions as
+    unsafe, while providing a safe alternative that can often be suggested by
----------------
xazax.hun wrote:
> aaron.ballman wrote:
> > Hmmmm that attribute name looks problematic (more below in that section).
> With C++ modules gaining some momentum, I wonder if it would be possible/beneficial to be able to annotate module import/export declarations with this attribute to signal that the APIs and the content in a module was not yet hardened. 
I think it makes sense on module export (the module author knows whether they are buffer-safe or not), but what do you have in mind for import where the user may not be in a position to make that assertion validly?

================
Comment at: clang/docs/SafeBuffers.rst:69-70
+containers such as ``std::span``, you can achieve bounds safety by
+*simply writing good modern C++ code*. This is what this solution is designed
+to help your codebase with.
+
----------------
jkorous wrote:
> aaron.ballman wrote:
> > This makes it sound like the proposal doesn't cover C because there is no version of `std::span` there.
> I read this as a question as: "Could an imaginary codebase benefit from the warnings if there were no fixits?".
> 
> I can imagine projects where it would be very useful to have 90% of the code guaranteed to be free of "potentially dangerous pointer arithmetic" and have such operations contained to the remaining 10% of the code. I can imagine a principled solution to bounds-safety in C but even just splitting the code into safe and 9x smaller "unsafe" part could be extremely useful when it comes to debugging, audits, automated and manual testing and similar. Having clang guarantee this property reduces the scope of code that needs to be processed and a deliberate use of this model by projects can unlock new possibilities.
> 
> That is why I feel that the proposal can benefit also C projects.
> I read this as a question as: "Could an imaginary codebase benefit from the warnings if there were no fixits?".

Not quite. It's whether an imaginary codebase can benefit from the warnings when there are no available standardized alternatives that are a better replacement and no C standard library hardening done. Basically, it feels like enabling this warning in a C code base is not the same as a C++ code base. In C++, it's "enable the warning, start using these awesome replacements, and get to a good state" and in C, it's "enable the warning, start implementing exemptions everywhere, and hope they're correct or that your test coverage saves you from mistakes."

================
Comment at: clang/docs/SafeBuffers.rst:115
+   (TODO: Will automatic fixits be able to suggest custom containers or views?)
+   (TODO: Explain how to implement such checks in a custom container?)
+
----------------
xazax.hun wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > I think this would be a good idea, especially if you can use an example of a more involved (but still common) container. Like, how does someone harden a ring buffer?
> > I believe that we should stick to hardening low-level primitives - all that a ring buffer implementation has to do is to store data in `std::array` from hardened libc++ and it's all set.
> I think it would be great to link to some other resources like how to annotate your container such that address sanitizer can help you catch OOB accesses.
> I believe that we should stick to hardening low-level primitives - all that a ring buffer implementation has to do is to store data in std::array from hardened libc++ and it's all set.

That doesn't help real world projects like LLVM where we replace the low-level primitives when they're not appropriate for our domain. I think we should assume that users will implement their own containers and support that situation from the start.

================
Comment at: clang/docs/SafeBuffers.rst:133
+      - unless the pointer is a compile time constant ``0`` or ``nullptr``;
+      - a number of C/C++ standard library buffer manipulation functions
+        (such as std::memcpy() or std::next()) are hardcoded to act
----------------
jkorous wrote:
> aaron.ballman wrote:
> > Eventually we should extend this to have an exhaustive list.
> Absolutely! Is it ok if we populate the list as we land the implementation?
Populating the list as we land the implementation sounds great to me.

================
Comment at: clang/docs/SafeBuffers.rst:173
+  #pragma unsafe_buffer_usage begin
+
+Such pragmas not only enable incremental adoption with much smaller granularity,
----------------
jkorous wrote:
> aaron.ballman wrote:
> > Have you considered allowing a short-hand form of the pragmas that are scoped to the current block scope (or file scope if applied at global scope)? e.g.,
> > ```
> > void func() {
> >   {
> >     #pragma only_safe_buffers
> >     int stuff[10] = {};
> >     int x = stuff[1]; // Warning: you're a bad person who should feel bad
> >   }
> >   int array[10] = {};
> >   int y = array[1]; // this_is_fine.png
> > }
> > ```
> We discussed other options but generally prefer simple pragmas for their flexibility.
> Our current thinking is that for readability purposes we won't allow nesting of pragmas and would have end of scope to be explicit.
> While this might be more verbose it would be dead simple to reason about.
> We discussed other options but generally prefer simple pragmas for their flexibility.

A single pragma with lexical scoping behaves like RAII, which is often a replacement for more complex acquire/release paired operations and is a pretty well-understood pattern in C++.

> Our current thinking is that for readability purposes we won't allow nesting of pragmas and would have end of scope to be explicit.
> While this might be more verbose it would be dead simple to reason about.

True, it is easy to reason about. But not allowing nesting for readability purposes removes some utility, especially for folks in C where macros are much more common. e.g.,
```
#define ONLY_SAFE_BUFFERS _Pragma("only_safe_buffers")

#define AWESOME_BUFFER_THING(ptr, size) ({ ONLY_SAFE_BUFFERS; (interesting work goes here) })

void func(int *ptr, size_t size) {
  ONLY_SAFE_BUFFERS;
  ... interesting code lives here ...
  AWESOME_BUFFER_THING(ptr, size); // Oops, this is nested.
 .. more interesting code lives here ...
}
```
(You can hit the same situation with paired pragmas.) Because C doesn't have constexpr or consteval support for functions or templates, function-like macros are a pretty common interface folks use in practice and macro expansion makes it somewhat hard to avoid potential surprising conflicts.

================
Comment at: clang/docs/SafeBuffers.rst:181-188
+Even though similar functionality can be achieved with generic
+``#pragma clang diagnostic``, a custom pragma is preferable because
+the generic solution’s push/pop mechanism may be confusing. In particular,
+it’s not immediately obvious from the code which mode gets enabled after
+a “pop”, which makes it harder to reason about safety both for you and for
+the purposes of security audit. It is also generally valuable for such
+annotation to describe the property of the annotated code,
----------------
jkorous wrote:
> aaron.ballman wrote:
> > > a custom pragma is preferable because the generic solution’s push/pop mechanism may be confusing.
> > 
> > Er, I'm not certain how much I agree with that. push/pop semantics on this have been around for a *long* time. I do agree that the user can write code such that another user may struggle to understand what the state of the system is after a pop operation, but that shouldn't be a deciding factor for whether we add a new pragma to do the same thing as another pragma. I think I'd rewrite this as:
> > ```
> > Similar functionality can be achieved with ``#pragma clang diagnostic``, but such use is discouraged because it
> > reduces code readability due to being an imperative action rather than describing a property of the annotated
> > code. That reduced readability also adds complexity to security code audits. Use of one of the dedicated
> > pragmas is strongly recommended.
> > ```
> > or something along these lines. WDYT?
> We'd love to arrive at a design that is very difficult to use incorrectly.
> 
> You are absolutely right that our preference for not using the diagnostic pragmas stems from the "untyped" pop operation.
> 
> Assuming this code is a starting point:
> ```
> #pragma clang diagnostic push warning "-Wfoo"
> // code
> #pragma clang diagnostic pop //end of -Wfoo
> // more code
> ```
> 
> if the user adds pragmas like this:
> ```
> #pragma clang diagnostic push warning "-Wfoo"
> #pragma clang diagnostic push warning "-Wsafe-buffers-usage"
> // code
> #pragma clang diagnostic pop //end of -Wfoo
> // more code
> #pragma clang diagnostic pop //end of -Wsafe-buffers-usage
> ```
> 
> they actually get this behavior:
> ```
> #pragma clang diagnostic push warning "-Wfoo"
> #pragma clang diagnostic push warning "-Wsafe-buffers-usage"
> // code
> #pragma clang diagnostic pop //end of -Wsafe-buffers-usage
> // more code
> #pragma clang diagnostic pop //end of -Wfoo
> ```
> 
> We should have the patch for pragmas ready to be put for review here in not too distant future. Maybe it'd be easier to discuss there?
> We should have the patch for pragmas ready to be put for review here in not too distant future. Maybe it'd be easier to discuss there?

Certainly! I was mostly trying to reword what you had to keep the same intent but not make it sound like the basis for the pragmas is "we think these other pragmas, which do work, were not good enough" because that's not very technically compelling.

================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
jkorous wrote:
> aaron.ballman wrote:
> > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".  But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > 
> > I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give.
> > 
> > Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?)
> > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> 
> Our model depends on Standard Library used implementing the bounds checks.
> With that the "container was used" distinction is absolutely crucial - that is what adds the bounds checks and already mitigates certain classes of bugs (even if the `span` passed in still has the same wrong size).
> 
> Let's assume we're starting with this buggy code:
> 
> ```
> void caller() {
>   int buffer[10];
>   callee(buffer, 20);
> }
> 
> void callee(int* ptr, unsigned size) {
>   ptr[size-1] = 42;
>   ptr[size] = 42;
>   ptr[100] = 42;
> }
> ```
> and transform it to (this is just a part of what we actually plan but hopefully illustrates the point):
> ```
> void caller() {
>   int buffer[10];
>   callee(std::span<int>(buffer, 20), 20);
> }
> 
> void callee(std::span<int> ptr, unsigned size) {
>   ptr[size-1] = 42; // still unmitigated
>   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
>   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> }
> ```
> Our model depends on Standard Library used implementing the bounds checks.

So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?

> ```
> void caller() {
>   int buffer[10];
>   callee(std::span<int>(buffer, 20), 20);
> }
> 
> void callee(std::span<int> ptr, unsigned size) {
>   ptr[size-1] = 42; // still unmitigated
>   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
>   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> }
> ```

Understood; my point was that the comments in `callee()` are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source.

================
Comment at: clang/docs/SafeBuffers.rst:260-263
+* The “replacement” parameter of ``[[deprecated]]``, which allows for automatic
+  fixits when the function has a drop-in replacement, becomes significantly more
+  powerful and flexible in ``[[unsafe_buffer_usage]]`` where it will allow
+  non-trivial automatic fixes.
----------------
jkorous wrote:
> aaron.ballman wrote:
> > The deprecated attribute does not have a replacement argument, it has an argument for some kind of message to be printed when the deprecated interface is used. So I'm not certain what this is telling me either.
> For better or worse the deprecated attribute is "overloaded" and this refers to the two-parameter version.
> 
> https://clang.llvm.org/docs/AttributeReference.html#deprecated
> 
> "When spelled as __attribute__((deprecated)), the deprecated attribute can have two optional string arguments. The first one is the message to display when emitting the warning; the second one enables the compiler to provide a Fix-It to replace the deprecated name with a new name."
> 
> Since it seems confusing even to you - maybe we should take this as a signal that our users would get just as confused and remove the reference to deprecated?
> WDYT?
Ohhhh! I see now! What confused me is that the docs spell it as `[[deprecated]]` which does *not* have an overload that takes two arguments. If we keep this part of the docs, you should probably change it to `__attribute__((deprecated))` and link to the attribute docs for it.

> Since it seems confusing even to you - maybe we should take this as a signal that our users would get just as confused and remove the reference to deprecated?

Maybe we should, but because I'm still not quite clear on how this relates to the deprecated attribute yet, maybe you could reply with some code examples of deprecated and unsafe_buffer_usage and what diagnostics you get? If that helps clarify, maybe I can help reword the docs to improve clarity.

================
Comment at: clang/docs/SafeBuffers.rst:352
+
+  - The compatibility overload contains a ``__placeholder__`` which needs
+    to be populated manually. In this case ``size`` is a good candidate.
----------------
xazax.hun wrote:
> Just a small concern. While users are not allowed to use identifiers like this, there is always a chance someone has a global variable of this name and the code ends up compiling. So, if we want to be absolutely safe, we should either have logic to pick a unique name, or not generate fixits in that case. Although, this is probably rare enough that we could just hand-wave.
Agreed; we should not have a fixit that suggests use of a reserved identifier.

================
Comment at: clang/docs/SafeBuffers.rst:365-370
+The fixits emitted by the warning are correct modulo placeholders. Placeholders
+are the only reason why fixed code is allowed to fail to compile.
+Incorrectly resolving the placeholder is the only reason why fixed code
+will demonstrate incorrect runtime behavior compared to the original code.
+In an otherwise well-formed program it is always possible (and usually easy)
+to resolve the placeholder correctly.
----------------
jkorous wrote:
> aaron.ballman wrote:
> > Because fixits can be applied automatically to an entire source file, we don't typically allow fix-its that can break code (or fixits where there are multiple choices for the user to pick from) unless they're attached to a *note* instead of a warning (or error).
> > 
> > Are you planning to follow that same pattern when placeholders are involved?
> > 
> > (For example, one potential use for automatic fix-its is to apply them automatically to source on commit before running precommit CI; similar to how folks sometimes run clang-format to automate "the source is different before/after the clang-format, so you should look into that".)
> That's an extremely interesting question! Let me add our current thinking as a context for the discussion.
> 
> Our analysis will have its limits and will not be able to provide 100% complete and guaranteed correct code transformation in every case.
> 
> At the same time it sounds suboptimal both from our and our users perspective to not emit anything for such situations.
> Imagine we have a code transformation that affects 20 lines of code and all we need is for the user to address a single function call argument. 
> The thought of bailing, producing just a warning and having user to manually change all 20 lines of code sounds very unsatisfactory to me.
> That's why we are thinking to use placeholders.
> 
> The next question is - can we not break the code when using placeholders?
> By definition whenever we'd emit a placeholder our analysis can't provide a correct solution. That means that whatever way we'd make the code to compile can introduce a logic bug in the program. That is simply unacceptable given the overall goal of our effort.
> That pretty much leaves us with only using placeholders that would break the build.
> 
> WDYT?
> 
> I wasn't aware of the precedent related to fixits attached to notes not having to lead to compilable code and we should consider that path.
> I wasn't aware of the precedent related to fixits attached to notes not having to lead to compilable code and we should consider that path.

We document the expectations here: https://clang.llvm.org/docs/InternalsManual.html#fix-it-hints and the interesting bit is:

"Fix-it hints on errors and warnings need to obey these rules:

* Since they are automatically applied if -Xclang -fixit is passed to the driver, they should only be used when it’s very likely they match the user’s intent.
* Clang must recover from errors as if the fix-it had been applied.
* Fix-it hints on a warning must not change the meaning of the code. However, a hint may clarify the meaning as intentional, for example by adding parentheses when the precedence of operators isn’t obvious.

If a fix-it can’t obey these rules, put the fix-it on a note. Fix-its on notes are not applied automatically."

I think using placeholders on warnings rather than notes will be an issue for both the first and second bullets.

Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136811/new/

https://reviews.llvm.org/D136811