[PATCH] D136811: [-Wunsafe-buffer-usage] WIP: RFC: NFC: User documentation.

Aaron Ballman via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Nov 29 05:54:24 PST 2022


aaron.ballman added inline comments.


================
Comment at: clang/docs/SafeBuffers.rst:92
+
+However, no automatic code modernizer for plain C is not provided,
+and the hardened C++ standard library cannot benefit C code, which limits
----------------



================
Comment at: clang/docs/SafeBuffers.rst:40-41
+  - Finally, in order to avoid bugs in newly converted code, the
+    Clang static analyzer provides a checker to find misconstructed
+    ``std::span`` objects.
+
----------------
NoQ wrote:
> aaron.ballman wrote:
> > xazax.hun wrote:
> > > jkorous wrote:
> > > > NoQ wrote:
> > > > > xazax.hun wrote:
> > > > > > Isn't this overly specific? What about `string_view`s? 
> > > > > Yeah I think we can cover string views as well!
> > > > This is based on our current plans for near future.
> > > > While we are positive we will use `std::span` in the FixIts it is very unlikely we'd use `string_view`. That's why having a checker for `std::span` is more important for us now.
> > > Does this mean you'd recommend using a `span<char>` over `string_view` in all cases? I think that might surprise some users, so I wonder if it is worth documenting. 
> > Oh, I was sort of assuming we'd be suggesting `string_view` over `span<char>` when appropriate. So agreed that we should clearly document this as it may surprise folks.
> > 
> > Out of curiosity, why the strong preference for `span`?
> `string_view` is a bit harder to recommend automatically because whether a certain sequence of bytes constitutes a "string" is often a matter of opinion / user's point of view. `string_view`'s methods such as `find` will not necessarily make sense over original data, or may turn out to have counter-intuitive behavior when users try to modify the code further.
> 
> Additionally, `string_view` doesn't offer write access, so at best we'll be able to recommend it for `const` pointers.
> 
> So I think we should gather more data before considering it.
> string_view is a bit harder to recommend automatically because whether a certain sequence of bytes constitutes a "string" is often a matter of opinion / user's point of view.

But why does that matter for `string_view`? It's not a null-terminated interface, so it explicitly encodes a pointer/length interface, same as `span`.


================
Comment at: clang/docs/SafeBuffers.rst:128
+  ``+``, ``-``, ``+=``, ``-=``,
+      - unless that number is a compile time constant ``0``;
+      - subtraction between two pointers is also fine;
----------------
NoQ wrote:
> jkorous wrote:
> > aaron.ballman wrote:
> > > One case I think this is going to trip up is: `this + 1` (which is not uncommon in practice: https://sourcegraph.com/search?q=context:global+-file:.*test.*+file:.*%5C.cpp+this+%2B+1&patternType=standard). How do users fix up that pointer arithmetic and do they ever get any safety or security benefits from it? (Note, this doesn't apply to `+=`, just `+`.)
> > That's a fair point and if we don't find a good solution we should ignore arithmetic with `this` and a constant.
> I'll leave this unaddressed for now. It looks like these constructs typically *do* indicate that the remaining part of the object is a buffer. C++ doesn't seem to provide a good "safe" way to represent such objects. So, yeah, we need more data.
Fair enough -- it's worth documenting explicitly as an open question. FWIW, I agree that the construct is mostly used for accessing trailing objects as a buffer. The closest thing I've seen to a safe way to handle this is how LLVM does it with the `TrailingObjects` mixin.


================
Comment at: clang/docs/SafeBuffers.rst:132
+    attribute ``[[unsafe_buffer_usage]]``,
+      - unless the pointer is a compile time constant ``0`` or ``nullptr``;
+      - a number of C/C++ standard library buffer manipulation functions
----------------
NoQ wrote:
> aaron.ballman wrote:
> > NoQ wrote:
> > > aaron.ballman wrote:
> > > > `NULL` as well, I presume? (This matters for C where `NULL` often expands to `(void *)0`.)
> > > Hmm, I was thinking that `((void *)0)` is a compile-time constant `0`. Like, I didn't say it's a *literal*. Is there a better term?
> > Are you okay with constant expressions? e.g.,
> > ```
> > consteval int foo() { return 0; }
> > 
> > void func(char *ptr) {
> >   char *offset_ptr = ptr + foo(); // Is this fine?
> > }
> > ```
> Yes sure, why not? If it ever becomes unsafe, we'll immediately warn.
Ah, C terminology is biting me here -- we use the term "constant" the same way C++ uses the term "literal", so when I read "constant 0" my brain interpreted that as literally just `0`.


================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
NoQ wrote:
> aaron.ballman wrote:
> > jkorous wrote:
> > > NoQ wrote:
> > > > aaron.ballman wrote:
> > > > > jkorous wrote:
> > > > > > aaron.ballman wrote:
> > > > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".  But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > > > 
> > > > > > > I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give.
> > > > > > > 
> > > > > > > Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?)
> > > > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > > 
> > > > > > Our model depends on Standard Library used implementing the bounds checks.
> > > > > > With that the "container was used" distinction is absolutely crucial - that is what adds the bounds checks and already mitigates certain classes of bugs (even if the `span` passed in still has the same wrong size).
> > > > > > 
> > > > > > Let's assume we're starting with this buggy code:
> > > > > > 
> > > > > > ```
> > > > > > void caller() {
> > > > > >   int buffer[10];
> > > > > >   callee(buffer, 20);
> > > > > > }
> > > > > > 
> > > > > > void callee(int* ptr, unsigned size) {
> > > > > >   ptr[size-1] = 42;
> > > > > >   ptr[size] = 42;
> > > > > >   ptr[100] = 42;
> > > > > > }
> > > > > > ```
> > > > > > and transform it to (this is just a part of what we actually plan but hopefully illustrates the point):
> > > > > > ```
> > > > > > void caller() {
> > > > > >   int buffer[10];
> > > > > >   callee(std::span<int>(buffer, 20), 20);
> > > > > > }
> > > > > > 
> > > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > > >   ptr[size-1] = 42; // still unmitigated
> > > > > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > > }
> > > > > > ```
> > > > > > Our model depends on Standard Library used implementing the bounds checks.
> > > > > 
> > > > > So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?
> > > > > 
> > > > > > ```
> > > > > > void caller() {
> > > > > >   int buffer[10];
> > > > > >   callee(std::span<int>(buffer, 20), 20);
> > > > > > }
> > > > > > 
> > > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > > >   ptr[size-1] = 42; // still unmitigated
> > > > > >   ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > >   ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > > }
> > > > > > ```
> > > > > 
> > > > > Understood; my point was that the comments in `callee()` are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source.
> > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".
> > > > 
> > > > Yeah that's an important thing for us to properly formulate, let me try to elaborate on what I meant.
> > > > 
> > > > So we trust containers and spans, like we trust smart pointers: if you simply use them "normally", they're correct by construction. You have to go out of your way to get them wrong. Like, you take an array, it knows the size, you implicitly convert it to span, it's now well-formed, you pass it from function to function, it's still well-formed as it still carries the original bounds information, you subspan it... well this needs a runtime bounds check yeah, but other than that it's hard to get it wrong. And our tool is supposed to transform the entire program to do exactly this. And once we're there, none of these operations require manual intervention from the user, so there's very little room for error.
> > > > 
> > > > On the contrary, if bounds information is passed separately, there's a lot of room for error every single time it's passed from function to function. There's literally nothing that prevents such information from going out of sync. There's nothing that prevents the user from passing a size of data he *intends to be written* into the buffer, instead of the size of the buffer. So we don't trust that.
> > > > 
> > > > Does that make sense?
> > > The warnings about unsafe buffer usage would be valid no matter what Standard Library implementation is used and I can imagine the FixIts to exist in different flavors.
> > > We don't plan to target any other Standard Library implementation but the machinery we'll bring up should allow that to be implemented in the future.
> > > Does that make sense?
> > 
> > It does, but the situation I'm thinking of is where the user is migrating their unsafe code to this new programming mode and aren't able to use libc++ to get the extra hardening checks. Consider this contrived example:
> > ```
> > extern "C" void func(int *array, size_t count) {
> >   // Do buffer things with array and count
> > }
> > ```
> > The user can't (easily) modify this code to accept a span directly because that could potentially break ABI. So they wrap the data in a span:
> > ```
> > extern "C" void func(int *array, size_t count) {
> >   std::span<int> sp(array, count);
> >   // Do buffer things with sp
> > }
> > ```
> > In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses `span` -- but more trustworthy based on what? There's no hardened checks in the STL they're using and the potentially-wrong information is the same flavor of potentially-wrong in either case.
> > 
> > My fear is that we push users to make changes like that by telling them it's safer when it's actually functionally equivalent instead, but when they migrates their code they accidentally *introduce* a new bug where there wasn't one before.
> > 
> > I'm used to thinking about this sort of stuff as a kind of taint analysis. We need to know where that data comes from in order to know whether we can trust it or not. Did the "count" come from user input on the command line... or did it come from a call to `.size()` on some other container? These have different levels of trust (to me).
> > 
> > Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like `extern "C"` or the function linkage to try to reduce the amount of ABI breaks suggested?
> > In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses span
> 
> No, that's definitely not our intention. That's why I'm explicitly specifying that this assumption only applies to containers constructed *outside* the function. In your example both functions should wear the attribute. And there's no way to produce a safe alternative (which wouldn't need `[[unsafe_buffer_usage]]`) under `extern "C"` linkage.
> 
> But on the C side, there's no expectation of having a safe alternative in the first place. Calls to this function become just another unsafe operation, which we can hope to isolate and reduce, but never to eliminate. So it's fine that a safe alternative isn't provided at all.
> 
> > Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like extern "C" or the function linkage to try to reduce the amount of ABI breaks suggested?
> 
> So, yes, we want fixes on interface boundaries, but we only want fixes that don't break source or binary compatibility. So our fixes will always preserve the original interface, possibly annotate it as unsafe, and then provide a safe alternative that call sites can, but *don't have to*, transition to. So in this case we're aiming for something like this:
> 
> ```lang=c++
> [[clang::unsafe_buffer_usage]]
> extern "C" void func(int *array, size_t count) {
>   func(std::span<int>(array, count));
> }
> 
> void func(std::span<int> sp) {
>   // Do buffer things with sp
> }
> ```
> where the original function is preserved, the code is not duplicated, and the safe function doesn't have `extern "C"`. In this case the original function is still unsafe (wears the attribute) because the container isn't coming from outside.
>>In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses span
>No, that's definitely not our intention. That's why I'm explicitly specifying that this assumption only applies to containers constructed *outside* the function. In your example both functions should wear the attribute. And there's no way to produce a safe alternative (which wouldn't need [[unsafe_buffer_usage]]) under extern "C" linkage.

My point is that the assumption seems invalid to me -- it's a matter of garbage-in, garbage-out, but you're saying "this potentially garbage-in situation is assumed to not happen" and then building analysis checks around that assumption. I realize we have to make some heuristic decisions for the analysis, but this one seems dangerous for something attempting to improve buffer safety because nothing validates that assumption. Or do you have plans for the future to add an analysis to cover the situation of shoving garbage into the span?

To be clear, this is the situation I'm concerned about:
```
// SomeLibraryTU.cpp
void func(int *buffer, size_t count) { // Old interface for compatibility
  func(std::span<int>(buffer, count));
}

void func(std::span<int> sp) {  // New, safe interface
  if (!sp.empty()) {
    int *data = sp.data();
    // Do stuff on data (other than pointer arithmetic, only looking at first element)
  }
}

// SomeApplicationTU.cpp
int main() {
  func(0, 1); // Oops, wrong count!
}
```
My complaint is that the library code is perfectly safe but the application code is not, and it sounds like this situation is purposefully not considered as part of this analysis... but it seems like this situation is the common case we'd *want* to catch.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136811/new/

https://reviews.llvm.org/D136811



More information about the cfe-commits mailing list