[PATCH] D136811: [-Wunsafe-buffer-usage] WIP: RFC: NFC: User documentation.
Aaron Ballman via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Wed Nov 30 05:19:12 PST 2022
aaron.ballman added inline comments.
================
Comment at: clang/docs/SafeBuffers.rst:40-41
+ - Finally, in order to avoid bugs in newly converted code, the
+ Clang static analyzer provides a checker to find misconstructed
+ ``std::span`` objects.
+
----------------
NoQ wrote:
> aaron.ballman wrote:
> > NoQ wrote:
> > > aaron.ballman wrote:
> > > > xazax.hun wrote:
> > > > > jkorous wrote:
> > > > > > NoQ wrote:
> > > > > > > xazax.hun wrote:
> > > > > > > > Isn't this overly specific? What about `string_view`s?
> > > > > > > Yeah I think we can cover string views as well!
> > > > > > This is based on our current plans for near future.
> > > > > > While we are positive we will use `std::span` in the FixIts it is very unlikely we'd use `string_view`. That's why having a checker for `std::span` is more important for us now.
> > > > > Does this mean you'd recommend using a `span<char>` over `string_view` in all cases? I think that might surprise some users, so I wonder if it is worth documenting.
> > > > Oh, I was sort of assuming we'd be suggesting `string_view` over `span<char>` when appropriate. So agreed that we should clearly document this as it may surprise folks.
> > > >
> > > > Out of curiosity, why the strong preference for `span`?
> > > `string_view` is a bit harder to recommend automatically because whether a certain sequence of bytes constitutes a "string" is often a matter of opinion / user's point of view. `string_view`'s methods such as `find` will not necessarily make sense over original data, or may turn out to have counter-intuitive behavior when users try to modify the code further.
> > >
> > > Additionally, `string_view` doesn't offer write access, so at best we'll be able to recommend it for `const` pointers.
> > >
> > > So I think we should gather more data before considering it.
> > > string_view is a bit harder to recommend automatically because whether a certain sequence of bytes constitutes a "string" is often a matter of opinion / user's point of view.
> >
> > But why does that matter for `string_view`? It's not a null-terminated interface, so it explicitly encodes a pointer/length interface, same as `span`.
> We can probably offer both variants in different note-fixits as soon as we teach the machine to recognize the different requirements for these strategies, and we can have `string_view` go first in the list?
>
> Like, I don't want to eliminate the question of programmer's intent from this. It's ok for the programmer to prefer `span<char>` in some cases, just to make the code more expressive, and we'll have to implement it anyway, so we might as well offer it. I definitely agree that `string_view` is the best choice when dealing with strings, and humans will absolutely prefer that and we should be recommending that to humans when applicable. I'm just not sure we should completely drop `span<char>` from automatic recommendations, and I admit that it'll take some extra work to implement recommending `string_view` instead, which we may or may not rush to implement asap.
> We can probably offer both variants in different note-fixits as soon as we teach the machine to recognize the different requirements for these strategies, and we can have string_view go first in the list?
I think that's reasonable enough; it doesn't have to do everything up front. So long as there's a plan (or at least an aspiration) to support multiple containers depending on context and preference, I'm happy enough. I had the (perhaps mistaken) impression this was heading towards *only* recommending `span`.
================
Comment at: clang/docs/SafeBuffers.rst:213
+
+The attribute is NOT warranted when the function has runtime protection against
+overflows, assuming hardened libc++, assuming that containers constructed
----------------
NoQ wrote:
> aaron.ballman wrote:
> > NoQ wrote:
> > > aaron.ballman wrote:
> > > > jkorous wrote:
> > > > > NoQ wrote:
> > > > > > aaron.ballman wrote:
> > > > > > > jkorous wrote:
> > > > > > > > aaron.ballman wrote:
> > > > > > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > > > > >
> > > > > > > > > I think we should have more details on why the STL used at runtime doesn't matter, or if it still really does matter, we may want to reconsider the advice we give.
> > > > > > > > >
> > > > > > > > > Also, we don't give similar advice for use of the pragmas. Should we? (Maybe split the advice as to when to use markings in general out into its own section and reference it from both the pragma and attribute sections?)
> > > > > > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used". But a significant portion of our users don't use libc++ (they use libstdc++ or MS STL for example).
> > > > > > > >
> > > > > > > > Our model depends on Standard Library used implementing the bounds checks.
> > > > > > > > With that the "container was used" distinction is absolutely crucial - that is what adds the bounds checks and already mitigates certain classes of bugs (even if the `span` passed in still has the same wrong size).
> > > > > > > >
> > > > > > > > Let's assume we're starting with this buggy code:
> > > > > > > >
> > > > > > > > ```
> > > > > > > > void caller() {
> > > > > > > > int buffer[10];
> > > > > > > > callee(buffer, 20);
> > > > > > > > }
> > > > > > > >
> > > > > > > > void callee(int* ptr, unsigned size) {
> > > > > > > > ptr[size-1] = 42;
> > > > > > > > ptr[size] = 42;
> > > > > > > > ptr[100] = 42;
> > > > > > > > }
> > > > > > > > ```
> > > > > > > > and transform it to (this is just a part of what we actually plan but hopefully illustrates the point):
> > > > > > > > ```
> > > > > > > > void caller() {
> > > > > > > > int buffer[10];
> > > > > > > > callee(std::span<int>(buffer, 20), 20);
> > > > > > > > }
> > > > > > > >
> > > > > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > > > > > ptr[size-1] = 42; // still unmitigated
> > > > > > > > ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > > > > ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > > > > }
> > > > > > > > ```
> > > > > > > > Our model depends on Standard Library used implementing the bounds checks.
> > > > > > >
> > > > > > > So will the model consider things like libstdc++ that don't implement the bounds checks? Or MSVC STL with Safe Library Support (https://learn.microsoft.com/en-us/cpp/standard-library/safe-libraries-cpp-standard-library?view=msvc-170) which seems to have some of the hardening you're talking about, but not all of it?
> > > > > > >
> > > > > > > > ```
> > > > > > > > void caller() {
> > > > > > > > int buffer[10];
> > > > > > > > callee(std::span<int>(buffer, 20), 20);
> > > > > > > > }
> > > > > > > >
> > > > > > > > void callee(std::span<int> ptr, unsigned size) {
> > > > > > > > ptr[size-1] = 42; // still unmitigated
> > > > > > > > ptr[size] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > > > > ptr[100] = 42; // mitigated at runtime by std::span::operator[]() in hardened libc++
> > > > > > > > }
> > > > > > > > ```
> > > > > > >
> > > > > > > Understood; my point was that the comments in `callee()` are incorrect in other STL implementations. The model seems to presume all STLs are hardened (or will be in the relatively near future) but I don't think that's a valid presumption (for example, STLs exist which are focused solely on performance, like EASTL). So that makes me question whether the model *should* presume that container state is valid when it comes from an untrusted source.
> > > > > > > One thing I think is worth pointing out about these docs is that the first example effectively says "do add the attribute because the size passed in to the function could be wrong" and the second example says "don't add the attribute on the assumption that the container has the correct size information". The advice feels a bit conflicting to me because in one case we're assuming callers pass in incorrect values and in the other case we're assuming callers pass in correct values and the only distinction between them is a "container was used".
> > > > > >
> > > > > > Yeah that's an important thing for us to properly formulate, let me try to elaborate on what I meant.
> > > > > >
> > > > > > So we trust containers and spans, like we trust smart pointers: if you simply use them "normally", they're correct by construction. You have to go out of your way to get them wrong. Like, you take an array, it knows the size, you implicitly convert it to span, it's now well-formed, you pass it from function to function, it's still well-formed as it still carries the original bounds information, you subspan it... well this needs a runtime bounds check yeah, but other than that it's hard to get it wrong. And our tool is supposed to transform the entire program to do exactly this. And once we're there, none of these operations require manual intervention from the user, so there's very little room for error.
> > > > > >
> > > > > > On the contrary, if bounds information is passed separately, there's a lot of room for error every single time it's passed from function to function. There's literally nothing that prevents such information from going out of sync. There's nothing that prevents the user from passing a size of data he *intends to be written* into the buffer, instead of the size of the buffer. So we don't trust that.
> > > > > >
> > > > > > Does that make sense?
> > > > > The warnings about unsafe buffer usage would be valid no matter what Standard Library implementation is used and I can imagine the FixIts to exist in different flavors.
> > > > > We don't plan to target any other Standard Library implementation but the machinery we'll bring up should allow that to be implemented in the future.
> > > > > Does that make sense?
> > > >
> > > > It does, but the situation I'm thinking of is where the user is migrating their unsafe code to this new programming mode and aren't able to use libc++ to get the extra hardening checks. Consider this contrived example:
> > > > ```
> > > > extern "C" void func(int *array, size_t count) {
> > > > // Do buffer things with array and count
> > > > }
> > > > ```
> > > > The user can't (easily) modify this code to accept a span directly because that could potentially break ABI. So they wrap the data in a span:
> > > > ```
> > > > extern "C" void func(int *array, size_t count) {
> > > > std::span<int> sp(array, count);
> > > > // Do buffer things with sp
> > > > }
> > > > ```
> > > > In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses `span` -- but more trustworthy based on what? There's no hardened checks in the STL they're using and the potentially-wrong information is the same flavor of potentially-wrong in either case.
> > > >
> > > > My fear is that we push users to make changes like that by telling them it's safer when it's actually functionally equivalent instead, but when they migrates their code they accidentally *introduce* a new bug where there wasn't one before.
> > > >
> > > > I'm used to thinking about this sort of stuff as a kind of taint analysis. We need to know where that data comes from in order to know whether we can trust it or not. Did the "count" come from user input on the command line... or did it come from a call to `.size()` on some other container? These have different levels of trust (to me).
> > > >
> > > > Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like `extern "C"` or the function linkage to try to reduce the amount of ABI breaks suggested?
> > > > In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses span
> > >
> > > No, that's definitely not our intention. That's why I'm explicitly specifying that this assumption only applies to containers constructed *outside* the function. In your example both functions should wear the attribute. And there's no way to produce a safe alternative (which wouldn't need `[[unsafe_buffer_usage]]`) under `extern "C"` linkage.
> > >
> > > But on the C side, there's no expectation of having a safe alternative in the first place. Calls to this function become just another unsafe operation, which we can hope to isolate and reduce, but never to eliminate. So it's fine that a safe alternative isn't provided at all.
> > >
> > > > Btw, this brings up a different question about fix-it hints. Are you planning to add fix-it hints to interface boundaries? If so, are you planning to pay attention to things like extern "C" or the function linkage to try to reduce the amount of ABI breaks suggested?
> > >
> > > So, yes, we want fixes on interface boundaries, but we only want fixes that don't break source or binary compatibility. So our fixes will always preserve the original interface, possibly annotate it as unsafe, and then provide a safe alternative that call sites can, but *don't have to*, transition to. So in this case we're aiming for something like this:
> > >
> > > ```lang=c++
> > > [[clang::unsafe_buffer_usage]]
> > > extern "C" void func(int *array, size_t count) {
> > > func(std::span<int>(array, count));
> > > }
> > >
> > > void func(std::span<int> sp) {
> > > // Do buffer things with sp
> > > }
> > > ```
> > > where the original function is preserved, the code is not duplicated, and the safe function doesn't have `extern "C"`. In this case the original function is still unsafe (wears the attribute) because the container isn't coming from outside.
> > >>In both cases, the user is taking it on faith that the data passed into the function is correct. But you want the user to believe the second case is more trustworthy because it uses span
> > >No, that's definitely not our intention. That's why I'm explicitly specifying that this assumption only applies to containers constructed *outside* the function. In your example both functions should wear the attribute. And there's no way to produce a safe alternative (which wouldn't need [[unsafe_buffer_usage]]) under extern "C" linkage.
> >
> > My point is that the assumption seems invalid to me -- it's a matter of garbage-in, garbage-out, but you're saying "this potentially garbage-in situation is assumed to not happen" and then building analysis checks around that assumption. I realize we have to make some heuristic decisions for the analysis, but this one seems dangerous for something attempting to improve buffer safety because nothing validates that assumption. Or do you have plans for the future to add an analysis to cover the situation of shoving garbage into the span?
> >
> > To be clear, this is the situation I'm concerned about:
> > ```
> > // SomeLibraryTU.cpp
> > void func(int *buffer, size_t count) { // Old interface for compatibility
> > func(std::span<int>(buffer, count));
> > }
> >
> > void func(std::span<int> sp) { // New, safe interface
> > if (!sp.empty()) {
> > int *data = sp.data();
> > // Do stuff on data (other than pointer arithmetic, only looking at first element)
> > }
> > }
> >
> > // SomeApplicationTU.cpp
> > int main() {
> > func(0, 1); // Oops, wrong count!
> > }
> > ```
> > My complaint is that the library code is perfectly safe but the application code is not, and it sounds like this situation is purposefully not considered as part of this analysis... but it seems like this situation is the common case we'd *want* to catch.
> > Or do you have plans for the future to add an analysis to cover the situation of shoving garbage into the span?
>
> We do have plans for a static analyzer checker that warns about shoving garbage into the span! - that's the last bullet point in the intro. We expect it to be disproportionally effective within this programming model because the programming model recommends constructing the span "as early as possible", pretty much immediately after allocation, which makes the problem very local and therefore easy to detect with path-sensitive analysis.
>
> We also think it's much harder to shove garbage into span in the first place. The size parameter of the span is extremely unambiguous, there's nothing else you can think of putting there. With an arbitrary function you see for the first time, it's tempting to pass eg. the amount of data you expect to be written into the destination buffer, instead of the amount of room in the destination buffer. Span eliminates this problem because you know that whatever you put into the size parameter is a property of the span itself, it has nothing to do with the function you pass it into. In a lot of cases you don't even *have to* specify the size, you can simply implicitly-convert your container into a span of the right size.
>
> That said, this isn't exactly why we make this distinction. What we're trying to do here is reduce *the number of times* a span needs to be manually constructed (and then decomposed back into pointer and size, only to be constructed again). Because our goal is to construct every span manually at most once, right next to the allocation. And when you're literally looking at the allocation, the size becomes *obvious*. And after that, by the virtue of conforming to the model there's no longer a need to use unsafe wrap-unwrap operations over those spans, so everything becomes correct by construction.
>
> So despite not addressing the problem of initial span construction directly, this choice decreases the amount of times the programmer needs to solve this problem, and it decreases the complexity of this problem by making the programmer solve it only at those few moments of time when it's the most convenient to solve.
>
> And, yes, a static analyzer checker on top of that.
>
> > To be clear, this is the situation I'm concerned about:
>
> Hmm I'm not sure I understand this example, how does the lack of buffer operations in the new function kicks in here? It's definitely an expected false negative when a buffer of exactly one element is always used in the code and we therefore think that it's not a buffer at all. But why would you want to use a span at all in such case?
>
> And regardless of this problem, if the library vendor consciously provides a new interface in order to intentionally make the code compliant with our model, then they'd also add an attribute on the old interface (but not on the new interface) according to the requirements we're discussing. In this case a misuse of the old interface will issue a warning "please use the new interface" and a misuse of the new interface becomes more difficult due to the above considerations, so everything appears to be working as intended(?)
> We do have plans for a static analyzer checker that warns about shoving garbage into the span! - that's the last bullet point in the intro.
Ah, perfect!
> We also think it's much harder to shove garbage into span in the first place. The size parameter of the span is extremely unambiguous, there's nothing else you can think of putting there.
Hehe, oh to be this optimistic still. :-D The scenario I expect to see garbage in span is from data that can be user-controlled (the exact stuff you're worried about, to be honest!): pulling in size data from the network or a file format, etc. Something where it's not at all accidental that it's garbage going into the span, but is actually malicious input from an attacker.
>> To be clear, this is the situation I'm concerned about:
> Hmm I'm not sure I understand this example, how does the lack of buffer operations in the new function kicks in here?
It doesn't have to lack buffer operations, I was just trying to make the simplest example I could and I probably obscured my point. My point is mostly that a library interface that takes buffer + size and then wraps it into a span to pass to a span interface is every bit as dangerous as not having the span interface at all. If the data going into the span is tainted, it's game over. So I worry about *assuming* the data in the span is valid and basing diagnostic decisions around that. However, because you're going to also have checkers specifically trying to catch the case where invalid data CAN be shoved into the span (like a taint analysis, I hope?), that basically resolves my concerns.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D136811/new/
https://reviews.llvm.org/D136811
More information about the cfe-commits
mailing list