[libcxx-commits] [libcxx] [libc++][hardening] Check bounds on arithmetic in __bounded_iter (PR #78876)

Sun Feb 25 02:13:09 PST 2024

var-const wrote:

Thank you for writing up the detailed rationale! I'm convinced, and I think it's very valuable to have this decision written for potential future reference.

> ...by assuming this doesn't happen, we put ourselves in a very precarious position. We _want_ the optimizer to delete redundant bounds checks, and we want to avoid, as you say, tying the implementation too much to one particular optimizer. Those pressures mean we will want optimizers to get increasingly good at bounds analysis. But if we exhibit this UB, we will simultaneously want them to not get _too_ good at it. I don't think that's a good place to be in.

I think this is a very good way to put it.

> Providing `a != b` is inherently easier than proving `a < b` because it's a weaker statement to prove.

Putting this and the part I quoted above, it seems to come down to "Ensuring that bounded iterators are always valid (point to a valid array element or to the one-past-the-end element) is a stronger invariant, and thus can be expected to optimize better and should never optimize worse". This seems like a good intuitive rationale to me (but please correct me if I'm simplifying things too much here).

> Could you elaborate a bit more on this concern? To me, assertions that are more straightforwardly deletable are _less_ tied to the optimizer, because it's more likely to be universal. Whereas assertions that rely on more complex reasoning that our optimizer happen to handle would be more tied to it. This PR _improves_ the situation by this criteria!

There was a comment (sorry, I can't immediately find it, might have been a different PR or issue) that mentioned that the following assertion in the constructor of the bounded iterator:
```cpp
_LIBCPP_ASSERT_INTERNAL(__begin <= __end, "__bounded_iter(current, begin, end): [begin, end) is not a valid range");
```
used to be enabled in the fast mode and lead to better codegen (because it established an invariant that the optimizer could rely on). My concern was prompted by this specific part, but it's more general (and, to be clear, it's hypothetical -- this is a potential situation that I hope we won't find ourselves in, not a comment on the changes in this patch).

> I would push back a bit on this. I don't think this higher-level intuition is correct. The additional checks *move the assertions closer to their preconditions*. That is inherently easier for any optimizer to handle.

What I'm trying to express is that, from the language specification perspective, I think it's fair to say the optimizer doesn't exist. Or put another way, the Standard deals with the abstract machine, and IIRC, the abstract machine doesn't really have a notion of an optimizer, or at least not in any well-defined, structured way. Of course, the abstract machine is a big simplification that doesn't also have a notion of many other vital things -- this is in no way to say that the optimizer isn't important. But I think it provides a useful level of abstraction, and at that level of abstraction it makes sense to talk about the overhead or performance of individual operations, but I think it's fair to say there is no way one operation can indirectly affect another.

In that regard, it makes complete sense to me to say "between these two possible ways of writing a certain assertion, this one would have better performance". This analysis might break the abstraction a little bit, but at the end of the day we know that the optimizer is there, so I think that's quite acceptable. Fundamentally, we're still within the model where we analyze the effect (including the performance effect) of standalone independent statements. However, I feel that a statement like "Adding a condition statement here has an indirect effect on the following statement and makes it more efficient" is simply incompatible with this model. This is the source of my concern.

At the same time, of course we have to be pragmatic and performance is crucial here. I just feel that if we need to start relying on indirect effects of some statements (not in this patch, but it seems likely that this will happen in the future) -- "indirect" from the perspective of the abstract machine -- we need to find some way to formalize it and make it visible to the reader. When I read an assertion, I presume it literally means "Check the condition and halt the program if it doesn't hold". In reality, it might also provide crucial information to the optimizer, but that's an indirect effect that's invisible at the source code level. Comments can help a lot, but if we can think of a way to make the actual intent visible in the code, that would be great.

Once again, this is just something that I think we should be mindful of as we go, not an immediate concern with this patch.

https://github.com/llvm/llvm-project/pull/78876