[clang] [llvm] [Clang] Correct __builtin_dynamic_object_size for subobject types (PR #78526)

Thu Jan 18 18:00:20 PST 2024

bwendling wrote:

> Perhaps according to the GCC documentation as written. But mode 0 and 1 are in general asking for an upper bound on the accessible bytes (that is, an N so any access beyond N bytes is definitely out of bounds), so it seems to me that returning -1 is strictly worse than returning 48.

It's the second bit that controls whether it's an upper bound or lower bound that's returned, not the least significant bit.
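To make the bit layout concrete, here is a minimal sketch that decodes the `type` argument the way GCC's documentation describes it. The helper names are hypothetical (they are not part of GCC or Clang); only the meaning of the two bits and the "unknown" return values come from the documented behavior.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of GCC or Clang. */

/* Bit 0 set (modes 1 and 3): measure the closest surrounding sub-object
 * instead of the whole object. */
bool mode_uses_subobject(int type) { return (type & 1) != 0; }

/* Bit 1 set (modes 2 and 3): the builtin returns a lower bound (minimum)
 * rather than an upper bound (maximum). */
bool mode_is_lower_bound(int type) { return (type & 2) != 0; }

/* Value the builtin returns when the size cannot be determined:
 * (size_t)-1 for the upper-bound modes, 0 for the lower-bound modes. */
size_t mode_unknown_result(int type) {
    return mode_is_lower_bound(type) ? (size_t)0 : (size_t)-1;
}
```

So modes 0 and 1 differ only in whole object versus sub-object; both are upper-bound modes, and both report `(size_t)-1` when the size is unknown.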

> Do you have a use case for which -1 is a better answer?

I'm sure one could be constructed (for instance, someone wanting to check if it's okay to `strcpy` a string of size `48` to a pointer to `failed_devs[argc]` in the example above), but that's not really the point.

We're trying to implement a GNU builtin, and the only defined semantics we have to go on are GNU's documentation. I can't see how we can deviate from their documentation unless it's to say "we can't determine this value" and so return `-1` instead of an answer that might be wildly wrong and potentially cause a memory leak of some sort. In my made-up example, if we said, "Yes you can write up to 48 bytes into `p->failed_devs[argc]`," then a user may overwrite the two fields after `failed_devs`. If we return `-1`, they'll have to take the "slow", but ideally more secure, route.

Lastly, returning `48` in the made-up case is different from the case when an immediate is used rather than a variable, which is certainly unexpected. We could document that we can't support GNU's implementation exactly, but I don't think that's good enough.

From what I can tell, GNU's idea of what constitutes a "closest surrounding sub-object" is one of a handful of things: the structure containing the field, the array being pointed to within a struct, or the field being pointed into. Example of the last one:

```
struct foo {
    int a;
    int b;
    int c;
};

size_t foo(struct foo *p, int idx) {
    // b is the sub-object: take its address, then index into its bytes.
    return __builtin_dynamic_object_size(&((char *)&p->b)[idx], 1);
}
```

All of these are explicit in the LLVM IR. Is the worry that they've been changed by some transformations? Or are there others I'm missing?
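For the `struct foo` example above, the difference between the two readings can be worked out by hand. This is only a sketch of the arithmetic, not what any compiler is guaranteed to return: the helper names are hypothetical, and clamping out-of-range offsets to `0` is an assumption of the sketch.

```c
#include <stddef.h>

struct foo {
    int a;
    int b;
    int c;
};

/* Hypothetical helper: bytes from byte `idx` of `b` to the end of `b`,
 * i.e. the sub-object reading (type bit 0 set). */
size_t remaining_in_b(size_t idx) {
    /* sizeof(int) is the size of the field `b`. */
    return idx <= sizeof(int) ? sizeof(int) - idx : 0;
}

/* Hypothetical helper: bytes from byte `idx` of `b` to the end of the
 * whole struct, i.e. the whole-object reading (type bit 0 clear). */
size_t remaining_in_struct(size_t idx) {
    size_t start = offsetof(struct foo, b) + idx;
    return start <= sizeof(struct foo) ? sizeof(struct foo) - start : 0;
}
```

With `idx == 0`, the sub-object reading covers only `b`, while the whole-object reading also covers `c`.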

https://github.com/llvm/llvm-project/pull/78526