[llvm-commits] PATCH: Adding an array constructor to StringRef

David Blaikie dblaikie at gmail.com
Mon Jun 18 09:56:30 PDT 2012


On Mon, Jun 18, 2012 at 9:48 AM, Marshall Clow <mclow.lists at gmail.com> wrote:
> On Jun 18, 2012, at 8:47 AM, David Blaikie wrote:
>> On Mon, Jun 18, 2012 at 8:36 AM, Marshall Clow <mclow.lists at gmail.com> wrote:
>>> Modeled on the one in ArrayRef
>>>
>>>    /// Construct an string ref from a C array.
>>>    template <size_t N>
>>>    /*implicit*/ StringRef(const char (&Arr)[N])
>>>      : Data(Arr), Length(N) {}
>>>
>>
>> One possible problem with this is that if you construct a StringRef
>> from a string literal, you'll end up including the '\0' inside it.
>
> Ok - I can see that.
>>
>> (also, when I tried this myself a few months ago, there's at least one
>> case where a StringRef is constructed from a large fixed-size buffer
>> on the stack - when using this ctor, it obviously gets the size quite
>> wrong)
>
> Interesting. Do you remember where that was?

Not precisely, but if you add the assertion I mentioned you should
find all the cases where your change produces different behavior &
will at least want to glance at each one to ensure that the change is
reasonable (I suspect some changes may be benign, but I doubt any are
actually desirable/reasonable - hence the change to truncate any
trailing '\0' to remain consistent with existing behavior without
having to change lots of existing callsites)

>> That being said, I like the idea, I'm just not sure the right way to
>> implement it - when I was experimenting with this I implemented it
>> with a '\0' check for the last character (truncating the size by one
>> when present, and leaving it at the full size when it wasn't present)
>
> Ok.
>
>> - and added an assert for strlen(Data) == Length to catch the buffer
>> case (which means the ctor didn't provide performance improvements in
>> debug builds but only in release builds).
>
> I see three problems here:
> 1) Constructing from string literal gives one too large. This is easily fixed.
> 2) Your case above.
> 3) The general case, where you have an array that may contain a NULL somewhere in the middle.
>
> (These last two might turn out to be the same case.)
>
> The second case, I don't see as a problem, given that the ( data, len ) constructor does not enforce the invariant that StringRef::Length () == strlen ( StringRef.data()).
> I know that StringRef.data() doesn't have to be NULL-terminated, but I don't see any care being taken to deal with embedded NULL characters, either.
>
> Maybe the confusion here is due to the fact that this would create a second one-argument constructor, and it would not be clear which was being chosen:
>
>        StringRef ( array )     - gives a reference to the entire array
>        StringRef ( pointer )  - gives a reference to the (null-terminated) string
>
> Ok - I'm not sure I see a good way to disambiguate those (or rather, for the user to know which is being called, given that arrays decay to pointers if you look at them funny)
>
> Dang.  Any ideas?

So the reasoning about StringRef not taking care to avoid embedded
'\0's is valid, but entirely relevant because, as you mention, the
issue is that existing callers are using StringRef(pointer) and
expecting certain behavior which will change in the presence of this
new ctor. So it's a matter of auditing those callers to make sure they
either don't care (unlikely) or are fixed (in the (2) I mentioned, it
was easy enough just to cast the array to a pointer explicitly & get
the old behavior).

Chandler's probably right though & that's one of the reasons I didn't
bother pursuing this further - mind you I never actually looked to
ensure that Clang was optimizing this well, but I figured the chances
were good enough that the benefit might not be worth my time to
investigate, etc.

- David




More information about the llvm-commits mailing list