[lldb-dev] raw c strings in lldb

Wed Feb 11 14:45:35 PST 2015

> On 11.02.2015, at 23:04, Greg Clayton <gclayton at apple.com> wrote:
> 
> 
>> On Feb 11, 2015, at 11:23 AM, Zachary Turner <zturner at google.com> wrote:
>> 
>> 
>> 
>> On Wed Feb 11 2015 at 11:13:22 AM Greg Clayton <gclayton at apple.com> wrote:
>> 
>>> On Feb 11, 2015, at 10:18 AM, Zachary Turner <zturner at google.com> wrote:
>>> 
>>> A patch is up which fixes a number of issues with strings not being null terminated in LLDB.  The issues and error-proneness of using raw c strings is well documented and understood, so I would like to propose banning them in LLDB for new code [1].
>>> 
>>> I realize that's a tall order, and I don't expect all the occurrences of them to go away overnight, but issues related to strings have been popping up over and over again, and it would be nice if we could deal with more interesting problems.
>>> 
>>> LLVM has a very robust string manipulation library, and while it's not perfect, it would have prevented almost all of these issues from ever happening in the first place.  Here is a quick summary of common C string types / operations and their corresponding LLVM equivalents:
>>> 
>>> Types:
>>> char stack_buffer[n];        // Use llvm::SmallString<n>.
>> 
>> That would be fine.
>> 
>>> const char* foo;             // Use llvm::StringRef foo.
>> 
>> This may be fine as long as there is absolutely _no_ string manipulation on "foo" (like "strcat(...) etc. Use std::string for anything that requires storage. And we don't want to start using any code that does "let me assign an llvm::StringRef() with a ConstString("hello").GetCString() so I don't have to worry about ownership.  Adding string to the constant string pool should be reserved for things that actually need it (any const names like namespaces, variable names, type names, paths, etc) and also for when we return "const char *" through the public API.
>> 
>> This also means that we can't trust that "foo" is NULL terminated (llvm::StringRef objects have a data pointer and a length and aren't required to be terminated) and it means we must call "foo.str().c_str()" any time we require NULL termination. So I actually don't like llvm::StringRef as a replacement "const char *"...
> 
>> We can't assume that const char*s are null terminated either.  
> 
> Yes we can, that is was is assumed by const char * right now for every call we use it with.
> 
>> They're just pointers to raw buffers, which you may or may not have null terminated.  So we have the same issue.  
> 
> No we don't, it isn't part of the the way you normally use c-strings. llvm::StringRef doesn't need to NULL terminate as it is part of its design.
> 
>> By replacing all uses of const char* with StringRef(), we end up with exactly the same semantics.  null terminated strings are null terminated, non null terminated strings aren't.
> 
> But that isn't what is currently meant by _any_ of the functions that take a "const char *" with no length parameter, so it means that any switch to using llvm::StringRef requires we use ".str().c_str()" which I would like to avoid.
>> 
>> 
>> 
>>> Passing arguments;
>>> void foo(char *, int len);   // Use void foo(llvm::SmallVectorImpl<char> &buf);
>> 
>> I would prefer to use std::string here. Clearer and everyone knows how to use it. SmallVectors are nice, but they have the overhead of storing more data on the stack. Clang currently can create quite large stack frames when you have some code that has large switch statements where each switch statement has local variables. If those switch statements how SmallVector<char, PATH_MAX>, then you now have a case where if this code is called recursively we end up blowing the stack. We have seen this a lot in the DWARF parser, so I don't want to encourage using SmallVector for strings, since many places we use strings are for paths and other larger buffers. Any current std::string implementation can store up to 24 characters in the std::string without needing to do any allocations on the heap, so that essentially gets us a
> 
> 
>> SmallVector<char, 24>. So I vote std::string.
> 
> 
>> passing SmallVectorImpl<> is intended as a replacement for the case where you have this:
>> 
>> void foo(char *buf);
>> void bar() {
>>    char buffer[n];
>>    foo(buffer);
>> }
>> 
>> The equivalent LLVM code is this:
>> 
>> void foo(llvm::SmallVectorImpl<char> &buf);
>> void bar() {
>>    llvm::SmallString<n> buffer;
>>    foo(buffer);
>> }
>> 
>> Exactly the same amount of stack space.  It's not intended to be a general replacement for std::string, which stores its data entirely on the heap.  
> 
> std::string doesn't always store content on the heap. It stores the first ~20 bytes in the std::string itself (at least for any STL implementation that is worth anything). So I really don't see the need to start using a new string class when std::string is just about all we need and it will take up less space on the stack for reasons I mentioned above with switch statements (3 pointers at most).
> 
> 
>> I do think that many places in LLDB that currently use stack buffers could be re-written just as well using heap buffers, and for that I agree std::string is fine.
> 
> Agreed.
> 
>> 
>> 
>>> void foo(const char * foo);  // Use void foo(llvm::StringRef foo);
>> 
>> 
>> Again, I don't like this because you must call "foo.str().c_str()" in order to guarantee NULL termination.
> 
> 
>> As mentioned previously, in the const char * version you have to guarantee that it was initialized with a null terminated string in the first place.  The burden is the same either way.  Either we're already making the assumption in the callee, in which case we can continue to make it with the StringRef(), or we aren't making the assumption in the callee, in which case we can continue not making the assumption but have the added benefit of having the length field passed to us.
> 
> I disagree. Passing "const char*" assumes NULL termination unless a length field is also supplied. Passing llvm::StringRef as an object that is known to not NULL terminate, then we have can't assume termination. So I don't agree that passing llvm::StringRef makes anything better.
> 
>> 
>> 
>>> 
>>> Operations:
>>> strcpy, strncpy, etc         // Use member functions of SmallString / StringRef
>>> sprintf, snprintf, etc       // Use llvm::raw_ostream
>> 
>> We have StringStream, people should be using that. raw_ostream doesn't support all printf style formats. So all sprintf  and snprintf function calls should switch to using lldb_private::StreamString.
> 
>> I actually disagree here.  Printf() has plenty of problems, such as non-portable and confusing arguments,
> 
> I agree on this one, but we are getting around this with using PRI macros.
> 
>> lack of type safety,
> 
> most of this is taken care of when you have the printf format warnings on your class.
> 
>> possibility of buffer overruns.
> 
> There are no buffer overruns in the Stream::Printf(). Check the implementation if you need to.
> 
>> I would actually prefer to use raw_ostream instead of StringStream.
> 
> I don't. I have written tools using streaming style stuff before (like std::cout) and also written using printf style stuff and I find the printf style stuff more readable and maintainable. I also don't like streams that maintain state like the std::ostream does. Not sure if raw_ostream does any state management, but when they do it causes bugs later for code that doesn't always explicitly set the format, width, precision and much more:
> 
> std::ostream &os;
> os << std::hex << hex_num;
> dump(os);
> os << hex_num2;
> 
> little did we know the stream was modified by "dump()" to set the format to decimal so now you must always use:
> 
> os << std::hex << hex_num2;
> 
> (repeat for number width, precision and many other factors).

raw_ostream is designed to carry as little state as possible and is essentially a subset of lldb::Stream now (it also does printf-style formatting via llvm::format). At some point I pondered implementing StringStream on top of raw_ostream to improve interoperability but it required a lot of work for very little gain; the interfaces were subtly different. For example Stream prints numbers in hexadecimal when using operator<< with an unsigned integer, raw_ostream always uses decimal.

Unless someone wants to take the job of making raw_ostream and StringStream more compatible I'd recommend using Stream instead of llvm::raw_ostream in LLDB. That's what most of LLDB's code is using already and what its interfaces are built on. If all you do is pasting together strings and numbers both classes can be used in almost the same way, naming differences aside.

- Ben