Introduce StringBuilder utility around raw_string_ostream
David Blaikie
dblaikie at gmail.com
Mon Jul 7 18:56:20 PDT 2014
On Fri, Jul 4, 2014 at 2:37 PM, Alp Toker <alp at nuanti.com> wrote:
>
> On 02/07/2014 18:16, Alp Toker wrote:
>>
>>
>> On 01/07/2014 00:24, Alp Toker wrote:
>>>
>>>
>>> On 27/06/2014 10:06, Yaron Keren wrote:
>>>>
>>>> The LLVM ADTs has several common "themes" or preferences
>>>>
>>>> 1) Efficient by default, efforts to avoid or minimize heap allocation -
>>>> even at the cost of making debugging harder: pointers are not pointers but
>>>> packed with bits or shared with other pointers, thus not the correct type,
>>>> StringRef are not null terminated, etc. (We have visualizers to help.)
>>>> 2) Easy to use by default, compatible with the STL containers, simply
>>>> work.
>>>> 3) Open to "advanced" customization of 1) and 2) via template arguments
>>>> if required.
>>>>
>>>> Wether these features should be introduced in additional
>>>> raw_string_ostream functionality or a new string_ostream may be the result
>>>> of what's technically possible such that it
>>>>
>>>> 1) Uses a non-heap container (SmallString) by default.
>>>> 2) Is able to use other containers (std::string?) for clients who
>>>> provide it.
>>>> 3) Easy to use = standard string/ostream semantics where possible.
>>>> 4) Hard to misuse = hard/impossible to get into invalid/unsynchronized
>>>> state, does not require repeating code patterns.
>>>>
>>>
>>> Hi Yaron,
>>>
>>> Thanks for articulating that. It's a great observation that the best
>>> approach for string handling will be a synthesis of ADT's common themes.
>>>
>>> As far as I can tell it's technically not possible to cover a1,2,3 or
>>> b1,3,4 by extending raw_string_ostream, while I'm pretty sure we've got it
>>> all covered with the proposed string_ostream, especially the fact that we're
>>> now converging raw_svector_ostream and raw_string_ostream uses around a
>>> single type.
>>>
>>> David, ping?
>>>
>>> Alp.
>>>
>>>
>>>> Yaron
>>>>
>>>>
>>>>
>>>> 2014-06-27 4:20 GMT+03:00 Alp Toker <alp at nuanti.com
>>>> <mailto:alp at nuanti.com>>:
>>>>
>>>>
>>>> On 27/06/2014 02:50, David Blaikie wrote:
>>>>
>>>> On Thu, Jun 26, 2014 at 4:05 PM, Alp Toker <alp at nuanti.com
>>>> <mailto:alp at nuanti.com>> wrote:
>>>>
>>>> On 27/06/2014 01:42, David Blaikie wrote:
>>>>
>>>> On Thu, Jun 26, 2014 at 3:37 PM, Alp Toker
>>>> <alp at nuanti.com <mailto:alp at nuanti.com>> wrote:
>>>>
>>>> On 27/06/2014 00:48, David Blaikie wrote:
>>>>
>>>> On Wed, Jun 25, 2014 at 9:53 PM, David Blaikie
>>>> <dblaikie at gmail.com <mailto:dblaikie at gmail.com>>
>>>> wrote:
>>>>
>>>> On Wed, Jun 25, 2014 at 5:52 PM, Alp Toker
>>>> <alp at nuanti.com <mailto:alp at nuanti.com>>
>>>> wrote:
>>>>
>>>> On 26/06/2014 03:16, Chandler Carruth
>>>> wrote:
>>>>
>>>> On Thu, Jun 26, 2014 at 2:09 AM,
>>>> Alp Toker <alp at nuanti.com
>>>> <mailto:alp at nuanti.com>
>>>> <mailto:alp at nuanti.com
>>>> <mailto:alp at nuanti.com>>> wrote:
>>>>
>>>> I've made that change, plus
>>>> a couple more updated uses, and
>>>> landed
>>>> this in r211749.
>>>>
>>>> FWIW, perhaps I wasn't clear - but my "I'd
>>>> like to loop back around
>>>> with Chandler to nail down the design
>>>> choice" was "I'm not signing off
>>>> on this patch without Chandler", so the
>>>> commit may've been a bit
>>>> premature.
>>>>
>>>> Given this confusion, I think appropriate to
>>>> back this patch out until
>>>> it's actually been signed off on.
>>>>
>>>> Sorry for my vague language/lack of clarity
>>>> over incremental review
>>>> compared to sign-off.
>>>>
>>>>
>>>> I don't agree there was any problem with review
>>>> here. The patch got its
>>>> OKs
>>>> and looks greats,
>>>>
>>>> You asked me if it was OK to land. I gave some
>>>> feedback and said I'd
>>>> like to wait. You interpreted that as sign-off. I did
>>>> not intend it as
>>>> such. We seem to have failed to communicate and I'm
>>>> trying to clarify
>>>> what I intended to communicate.
>>>>
>>>> I'm sorry for the miscommunication, but I really
>>>> didn't mean to give
>>>> you the impression that this was ready to commit.
>>>> Please revert it.
>>>>
>>>>
>>>> There's obviously no way to revert this without losing the
>>>> bug fixes.
>>>>
>>>> Right - they can all be included together in whatever form this
>>>> is
>>>> recommitted. No worries about that. (happens all the time here
>>>> - given
>>>> we prefer fine grained commits (ideally so that sometimes they
>>>> can be
>>>> reverted separately, or at least examined separately when
>>>> looking
>>>> through commit history, etc) so sometimes a set of related
>>>> patches
>>>> have to be reverted together. Not a problem)
>>>>
>>>> I've backed them out together temporarily in r211814.
>>>>
>>>> Thanks!
>>>>
>>>> Could you state any concerns you have clearly?
>>>>
>>>> Sure - sorry, just didn't want to confuse code review protocol
>>>> with
>>>> the specific merits or caveats of this patch. Not even a
>>>> question of
>>>> whether I had concerns or not, just that I hadn't actually
>>>> signed off
>>>> on the patch. A simple misunderstanding. I'll try to be more
>>>> clear in
>>>> the future.
>>>>
>>>> A critical feature of the patch is guaranteed safety.
>>>> Where you see a
>>>> "string_ostream" type you get an assurance that the string
>>>> building
>>>> operations are valid.
>>>>
>>>> The difference here would just be "Where you see string_ostream
>>>> default constructed, you get an assurance that the string
>>>> building
>>>> operations are valid", right? (it's just "if you use this
>>>> constructor
>>>> that takes an externally-provided buffer, buyer beware" - tihs
>>>> is true
>>>> of many types (eg: std::unique_ptr's raw ctor, compared to
>>>> something
>>>> like make_unique)).
>>>>
>>>>
>>>> What you're suggesting would be equivalent to having
>>>> std::unique_ptr default-construct the pointee type which I'm
>>>> pretty sure it doesn't do?
>>>>
>>>> By having a separate type, we get to document the semantics
>>>> clearly, and search by type. We also get to hide the flush()
>>>> function directly via the type system.
>>>>
>>>> In your scheme that would end up tucked away in the documentation:
>>>>
>>>> // If you constructed with X, then behaviour is Y, otherwise
>>>> behaviour is Z.
>>>>
>>>> And even after documenting, I'm sure people would still feel
>>>> compelled to flush() before str() occasionally. With the proposed
>>>> patch the compiler issues a helpful diagnostic instead.
>>>>
>>>> (Note that std::stringstream *does* take a dual approach, and
>>>> suffers from consistency as a result -- 66 open question on '
>>>> stringstream' on StackOverflow at time of writing is high for a
>>>> single utility. This isn't really part of the standard C++ library
>>>> we want to emulate in LLVM -- we're much more conscientious about
>>>> our strings in general, and shouldn't drop the ball here.)
>>>>
>>>>
>>>> Those assurances apply even when the builder is passed by
>>>> reference. You can
>>>> see the clarity this adds for example in
>>>> LLVMCreateMessage() and a few other
>>>> places on ToT.
>>>>
>>>> The cleanup in LLVMCreateMessage and callers seems almost
>>>> entirely
>>>> orthogonal to your new type. The same API change could be made
>>>> even
>>>> with the old code:
>>>>
>>>> string s;
>>>> raw_string_ostream stream(s);
>>>> ...
>>>> return LLVMCreateMessage(stream);
>>>>
>>>>
>>>> Actually I don't think the form you suggest above particularly
>>>> safe. I'm glad you raised this as I chose the example intentionally.
>>>>
>>>> It's unconventional for a function to modify local variables in
>>>> parent stack frames that weren't explicitly named in the argument
>>>> list whichever way you bake it.
>>>>
>>>> No other code really does this in the LLVM source tree, and
>>>> although you and I immediately grok what's going on as C++
>>>> frontend developers, it's actually subtle and easy to miss. The
>>>> programming errors on ToT show just how easy it is to lose the
>>>> plot with that pattern.
>>>>
>>>> So my change here wasn't orthogonal -- in fact it's completely
>>>> aligned with and dependent upon the new safety guarantees
>>>> introduced in the patch. We absolutely need representation in the
>>>> type system so that we can enforce correctness.
>>>>
>>>>
>>>> LLVMCreateMessage is still safe, clear, and correct using
>>>> either API
>>>> because str(), even on raw_string_ostream as it is today, is
>>>> still
>>>> well defined to flush and then give access to the string.
>>>>
>>>> Or, with a single type, it'd just be:
>>>>
>>>> raw_string_ostream stream;
>>>>
>>>> instead of
>>>>
>>>> string_ostream stream;
>>>>
>>>> Or is there some other clarity benefit you were referring to?
>>>>
>>>> (with the same LLVMCreateMessage change you made (though I
>>>> personally
>>>> would have LLVMCreateMessage take a const std::string& - it
>>>> seems more
>>>> appropriate for a variety of callers, even if the current
>>>> callers all
>>>> happen to have raw_string_ostreams at the moment - indeed,
>>>> perhaps
>>>> some of them should just be using string concatenation anyway))
>>>>
>>>> Importantly, there is no locality of construction so *the
>>>> type becomes the
>>>> documentation*.
>>>>
>>>> Generally it's only important to know this distinction in the
>>>> scope
>>>> where the type is declared (because it's that scope where the
>>>> external
>>>> storage is provided) - once the stream is passed to some other
>>>> API,
>>>> the only access to the string is likely going to be through
>>>> raw_string_ostream's "str()" member, which does the right thing.
>>>>
>>>>
>>>> See above.
>>>>
>>>>
>>>>
>>>> If we were to lump storage into raw_string_ostream, this
>>>> strong safety
>>>> indicated by the type system would disappear along with
>>>> all the heap
>>>> savings.
>>>>
>>>> Heap savings can be reacquired where /really/ necessary by using
>>>> external storage, if necessary. By building in storage, that
>>>> would
>>>> become the exception rather than the rule, and something we
>>>> could more
>>>> closely scrutinize during code review.
>>>>
>>>> (which simply aren't negligible compared to some of the
>>>> effort going into
>>>> bit packing elsewhere in the codebase).
>>>>
>>>> Apples and oranges, I suspect (& wouldn't mind some evidence
>>>> to the
>>>> contrary before we proliferate more choices in data structure
>>>> type on
>>>> the codebase - we have a lot, many for good reason, but there is
>>>> a
>>>> cognitive/code review/etc load we pay by adding more). The bit
>>>> packing
>>>> sort of stuff happens in data structures that there are often
>>>> thousands upon thousands of in a single run, or at a single
>>>> point in
>>>> time so they have a significant impact on peak memory usage. I
>>>> doubt
>>>> we have more than a handful of raw_string_ostreams in existence
>>>> simultaneously.
>>>>
>>>>
>>>> I'd like numbers too, but from visual inspection we can already
>>>> see that this introduces some worthwhile savings.
>>>>
>>>> string_ostream basically just encourages and simplifies our best
>>>> practice of using SmallString<> on the stack, and that's never
>>>> been in doubt.
>>>>
>>>> In the patch you can see we've fixed truncation issues in code
>>>> that was previously storing in an external SmallString<>, not just
>>>> std::string. Chandler's proposed raw_string_ostream wouldn't be a
>>>> suitable replacement for those uses, missing about 50% of the
>>>> potential fixes where we want to use this, particularly in clang
>>>> and LLVM's IR facility. In my patch, we cover the lot with a
>>>> single type.
>>>>
>>>> I saw you discussing how to avoid heap-allocation of a *singleton*
>>>> recently on this list so I know you care at least a little about
>>>> this matter :-)
>>>>
>>>>
>>>> If we mix semantics in the base, we'd basically just end
>>>> up a raw string
>>>> ostream that sometimes isn't raw, even making the name of
>>>> the type a
>>>> little dubious.
>>>>
>>>> (On naming: as Chandler mentioned, these are all raw streams -
>>>> even
>>>> the new ones you're proposing, the "raw"ness is intended to
>>>> distinguish them from the C++ Standard stream libraries that
>>>> have
>>>> streambufs and all that other stuff, not about whether they have
>>>> internal storage or not)
>>>>
>>>>
>>>> Bike shed. I'm happy with what's in the patch, but still open to
>>>> better names.
>>>>
>>>> Does this go some way to covering your remaining concerns?
>>
>>
>> Ping
>
>
> Re-ping.
Sorry this is taking some time to get to, been trying to find the
mental bandwidth and fortitude to address the above discussion.
> This is starting to block other work..
I don't quite see how. This may be a nice cleanup, but it doesn't seem
to fundamentally make anything possible that wasn't before, even when
it comes to cleanup it has a fairly limited impact of changing two
lines to one in a few places.
Which is the crux of my objection/concern here - it doesn't seem
worthwhile to add new types for that particular win.
I appreciate the benefit of safe types that you can see, at a glance,
are safe (hence my desire to move towards std::make_shared, for
example - and my desire to remove a pimpl you mentioned (not to save a
heap allocation as you characterized - it doesn't actually phase me
much, but the increased cognitive load/code complexity did, and adding
new types has the possibility of adding more of that)).
It seems we already have two types (raw_string_ostream and
raw_svector_ostream) and then we'll have two more (string_ostream and
svector_ostream - and I agree with Chandler here, the "raw" prefix
cannot be dropped (it's there to separate these streams from the
standard streams, since they're not cross compatible) nor used as a
distinguisher (it doesn't tell someone reading the code what's
different between these two kinds of types)).
I suspect the right thing might be a raw_string_ostream<T =
std::string>, we could typedef the svector case, or just write it out
explicitly depending on the number of instances (I suspect writing it
out, possibly using decltype, would be helpful - since I assume the
existing uses of raw_svector_ostream were due to existing small
vectors that had certain sizes).
And I'm still somewhat inclined towards optional ownership defined at
construction time - I don't really see a major benefit to type safety
far from the construction site.
If non-owning streams are as high-risk as they appear, we could
separate out their construction into a factory function so that it's
more explicit:
raw_string_ostream<> s; //owning
std::string buf;
raw_string_ostream<> non_owning =
make_non_owning_raw_string_ostream(buf); //non-owning
Though, at that point it could be a different type and just always
written as "auto non_owning" and no one would probably care. As long
as none of these types appear on any API interface. If we need a
"str()" function that provides the current buffer on the interface,
then I'd like that to be in /one/ place, not 4. It'll be really
unfortunate if we need that, have functions that take a
raw_owning_string_ostream because they want to access the string
buffer, but then that function can't be passed a
raw_non_owning_string_ostream, etc... I don't think that's a good API
design.
> I've gone ahead and fixed further issues currently existing on ToT as
> pointed out by Benjamin Kramer (small buffer sizes cause inefficiency with
> raw_svector_ostream).
>
> There are around 30 bug fixes for obviously flawed debug/IR dump/performance
> and truncation issues in this patch and I've endeavoured to fix more issues
> currently existing on ToT but really we should be fixing those as subsequent
> commits instead of trying to lump it all into this patch, because it's
> becoming unwieldy to maintain.
Those bugs can be (fairly easily) fixed without this patch, so far as
I understand it. So you can hold them on this patch or not, as it
suits.
> Are we good to go?
>
> Alp.
>
>
>
>>
>>
>>>>
>>>> Alp.
>>>>
>>>>
>>>>
>>>> - David
>>>>
>>>>
>>>> -- http://www.nuanti.com
>>>> the browser experts
>>>>
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>
>>>>
>>>
>>
>
> --
> http://www.nuanti.com
> the browser experts
>
More information about the llvm-commits
mailing list