Introduce StringBuilder utility around raw_string_ostream

Mon Jul 7 18:56:20 PDT 2014

On Fri, Jul 4, 2014 at 2:37 PM, Alp Toker <alp at nuanti.com> wrote:
>
> On 02/07/2014 18:16, Alp Toker wrote:
>>
>>
>> On 01/07/2014 00:24, Alp Toker wrote:
>>>
>>>
>>> On 27/06/2014 10:06, Yaron Keren wrote:
>>>>
>>>> The LLVM ADTs has several common "themes" or  preferences
>>>>
>>>> 1) Efficient by default, efforts to avoid or minimize heap allocation -
>>>> even at the cost of making debugging harder: pointers are not pointers but
>>>> packed with bits or shared with other pointers, thus not the correct type,
>>>> StringRef are not null terminated, etc. (We  have visualizers to help.)
>>>> 2) Easy to use by default, compatible with the STL containers, simply
>>>> work.
>>>> 3) Open to "advanced" customization of 1) and 2) via template arguments
>>>> if required.
>>>>
>>>> Wether these features should be introduced in additional
>>>> raw_string_ostream functionality or a new string_ostream may be the result
>>>> of what's technically possible such that it
>>>>
>>>> 1) Uses a non-heap container (SmallString) by default.
>>>> 2) Is able to use other containers (std::string?) for clients who
>>>> provide it.
>>>> 3) Easy to use = standard string/ostream semantics where possible.
>>>> 4) Hard to misuse = hard/impossible to get into invalid/unsynchronized
>>>> state, does not require repeating code patterns.
>>>>
>>>
>>> Hi Yaron,
>>>
>>> Thanks for articulating that. It's a great observation that the best
>>> approach for string handling will be a synthesis of ADT's common themes.
>>>
>>> As far as I can tell it's technically not possible to cover a1,2,3 or
>>> b1,3,4 by extending raw_string_ostream, while I'm pretty sure we've got it
>>> all covered with the proposed string_ostream, especially the fact that we're
>>> now converging raw_svector_ostream and raw_string_ostream uses around a
>>> single type.
>>>
>>> David, ping?
>>>
>>> Alp.
>>>
>>>
>>>> Yaron
>>>>
>>>>
>>>>
>>>> 2014-06-27 4:20 GMT+03:00 Alp Toker <alp at nuanti.com
>>>> <mailto:alp at nuanti.com>>:
>>>>
>>>>
>>>>     On 27/06/2014 02:50, David Blaikie wrote:
>>>>
>>>>         On Thu, Jun 26, 2014 at 4:05 PM, Alp Toker <alp at nuanti.com
>>>>         <mailto:alp at nuanti.com>> wrote:
>>>>
>>>>             On 27/06/2014 01:42, David Blaikie wrote:
>>>>
>>>>                 On Thu, Jun 26, 2014 at 3:37 PM, Alp Toker
>>>>                 <alp at nuanti.com <mailto:alp at nuanti.com>> wrote:
>>>>
>>>>                     On 27/06/2014 00:48, David Blaikie wrote:
>>>>
>>>>                         On Wed, Jun 25, 2014 at 9:53 PM, David Blaikie
>>>>                         <dblaikie at gmail.com <mailto:dblaikie at gmail.com>>
>>>>                         wrote:
>>>>
>>>>                             On Wed, Jun 25, 2014 at 5:52 PM, Alp Toker
>>>>                             <alp at nuanti.com <mailto:alp at nuanti.com>>
>>>>                             wrote:
>>>>
>>>>                                 On 26/06/2014 03:16, Chandler Carruth
>>>>                                 wrote:
>>>>
>>>>                                     On Thu, Jun 26, 2014 at 2:09 AM,
>>>>                                     Alp Toker <alp at nuanti.com
>>>> <mailto:alp at nuanti.com>
>>>>                                     <mailto:alp at nuanti.com
>>>> <mailto:alp at nuanti.com>>> wrote:
>>>>
>>>>                                            I've made that change, plus
>>>>                                     a couple more updated uses, and
>>>>                                     landed
>>>>                                            this in r211749.
>>>>
>>>>                             FWIW, perhaps I wasn't clear - but my "I'd
>>>>                             like to loop back around
>>>>                             with Chandler to nail down the design
>>>>                             choice" was "I'm not signing off
>>>>                             on this patch without Chandler", so the
>>>>                             commit may've been a bit
>>>>                             premature.
>>>>
>>>>                         Given this confusion, I think appropriate to
>>>>                         back this patch out until
>>>>                         it's actually been signed off on.
>>>>
>>>>                         Sorry for my vague language/lack of clarity
>>>>                         over incremental review
>>>>                         compared to sign-off.
>>>>
>>>>
>>>>                     I don't agree there was any problem with review
>>>>                     here. The patch got its
>>>>                     OKs
>>>>                     and looks greats,
>>>>
>>>>                 You asked me if it was OK to land. I gave some
>>>>                 feedback and said I'd
>>>>                 like to wait. You interpreted that as sign-off. I did
>>>>                 not intend it as
>>>>                 such. We seem to have failed to communicate and I'm
>>>>                 trying to clarify
>>>>                 what I intended to communicate.
>>>>
>>>>                 I'm sorry for the miscommunication, but I really
>>>>                 didn't mean to give
>>>>                 you the impression that this was ready to commit.
>>>>                 Please revert it.
>>>>
>>>>
>>>>             There's obviously no way to revert this without losing the
>>>>             bug fixes.
>>>>
>>>>         Right - they can all be included together in whatever form this
>>>> is
>>>>         recommitted. No worries about that. (happens all the time here
>>>>         - given
>>>>         we prefer fine grained commits (ideally so that sometimes they
>>>>         can be
>>>>         reverted separately, or at least examined separately when
>>>> looking
>>>>         through commit history, etc) so sometimes a set of related
>>>> patches
>>>>         have to be reverted together. Not a problem)
>>>>
>>>>             I've backed them out together temporarily in r211814.
>>>>
>>>>         Thanks!
>>>>
>>>>             Could you state any concerns you have clearly?
>>>>
>>>>         Sure - sorry, just didn't want to confuse code review protocol
>>>>         with
>>>>         the specific merits or caveats of this patch. Not even a
>>>>         question of
>>>>         whether I had concerns or not, just that I hadn't actually
>>>>         signed off
>>>>         on the patch. A simple misunderstanding. I'll try to be more
>>>>         clear in
>>>>         the future.
>>>>
>>>>             A critical feature of the patch is guaranteed safety.
>>>>             Where you see a
>>>>             "string_ostream" type you get an assurance that the string
>>>>             building
>>>>             operations are valid.
>>>>
>>>>         The difference here would just be "Where you see string_ostream
>>>>         default constructed, you get an assurance that the string
>>>> building
>>>>         operations are valid", right? (it's just "if you use this
>>>>         constructor
>>>>         that takes an externally-provided buffer, buyer beware" - tihs
>>>>         is true
>>>>         of many types (eg: std::unique_ptr's raw ctor, compared to
>>>>         something
>>>>         like make_unique)).
>>>>
>>>>
>>>>     What you're suggesting would be equivalent to having
>>>>     std::unique_ptr default-construct the pointee type which I'm
>>>>     pretty sure it doesn't do?
>>>>
>>>>     By having a separate type, we get to document the semantics
>>>>     clearly, and search by type. We also get to hide the flush()
>>>>     function directly via the type system.
>>>>
>>>>     In your scheme that would end up tucked away in the documentation:
>>>>
>>>>       // If you constructed with X, then behaviour is Y, otherwise
>>>>     behaviour is Z.
>>>>
>>>>     And even after documenting, I'm sure people would still feel
>>>>     compelled to flush() before str() occasionally. With the proposed
>>>>     patch the compiler issues a helpful diagnostic instead.
>>>>
>>>>     (Note that std::stringstream *does* take a dual approach, and
>>>>     suffers from consistency as a result -- 66 open question on '
>>>>     stringstream' on StackOverflow at time of writing is high for a
>>>>     single utility. This isn't really part of the standard C++ library
>>>>     we want to emulate in LLVM -- we're much more conscientious about
>>>>     our strings in general, and shouldn't drop the ball here.)
>>>>
>>>>
>>>>             Those assurances apply even when the builder is passed by
>>>>             reference. You can
>>>>             see the clarity this adds for example in
>>>>             LLVMCreateMessage() and a few other
>>>>             places on ToT.
>>>>
>>>>         The cleanup in LLVMCreateMessage and callers seems almost
>>>> entirely
>>>>         orthogonal to your new type. The same API change could be made
>>>>         even
>>>>         with the old code:
>>>>
>>>>            string s;
>>>>            raw_string_ostream stream(s);
>>>>            ...
>>>>            return LLVMCreateMessage(stream);
>>>>
>>>>
>>>>     Actually I don't think the form you suggest above particularly
>>>>     safe. I'm glad you raised this as I chose the example intentionally.
>>>>
>>>>     It's unconventional for a function to modify local variables in
>>>>     parent stack frames that weren't explicitly named in the argument
>>>>     list whichever way you bake it.
>>>>
>>>>     No other code really does this in the LLVM source tree, and
>>>>     although you and I immediately grok what's going on as C++
>>>>     frontend developers, it's actually subtle and easy to miss. The
>>>>     programming errors on ToT show just how easy it is to lose the
>>>>     plot with that pattern.
>>>>
>>>>     So my change here wasn't orthogonal -- in fact it's completely
>>>>     aligned with and dependent upon the new safety guarantees
>>>>     introduced in the patch. We absolutely need representation in the
>>>>     type system so that we can enforce correctness.
>>>>
>>>>
>>>>         LLVMCreateMessage is still safe, clear, and correct using
>>>>         either API
>>>>         because str(), even on raw_string_ostream as it is today, is
>>>> still
>>>>         well defined to flush and then give access to the string.
>>>>
>>>>         Or, with a single type, it'd just be:
>>>>
>>>>            raw_string_ostream stream;
>>>>
>>>>         instead of
>>>>
>>>>            string_ostream stream;
>>>>
>>>>         Or is there some other clarity benefit you were referring to?
>>>>
>>>>         (with the same LLVMCreateMessage change you made (though I
>>>>         personally
>>>>         would have LLVMCreateMessage take a const std::string& - it
>>>>         seems more
>>>>         appropriate for a variety of callers, even if the current
>>>>         callers all
>>>>         happen to have raw_string_ostreams at the moment - indeed,
>>>> perhaps
>>>>         some of them should just be using string concatenation anyway))
>>>>
>>>>             Importantly, there is no locality of construction so *the
>>>>             type becomes the
>>>>             documentation*.
>>>>
>>>>         Generally it's only important to know this distinction in the
>>>>         scope
>>>>         where the type is declared (because it's that scope where the
>>>>         external
>>>>         storage is provided) - once the stream is passed to some other
>>>>         API,
>>>>         the only access to the string is likely going to be through
>>>>         raw_string_ostream's "str()" member, which does the right thing.
>>>>
>>>>
>>>>     See above.
>>>>
>>>>
>>>>
>>>>             If we were to lump storage into raw_string_ostream, this
>>>>             strong safety
>>>>             indicated by the type system would disappear along with
>>>>             all the heap
>>>>             savings.
>>>>
>>>>         Heap savings can be reacquired where /really/ necessary by using
>>>>         external storage, if necessary. By building in storage, that
>>>> would
>>>>         become the exception rather than the rule, and something we
>>>>         could more
>>>>         closely scrutinize during code review.
>>>>
>>>>             (which simply aren't negligible compared to some of the
>>>>             effort going into
>>>>             bit packing elsewhere in the codebase).
>>>>
>>>>         Apples and oranges, I suspect (& wouldn't mind some evidence
>>>>         to the
>>>>         contrary before we proliferate more choices in data structure
>>>>         type on
>>>>         the codebase - we have a lot, many for good reason, but there is
>>>> a
>>>>         cognitive/code review/etc load we pay by adding more). The bit
>>>>         packing
>>>>         sort of stuff happens in data structures that there are often
>>>>         thousands upon thousands of in a single run, or at a single
>>>>         point in
>>>>         time so they have a significant impact on peak memory usage. I
>>>>         doubt
>>>>         we have more than a handful of raw_string_ostreams in existence
>>>>         simultaneously.
>>>>
>>>>
>>>>     I'd like numbers too, but from visual inspection we can already
>>>>     see that this introduces some worthwhile savings.
>>>>
>>>>     string_ostream basically just encourages and simplifies our best
>>>>     practice of using SmallString<> on the stack, and that's never
>>>>     been in doubt.
>>>>
>>>>     In the patch you can see we've fixed truncation issues in code
>>>>     that was previously storing in an external SmallString<>, not just
>>>>     std::string. Chandler's proposed raw_string_ostream wouldn't be a
>>>>     suitable replacement for those uses, missing about 50% of the
>>>>     potential fixes where we want to use this, particularly in clang
>>>>     and LLVM's IR facility. In my patch, we cover the lot with a
>>>>     single type.
>>>>
>>>>     I saw you discussing how to avoid heap-allocation of a *singleton*
>>>>     recently on this list so I know you care at least a little about
>>>>     this matter :-)
>>>>
>>>>
>>>>             If we mix semantics in the base, we'd basically just end
>>>>             up a raw string
>>>>             ostream that sometimes isn't raw,  even making the name of
>>>>             the type a
>>>>             little dubious.
>>>>
>>>>         (On naming: as Chandler mentioned, these are all raw streams -
>>>>         even
>>>>         the new ones you're proposing, the "raw"ness is intended to
>>>>         distinguish them from the C++ Standard stream libraries that
>>>> have
>>>>         streambufs and all that other stuff, not about whether they have
>>>>         internal storage or not)
>>>>
>>>>
>>>>     Bike shed. I'm happy with what's in the patch, but still open to
>>>>     better names.
>>>>
>>>>     Does this go some way to covering your remaining concerns?
>>
>>
>> Ping
>
>
> Re-ping.

Sorry this is taking some time to get to, been trying to find the
mental bandwidth and fortitude to address the above discussion.

> This is starting to block other work..

I don't quite see how. This may be a nice cleanup, but it doesn't seem
to fundamentally make anything possible that wasn't before, even when
it comes to cleanup it has a fairly limited impact of changing two
lines to one in a few places.

Which is the crux of my objection/concern here - it doesn't seem
worthwhile to add new types for that particular win.

I appreciate the benefit of safe types that you can see, at a glance,
are safe (hence my desire to move towards std::make_shared, for
example - and my desire to remove a pimpl you mentioned (not to save a
heap allocation as you characterized - it doesn't actually phase me
much, but the increased cognitive load/code complexity did, and adding
new types has the possibility of adding more of that)).

It seems we already have two types (raw_string_ostream and
raw_svector_ostream) and then we'll have two more (string_ostream and
svector_ostream - and I agree with Chandler here, the "raw" prefix
cannot be dropped (it's there to separate these streams from the
standard streams, since they're not cross compatible) nor used as a
distinguisher (it doesn't tell someone reading the code what's
different between these two kinds of types)).

I suspect the right thing might be a raw_string_ostream<T =
std::string>, we could typedef the svector case, or just write it out
explicitly depending on the number of instances (I suspect writing it
out, possibly using decltype, would be helpful - since I assume the
existing uses of raw_svector_ostream were due to existing small
vectors that had certain sizes).

And I'm still somewhat inclined towards optional ownership defined at
construction time - I don't really see a major benefit to type safety
far from the construction site.

If non-owning streams are as high-risk as they appear, we could
separate out their construction into a factory function so that it's
more explicit:

raw_string_ostream<> s; //owning

std::string buf;
raw_string_ostream<> non_owning =
make_non_owning_raw_string_ostream(buf); //non-owning

Though, at that point it could be a different type and just always
written as "auto non_owning" and no one would probably care. As long
as none of these types appear on any API interface. If we need a
"str()" function that provides the current buffer on the interface,
then I'd like that to be in /one/ place, not 4. It'll be really
unfortunate if we need that, have functions that take a
raw_owning_string_ostream because they want to access the string
buffer, but then that function can't be passed a
raw_non_owning_string_ostream, etc... I don't think that's a good API
design.

> I've gone ahead and fixed further issues currently existing on ToT as
> pointed out by Benjamin Kramer (small buffer sizes cause inefficiency with
> raw_svector_ostream).
>
> There are around 30 bug fixes for obviously flawed debug/IR dump/performance
> and truncation issues in this patch and I've endeavoured to fix more issues
> currently existing on ToT but really we should be fixing those as subsequent
> commits instead of trying to lump it all into this patch, because it's
> becoming unwieldy to maintain.

Those bugs can be (fairly easily) fixed without this patch, so far as
I understand it. So you can hold them on this patch or not, as it
suits.

> Are we good to go?
>
> Alp.
>
>
>
>>
>>
>>>>
>>>>     Alp.
>>>>
>>>>
>>>>
>>>>         - David
>>>>
>>>>
>>>>     --     http://www.nuanti.com
>>>>     the browser experts
>>>>
>>>>     _______________________________________________
>>>>     llvm-commits mailing list
>>>>     llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>>>>     http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>
>>>>
>>>
>>
>
> --
> http://www.nuanti.com
> the browser experts
>