Introduce StringBuilder utility around raw_string_ostream

Alp Toker alp at nuanti.com
Fri Jul 4 14:37:19 PDT 2014


On 02/07/2014 18:16, Alp Toker wrote:
>
> On 01/07/2014 00:24, Alp Toker wrote:
>>
>> On 27/06/2014 10:06, Yaron Keren wrote:
>>> The LLVM ADTs has several common "themes" or  preferences
>>>
>>> 1) Efficient by default, efforts to avoid or minimize heap 
>>> allocation - even at the cost of making debugging harder: pointers 
>>> are not pointers but packed with bits or shared with other pointers, 
>>> thus not the correct type, StringRef are not null terminated, etc. 
>>> (We  have visualizers to help.)
>>> 2) Easy to use by default, compatible with the STL containers, 
>>> simply work.
>>> 3) Open to "advanced" customization of 1) and 2) via template 
>>> arguments if required.
>>>
>>> Wether these features should be introduced in additional 
>>> raw_string_ostream functionality or a new string_ostream may be the 
>>> result of what's technically possible such that it
>>>
>>> 1) Uses a non-heap container (SmallString) by default.
>>> 2) Is able to use other containers (std::string?) for clients who 
>>> provide it.
>>> 3) Easy to use = standard string/ostream semantics where possible.
>>> 4) Hard to misuse = hard/impossible to get into 
>>> invalid/unsynchronized state, does not require repeating code patterns.
>>>
>>
>> Hi Yaron,
>>
>> Thanks for articulating that. It's a great observation that the best 
>> approach for string handling will be a synthesis of ADT's common themes.
>>
>> As far as I can tell it's technically not possible to cover a1,2,3 or 
>> b1,3,4 by extending raw_string_ostream, while I'm pretty sure we've 
>> got it all covered with the proposed string_ostream, especially the 
>> fact that we're now converging raw_svector_ostream and 
>> raw_string_ostream uses around a single type.
>>
>> David, ping?
>>
>> Alp.
>>
>>
>>> Yaron
>>>
>>>
>>>
>>> 2014-06-27 4:20 GMT+03:00 Alp Toker <alp at nuanti.com 
>>> <mailto:alp at nuanti.com>>:
>>>
>>>
>>>     On 27/06/2014 02:50, David Blaikie wrote:
>>>
>>>         On Thu, Jun 26, 2014 at 4:05 PM, Alp Toker <alp at nuanti.com
>>>         <mailto:alp at nuanti.com>> wrote:
>>>
>>>             On 27/06/2014 01:42, David Blaikie wrote:
>>>
>>>                 On Thu, Jun 26, 2014 at 3:37 PM, Alp Toker
>>>                 <alp at nuanti.com <mailto:alp at nuanti.com>> wrote:
>>>
>>>                     On 27/06/2014 00:48, David Blaikie wrote:
>>>
>>>                         On Wed, Jun 25, 2014 at 9:53 PM, David Blaikie
>>>                         <dblaikie at gmail.com 
>>> <mailto:dblaikie at gmail.com>>
>>>                         wrote:
>>>
>>>                             On Wed, Jun 25, 2014 at 5:52 PM, Alp Toker
>>>                             <alp at nuanti.com <mailto:alp at nuanti.com>>
>>>                             wrote:
>>>
>>>                                 On 26/06/2014 03:16, Chandler Carruth
>>>                                 wrote:
>>>
>>>                                     On Thu, Jun 26, 2014 at 2:09 AM,
>>>                                     Alp Toker <alp at nuanti.com
>>> <mailto:alp at nuanti.com>
>>>                                     <mailto:alp at nuanti.com
>>> <mailto:alp at nuanti.com>>> wrote:
>>>
>>>                                            I've made that change, plus
>>>                                     a couple more updated uses, and
>>>                                     landed
>>>                                            this in r211749.
>>>
>>>                             FWIW, perhaps I wasn't clear - but my "I'd
>>>                             like to loop back around
>>>                             with Chandler to nail down the design
>>>                             choice" was "I'm not signing off
>>>                             on this patch without Chandler", so the
>>>                             commit may've been a bit
>>>                             premature.
>>>
>>>                         Given this confusion, I think appropriate to
>>>                         back this patch out until
>>>                         it's actually been signed off on.
>>>
>>>                         Sorry for my vague language/lack of clarity
>>>                         over incremental review
>>>                         compared to sign-off.
>>>
>>>
>>>                     I don't agree there was any problem with review
>>>                     here. The patch got its
>>>                     OKs
>>>                     and looks greats,
>>>
>>>                 You asked me if it was OK to land. I gave some
>>>                 feedback and said I'd
>>>                 like to wait. You interpreted that as sign-off. I did
>>>                 not intend it as
>>>                 such. We seem to have failed to communicate and I'm
>>>                 trying to clarify
>>>                 what I intended to communicate.
>>>
>>>                 I'm sorry for the miscommunication, but I really
>>>                 didn't mean to give
>>>                 you the impression that this was ready to commit.
>>>                 Please revert it.
>>>
>>>
>>>             There's obviously no way to revert this without losing the
>>>             bug fixes.
>>>
>>>         Right - they can all be included together in whatever form 
>>> this is
>>>         recommitted. No worries about that. (happens all the time here
>>>         - given
>>>         we prefer fine grained commits (ideally so that sometimes they
>>>         can be
>>>         reverted separately, or at least examined separately when 
>>> looking
>>>         through commit history, etc) so sometimes a set of related 
>>> patches
>>>         have to be reverted together. Not a problem)
>>>
>>>             I've backed them out together temporarily in r211814.
>>>
>>>         Thanks!
>>>
>>>             Could you state any concerns you have clearly?
>>>
>>>         Sure - sorry, just didn't want to confuse code review protocol
>>>         with
>>>         the specific merits or caveats of this patch. Not even a
>>>         question of
>>>         whether I had concerns or not, just that I hadn't actually
>>>         signed off
>>>         on the patch. A simple misunderstanding. I'll try to be more
>>>         clear in
>>>         the future.
>>>
>>>             A critical feature of the patch is guaranteed safety.
>>>             Where you see a
>>>             "string_ostream" type you get an assurance that the string
>>>             building
>>>             operations are valid.
>>>
>>>         The difference here would just be "Where you see string_ostream
>>>         default constructed, you get an assurance that the string 
>>> building
>>>         operations are valid", right? (it's just "if you use this
>>>         constructor
>>>         that takes an externally-provided buffer, buyer beware" - tihs
>>>         is true
>>>         of many types (eg: std::unique_ptr's raw ctor, compared to
>>>         something
>>>         like make_unique)).
>>>
>>>
>>>     What you're suggesting would be equivalent to having
>>>     std::unique_ptr default-construct the pointee type which I'm
>>>     pretty sure it doesn't do?
>>>
>>>     By having a separate type, we get to document the semantics
>>>     clearly, and search by type. We also get to hide the flush()
>>>     function directly via the type system.
>>>
>>>     In your scheme that would end up tucked away in the documentation:
>>>
>>>       // If you constructed with X, then behaviour is Y, otherwise
>>>     behaviour is Z.
>>>
>>>     And even after documenting, I'm sure people would still feel
>>>     compelled to flush() before str() occasionally. With the proposed
>>>     patch the compiler issues a helpful diagnostic instead.
>>>
>>>     (Note that std::stringstream *does* take a dual approach, and
>>>     suffers from consistency as a result -- 66 open question on '
>>>     stringstream' on StackOverflow at time of writing is high for a
>>>     single utility. This isn't really part of the standard C++ library
>>>     we want to emulate in LLVM -- we're much more conscientious about
>>>     our strings in general, and shouldn't drop the ball here.)
>>>
>>>
>>>             Those assurances apply even when the builder is passed by
>>>             reference. You can
>>>             see the clarity this adds for example in
>>>             LLVMCreateMessage() and a few other
>>>             places on ToT.
>>>
>>>         The cleanup in LLVMCreateMessage and callers seems almost 
>>> entirely
>>>         orthogonal to your new type. The same API change could be made
>>>         even
>>>         with the old code:
>>>
>>>            string s;
>>>            raw_string_ostream stream(s);
>>>            ...
>>>            return LLVMCreateMessage(stream);
>>>
>>>
>>>     Actually I don't think the form you suggest above particularly
>>>     safe. I'm glad you raised this as I chose the example 
>>> intentionally.
>>>
>>>     It's unconventional for a function to modify local variables in
>>>     parent stack frames that weren't explicitly named in the argument
>>>     list whichever way you bake it.
>>>
>>>     No other code really does this in the LLVM source tree, and
>>>     although you and I immediately grok what's going on as C++
>>>     frontend developers, it's actually subtle and easy to miss. The
>>>     programming errors on ToT show just how easy it is to lose the
>>>     plot with that pattern.
>>>
>>>     So my change here wasn't orthogonal -- in fact it's completely
>>>     aligned with and dependent upon the new safety guarantees
>>>     introduced in the patch. We absolutely need representation in the
>>>     type system so that we can enforce correctness.
>>>
>>>
>>>         LLVMCreateMessage is still safe, clear, and correct using
>>>         either API
>>>         because str(), even on raw_string_ostream as it is today, is 
>>> still
>>>         well defined to flush and then give access to the string.
>>>
>>>         Or, with a single type, it'd just be:
>>>
>>>            raw_string_ostream stream;
>>>
>>>         instead of
>>>
>>>            string_ostream stream;
>>>
>>>         Or is there some other clarity benefit you were referring to?
>>>
>>>         (with the same LLVMCreateMessage change you made (though I
>>>         personally
>>>         would have LLVMCreateMessage take a const std::string& - it
>>>         seems more
>>>         appropriate for a variety of callers, even if the current
>>>         callers all
>>>         happen to have raw_string_ostreams at the moment - indeed, 
>>> perhaps
>>>         some of them should just be using string concatenation anyway))
>>>
>>>             Importantly, there is no locality of construction so *the
>>>             type becomes the
>>>             documentation*.
>>>
>>>         Generally it's only important to know this distinction in the
>>>         scope
>>>         where the type is declared (because it's that scope where the
>>>         external
>>>         storage is provided) - once the stream is passed to some other
>>>         API,
>>>         the only access to the string is likely going to be through
>>>         raw_string_ostream's "str()" member, which does the right 
>>> thing.
>>>
>>>
>>>     See above.
>>>
>>>
>>>
>>>             If we were to lump storage into raw_string_ostream, this
>>>             strong safety
>>>             indicated by the type system would disappear along with
>>>             all the heap
>>>             savings.
>>>
>>>         Heap savings can be reacquired where /really/ necessary by 
>>> using
>>>         external storage, if necessary. By building in storage, that 
>>> would
>>>         become the exception rather than the rule, and something we
>>>         could more
>>>         closely scrutinize during code review.
>>>
>>>             (which simply aren't negligible compared to some of the
>>>             effort going into
>>>             bit packing elsewhere in the codebase).
>>>
>>>         Apples and oranges, I suspect (& wouldn't mind some evidence
>>>         to the
>>>         contrary before we proliferate more choices in data structure
>>>         type on
>>>         the codebase - we have a lot, many for good reason, but 
>>> there is a
>>>         cognitive/code review/etc load we pay by adding more). The bit
>>>         packing
>>>         sort of stuff happens in data structures that there are often
>>>         thousands upon thousands of in a single run, or at a single
>>>         point in
>>>         time so they have a significant impact on peak memory usage. I
>>>         doubt
>>>         we have more than a handful of raw_string_ostreams in existence
>>>         simultaneously.
>>>
>>>
>>>     I'd like numbers too, but from visual inspection we can already
>>>     see that this introduces some worthwhile savings.
>>>
>>>     string_ostream basically just encourages and simplifies our best
>>>     practice of using SmallString<> on the stack, and that's never
>>>     been in doubt.
>>>
>>>     In the patch you can see we've fixed truncation issues in code
>>>     that was previously storing in an external SmallString<>, not just
>>>     std::string. Chandler's proposed raw_string_ostream wouldn't be a
>>>     suitable replacement for those uses, missing about 50% of the
>>>     potential fixes where we want to use this, particularly in clang
>>>     and LLVM's IR facility. In my patch, we cover the lot with a
>>>     single type.
>>>
>>>     I saw you discussing how to avoid heap-allocation of a *singleton*
>>>     recently on this list so I know you care at least a little about
>>>     this matter :-)
>>>
>>>
>>>             If we mix semantics in the base, we'd basically just end
>>>             up a raw string
>>>             ostream that sometimes isn't raw,  even making the name of
>>>             the type a
>>>             little dubious.
>>>
>>>         (On naming: as Chandler mentioned, these are all raw streams -
>>>         even
>>>         the new ones you're proposing, the "raw"ness is intended to
>>>         distinguish them from the C++ Standard stream libraries that 
>>> have
>>>         streambufs and all that other stuff, not about whether they 
>>> have
>>>         internal storage or not)
>>>
>>>
>>>     Bike shed. I'm happy with what's in the patch, but still open to
>>>     better names.
>>>
>>>     Does this go some way to covering your remaining concerns?
>
> Ping

Re-ping.

This is starting to block other work..

I've gone ahead and fixed further issues currently existing on ToT as 
pointed out by Benjamin Kramer (small buffer sizes cause inefficiency 
with raw_svector_ostream).

There are around 30 bug fixes for obviously flawed debug/IR 
dump/performance and truncation issues in this patch and I've 
endeavoured to fix more issues currently existing on ToT but really we 
should be fixing those as subsequent commits instead of trying to lump 
it all into this patch, because it's becoming unwieldy to maintain.

Are we good to go?

Alp.


>
>
>>>
>>>     Alp.
>>>
>>>
>>>
>>>         - David
>>>
>>>
>>>     --     http://www.nuanti.com
>>>     the browser experts
>>>
>>>     _______________________________________________
>>>     llvm-commits mailing list
>>>     llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>>>     http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>
>>>
>>
>

-- 
http://www.nuanti.com
the browser experts




More information about the llvm-commits mailing list