Introduce StringBuilder utility around raw_string_ostream

Alp Toker alp at nuanti.com
Mon Jul 7 21:57:06 PDT 2014


On 08/07/2014 04:56, David Blaikie wrote:
> On Fri, Jul 4, 2014 at 2:37 PM, Alp Toker <alp at nuanti.com> wrote:
>> On 02/07/2014 18:16, Alp Toker wrote:
>>>
>>> On 01/07/2014 00:24, Alp Toker wrote:
>>>>
>>>> On 27/06/2014 10:06, Yaron Keren wrote:
>>>>> The LLVM ADTs has several common "themes" or  preferences
>>>>>
>>>>> 1) Efficient by default, efforts to avoid or minimize heap allocation -
>>>>> even at the cost of making debugging harder: pointers are not pointers but
>>>>> packed with bits or shared with other pointers, thus not the correct type,
>>>>> StringRef are not null terminated, etc. (We  have visualizers to help.)
>>>>> 2) Easy to use by default, compatible with the STL containers, simply
>>>>> work.
>>>>> 3) Open to "advanced" customization of 1) and 2) via template arguments
>>>>> if required.
>>>>>
>>>>> Wether these features should be introduced in additional
>>>>> raw_string_ostream functionality or a new string_ostream may be the result
>>>>> of what's technically possible such that it
>>>>>
>>>>> 1) Uses a non-heap container (SmallString) by default.
>>>>> 2) Is able to use other containers (std::string?) for clients who
>>>>> provide it.
>>>>> 3) Easy to use = standard string/ostream semantics where possible.
>>>>> 4) Hard to misuse = hard/impossible to get into invalid/unsynchronized
>>>>> state, does not require repeating code patterns.
>>>>>
>>>> Hi Yaron,
>>>>
>>>> Thanks for articulating that. It's a great observation that the best
>>>> approach for string handling will be a synthesis of ADT's common themes.
>>>>
>>>> As far as I can tell it's technically not possible to cover a1,2,3 or
>>>> b1,3,4 by extending raw_string_ostream, while I'm pretty sure we've got it
>>>> all covered with the proposed string_ostream, especially the fact that we're
>>>> now converging raw_svector_ostream and raw_string_ostream uses around a
>>>> single type.
>>>>
>>>> David, ping?
>>>>
>>>> Alp.
>>>>
>>>>
>>>>> Yaron
>>>>>
>>>>>
>>>>>
>>>>> 2014-06-27 4:20 GMT+03:00 Alp Toker <alp at nuanti.com
>>>>> <mailto:alp at nuanti.com>>:
>>>>>
>>>>>
>>>>>      On 27/06/2014 02:50, David Blaikie wrote:
>>>>>
>>>>>          On Thu, Jun 26, 2014 at 4:05 PM, Alp Toker <alp at nuanti.com
>>>>>          <mailto:alp at nuanti.com>> wrote:
>>>>>
>>>>>              On 27/06/2014 01:42, David Blaikie wrote:
>>>>>
>>>>>                  On Thu, Jun 26, 2014 at 3:37 PM, Alp Toker
>>>>>                  <alp at nuanti.com <mailto:alp at nuanti.com>> wrote:
>>>>>
>>>>>                      On 27/06/2014 00:48, David Blaikie wrote:
>>>>>
>>>>>                          On Wed, Jun 25, 2014 at 9:53 PM, David Blaikie
>>>>>                          <dblaikie at gmail.com <mailto:dblaikie at gmail.com>>
>>>>>                          wrote:
>>>>>
>>>>>                              On Wed, Jun 25, 2014 at 5:52 PM, Alp Toker
>>>>>                              <alp at nuanti.com <mailto:alp at nuanti.com>>
>>>>>                              wrote:
>>>>>
>>>>>                                  On 26/06/2014 03:16, Chandler Carruth
>>>>>                                  wrote:
>>>>>
>>>>>                                      On Thu, Jun 26, 2014 at 2:09 AM,
>>>>>                                      Alp Toker <alp at nuanti.com
>>>>> <mailto:alp at nuanti.com>
>>>>>                                      <mailto:alp at nuanti.com
>>>>> <mailto:alp at nuanti.com>>> wrote:
>>>>>
>>>>>                                             I've made that change, plus
>>>>>                                      a couple more updated uses, and
>>>>>                                      landed
>>>>>                                             this in r211749.
>>>>>
>>>>>                              FWIW, perhaps I wasn't clear - but my "I'd
>>>>>                              like to loop back around
>>>>>                              with Chandler to nail down the design
>>>>>                              choice" was "I'm not signing off
>>>>>                              on this patch without Chandler", so the
>>>>>                              commit may've been a bit
>>>>>                              premature.
>>>>>
>>>>>                          Given this confusion, I think appropriate to
>>>>>                          back this patch out until
>>>>>                          it's actually been signed off on.
>>>>>
>>>>>                          Sorry for my vague language/lack of clarity
>>>>>                          over incremental review
>>>>>                          compared to sign-off.
>>>>>
>>>>>
>>>>>                      I don't agree there was any problem with review
>>>>>                      here. The patch got its
>>>>>                      OKs
>>>>>                      and looks greats,
>>>>>
>>>>>                  You asked me if it was OK to land. I gave some
>>>>>                  feedback and said I'd
>>>>>                  like to wait. You interpreted that as sign-off. I did
>>>>>                  not intend it as
>>>>>                  such. We seem to have failed to communicate and I'm
>>>>>                  trying to clarify
>>>>>                  what I intended to communicate.
>>>>>
>>>>>                  I'm sorry for the miscommunication, but I really
>>>>>                  didn't mean to give
>>>>>                  you the impression that this was ready to commit.
>>>>>                  Please revert it.
>>>>>
>>>>>
>>>>>              There's obviously no way to revert this without losing the
>>>>>              bug fixes.
>>>>>
>>>>>          Right - they can all be included together in whatever form this
>>>>> is
>>>>>          recommitted. No worries about that. (happens all the time here
>>>>>          - given
>>>>>          we prefer fine grained commits (ideally so that sometimes they
>>>>>          can be
>>>>>          reverted separately, or at least examined separately when
>>>>> looking
>>>>>          through commit history, etc) so sometimes a set of related
>>>>> patches
>>>>>          have to be reverted together. Not a problem)
>>>>>
>>>>>              I've backed them out together temporarily in r211814.
>>>>>
>>>>>          Thanks!
>>>>>
>>>>>              Could you state any concerns you have clearly?
>>>>>
>>>>>          Sure - sorry, just didn't want to confuse code review protocol
>>>>>          with
>>>>>          the specific merits or caveats of this patch. Not even a
>>>>>          question of
>>>>>          whether I had concerns or not, just that I hadn't actually
>>>>>          signed off
>>>>>          on the patch. A simple misunderstanding. I'll try to be more
>>>>>          clear in
>>>>>          the future.
>>>>>
>>>>>              A critical feature of the patch is guaranteed safety.
>>>>>              Where you see a
>>>>>              "string_ostream" type you get an assurance that the string
>>>>>              building
>>>>>              operations are valid.
>>>>>
>>>>>          The difference here would just be "Where you see string_ostream
>>>>>          default constructed, you get an assurance that the string
>>>>> building
>>>>>          operations are valid", right? (it's just "if you use this
>>>>>          constructor
>>>>>          that takes an externally-provided buffer, buyer beware" - tihs
>>>>>          is true
>>>>>          of many types (eg: std::unique_ptr's raw ctor, compared to
>>>>>          something
>>>>>          like make_unique)).
>>>>>
>>>>>
>>>>>      What you're suggesting would be equivalent to having
>>>>>      std::unique_ptr default-construct the pointee type which I'm
>>>>>      pretty sure it doesn't do?
>>>>>
>>>>>      By having a separate type, we get to document the semantics
>>>>>      clearly, and search by type. We also get to hide the flush()
>>>>>      function directly via the type system.
>>>>>
>>>>>      In your scheme that would end up tucked away in the documentation:
>>>>>
>>>>>        // If you constructed with X, then behaviour is Y, otherwise
>>>>>      behaviour is Z.
>>>>>
>>>>>      And even after documenting, I'm sure people would still feel
>>>>>      compelled to flush() before str() occasionally. With the proposed
>>>>>      patch the compiler issues a helpful diagnostic instead.
>>>>>
>>>>>      (Note that std::stringstream *does* take a dual approach, and
>>>>>      suffers from consistency as a result -- 66 open question on '
>>>>>      stringstream' on StackOverflow at time of writing is high for a
>>>>>      single utility. This isn't really part of the standard C++ library
>>>>>      we want to emulate in LLVM -- we're much more conscientious about
>>>>>      our strings in general, and shouldn't drop the ball here.)
>>>>>
>>>>>
>>>>>              Those assurances apply even when the builder is passed by
>>>>>              reference. You can
>>>>>              see the clarity this adds for example in
>>>>>              LLVMCreateMessage() and a few other
>>>>>              places on ToT.
>>>>>
>>>>>          The cleanup in LLVMCreateMessage and callers seems almost
>>>>> entirely
>>>>>          orthogonal to your new type. The same API change could be made
>>>>>          even
>>>>>          with the old code:
>>>>>
>>>>>             string s;
>>>>>             raw_string_ostream stream(s);
>>>>>             ...
>>>>>             return LLVMCreateMessage(stream);
>>>>>
>>>>>
>>>>>      Actually I don't think the form you suggest above particularly
>>>>>      safe. I'm glad you raised this as I chose the example intentionally.
>>>>>
>>>>>      It's unconventional for a function to modify local variables in
>>>>>      parent stack frames that weren't explicitly named in the argument
>>>>>      list whichever way you bake it.
>>>>>
>>>>>      No other code really does this in the LLVM source tree, and
>>>>>      although you and I immediately grok what's going on as C++
>>>>>      frontend developers, it's actually subtle and easy to miss. The
>>>>>      programming errors on ToT show just how easy it is to lose the
>>>>>      plot with that pattern.
>>>>>
>>>>>      So my change here wasn't orthogonal -- in fact it's completely
>>>>>      aligned with and dependent upon the new safety guarantees
>>>>>      introduced in the patch. We absolutely need representation in the
>>>>>      type system so that we can enforce correctness.
>>>>>
>>>>>
>>>>>          LLVMCreateMessage is still safe, clear, and correct using
>>>>>          either API
>>>>>          because str(), even on raw_string_ostream as it is today, is
>>>>> still
>>>>>          well defined to flush and then give access to the string.
>>>>>
>>>>>          Or, with a single type, it'd just be:
>>>>>
>>>>>             raw_string_ostream stream;
>>>>>
>>>>>          instead of
>>>>>
>>>>>             string_ostream stream;
>>>>>
>>>>>          Or is there some other clarity benefit you were referring to?
>>>>>
>>>>>          (with the same LLVMCreateMessage change you made (though I
>>>>>          personally
>>>>>          would have LLVMCreateMessage take a const std::string& - it
>>>>>          seems more
>>>>>          appropriate for a variety of callers, even if the current
>>>>>          callers all
>>>>>          happen to have raw_string_ostreams at the moment - indeed,
>>>>> perhaps
>>>>>          some of them should just be using string concatenation anyway))
>>>>>
>>>>>              Importantly, there is no locality of construction so *the
>>>>>              type becomes the
>>>>>              documentation*.
>>>>>
>>>>>          Generally it's only important to know this distinction in the
>>>>>          scope
>>>>>          where the type is declared (because it's that scope where the
>>>>>          external
>>>>>          storage is provided) - once the stream is passed to some other
>>>>>          API,
>>>>>          the only access to the string is likely going to be through
>>>>>          raw_string_ostream's "str()" member, which does the right thing.
>>>>>
>>>>>
>>>>>      See above.
>>>>>
>>>>>
>>>>>
>>>>>              If we were to lump storage into raw_string_ostream, this
>>>>>              strong safety
>>>>>              indicated by the type system would disappear along with
>>>>>              all the heap
>>>>>              savings.
>>>>>
>>>>>          Heap savings can be reacquired where /really/ necessary by using
>>>>>          external storage, if necessary. By building in storage, that
>>>>> would
>>>>>          become the exception rather than the rule, and something we
>>>>>          could more
>>>>>          closely scrutinize during code review.
>>>>>
>>>>>              (which simply aren't negligible compared to some of the
>>>>>              effort going into
>>>>>              bit packing elsewhere in the codebase).
>>>>>
>>>>>          Apples and oranges, I suspect (& wouldn't mind some evidence
>>>>>          to the
>>>>>          contrary before we proliferate more choices in data structure
>>>>>          type on
>>>>>          the codebase - we have a lot, many for good reason, but there is
>>>>> a
>>>>>          cognitive/code review/etc load we pay by adding more). The bit
>>>>>          packing
>>>>>          sort of stuff happens in data structures that there are often
>>>>>          thousands upon thousands of in a single run, or at a single
>>>>>          point in
>>>>>          time so they have a significant impact on peak memory usage. I
>>>>>          doubt
>>>>>          we have more than a handful of raw_string_ostreams in existence
>>>>>          simultaneously.
>>>>>
>>>>>
>>>>>      I'd like numbers too, but from visual inspection we can already
>>>>>      see that this introduces some worthwhile savings.
>>>>>
>>>>>      string_ostream basically just encourages and simplifies our best
>>>>>      practice of using SmallString<> on the stack, and that's never
>>>>>      been in doubt.
>>>>>
>>>>>      In the patch you can see we've fixed truncation issues in code
>>>>>      that was previously storing in an external SmallString<>, not just
>>>>>      std::string. Chandler's proposed raw_string_ostream wouldn't be a
>>>>>      suitable replacement for those uses, missing about 50% of the
>>>>>      potential fixes where we want to use this, particularly in clang
>>>>>      and LLVM's IR facility. In my patch, we cover the lot with a
>>>>>      single type.
>>>>>
>>>>>      I saw you discussing how to avoid heap-allocation of a *singleton*
>>>>>      recently on this list so I know you care at least a little about
>>>>>      this matter :-)
>>>>>
>>>>>
>>>>>              If we mix semantics in the base, we'd basically just end
>>>>>              up a raw string
>>>>>              ostream that sometimes isn't raw,  even making the name of
>>>>>              the type a
>>>>>              little dubious.
>>>>>
>>>>>          (On naming: as Chandler mentioned, these are all raw streams -
>>>>>          even
>>>>>          the new ones you're proposing, the "raw"ness is intended to
>>>>>          distinguish them from the C++ Standard stream libraries that
>>>>> have
>>>>>          streambufs and all that other stuff, not about whether they have
>>>>>          internal storage or not)
>>>>>
>>>>>
>>>>>      Bike shed. I'm happy with what's in the patch, but still open to
>>>>>      better names.
>>>>>
>>>>>      Does this go some way to covering your remaining concerns?
>>>
>>> Ping
>>
>> Re-ping.
> Sorry this is taking some time to get to, been trying to find the
> mental bandwidth and fortitude to address the above discussion.
>
>> This is starting to block other work..
> I don't quite see how. This may be a nice cleanup, but it doesn't seem
> to fundamentally make anything possible that wasn't before,

The goal is to fix existing code and provide safe string building 
facilities for newly written code.

This isn't about doing something fundamentally new -- it's about getting 
what we have already to work correctly.


> even when
> it comes to cleanup it has a fairly limited impact of changing two
> lines to one in a few places.
>
> Which is the crux of my objection/concern here - it doesn't seem
> worthwhile to add new types for that particular win.

Having spent a week dredging through, and fixing, all variety of string 
building code in LLVM I'm pretty keyed in on what's worthwhile, thanks.

The incidence rate of incorrect string output in code not covered by 
tests is *way* too high and this patch introduces the necessary facility 
to resolve that.

We all make these mistakes, but without sugar coating it, the names that 
come up in SVN blame are notably correlated to the ones proclaiming 
loudly that there isn't a pressing need for a safe primitive.


> I appreciate the benefit of safe types that you can see, at a glance,
> are safe (hence my desire to move towards std::make_shared, for
> example - and my desire to remove a pimpl you mentioned (not to save a
> heap allocation as you characterized - it doesn't actually phase me
> much, but the increased cognitive load/code complexity did, and adding
> new types has the possibility of adding more of that)).

Complexity? The new class is 9 lines of code.

It *reduces* cognitive load at point of use by turning tricky runtime 
output truncation issues into compile-time errors.

It also means you don't have to choose between raw_svector_ostream and 
raw_string_ostream any more -- you only need to know this one type to 
build strings efficiently in nearly any case.

Just read through the patch to see how this unifies the ostream types, 
reducing the number of types you need to know about to write safe and 
efficient LLVM code. I feel strongly that this not only addresses your 
primary concern about more types, but even goes ways to minimize the 
decision space for building strings.

> It seems we already have two types (raw_string_ostream and
> raw_svector_ostream) and then we'll have two more (string_ostream and
> svector_ostream - and I agree with Chandler here, the "raw" prefix
> cannot be dropped (it's there to separate these streams from the
> standard streams, since they're not cross compatible) nor used as a
> distinguisher (it doesn't tell someone reading the code what's
> different between these two kinds of types)).

Choosing a new name is exactly why I posted the patch to the list. I'll 
have another go and post shortly..

> I suspect the right thing might be a raw_string_ostream<T =
> std::string>, we could typedef the svector case, or just write it out
> explicitly depending on the number of instances (I suspect writing it
> out, possibly using decltype, would be helpful - since I assume the
> existing uses of raw_svector_ostream were due to existing small
> vectors that had certain sizes).
>
> And I'm still somewhat inclined towards optional ownership defined at
> construction time - I don't really see a major benefit to type safety
> far from the construction site.
>
> If non-owning streams are as high-risk as they appear, we could
> separate out their construction into a factory function so that it's
> more explicit:
>
> raw_string_ostream<> s; //owning
>
> std::string buf;
> raw_string_ostream<> non_owning =
> make_non_owning_raw_string_ostream(buf); //non-owning

decltype, make_non_owning_raw_string_ostream, raw_string_ostream<>, then 
hiding it all with auto?

That's not only more complex than anything in this patch, it doesn't 
even begin to provide the simple usage and correctness we've achieved here.

This thread has come down very much to coding style, and I personally 
don't think your suggestions are effective in solving this, less so that 
they'd be suited for the LLVM code base.

You're entitled to your opinion on coding style but nothing you've said 
is particularly aligned with the official LLVM style guide, or grounds 
for reverting a commit.

Where views differ on some minutiae like this, balance needs to tip one 
way or the other and I believe that should be in the direction of the 
one who's put in the time to write the code and ultimately who'll be 
required to maintaining it.

Alp.



>
> Though, at that point it could be a different type and just always
> written as "auto non_owning" and no one would probably care. As long
> as none of these types appear on any API interface. If we need a
> "str()" function that provides the current buffer on the interface,
> then I'd like that to be in /one/ place, not 4. It'll be really
> unfortunate if we need that, have functions that take a
> raw_owning_string_ostream because they want to access the string
> buffer, but then that function can't be passed a
> raw_non_owning_string_ostream, etc... I don't think that's a good API
> design.
>
>> I've gone ahead and fixed further issues currently existing on ToT as
>> pointed out by Benjamin Kramer (small buffer sizes cause inefficiency with
>> raw_svector_ostream).
>>
>> There are around 30 bug fixes for obviously flawed debug/IR dump/performance
>> and truncation issues in this patch and I've endeavoured to fix more issues
>> currently existing on ToT but really we should be fixing those as subsequent
>> commits instead of trying to lump it all into this patch, because it's
>> becoming unwieldy to maintain.
> Those bugs can be (fairly easily) fixed without this patch, so far as
> I understand it. So you can hold them on this patch or not, as it
> suits.

That doesn't scale. I'm obviously not going to backport a handful of fixes
>
>> Are we good to go?
>>
>> Alp.
>>
>>
>>
>>>
>>>>>      Alp.
>>>>>
>>>>>
>>>>>
>>>>>          - David
>>>>>
>>>>>
>>>>>      --     http://www.nuanti.com
>>>>>      the browser experts
>>>>>
>>>>>      _______________________________________________
>>>>>      llvm-commits mailing list
>>>>>      llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>>>>>      http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>
>>>>>
>> --
>> http://www.nuanti.com
>> the browser experts
>>

-- 
http://www.nuanti.com
the browser experts




More information about the llvm-commits mailing list