[LLVMdev] SmallString + raw_svector_ostream combination should be more efficient

Sat Jul 11 10:48:42 PDT 2015

This makes write virtual. What is the impact of that on the other streamers?

On 22 May 2015 at 10:17, Yaron Keren <yaron.keren at gmail.com> wrote:
> Here's a performance testcase for the raw_svector_ostream patch. On my
> WIndows x64 machine it runs in 1010ms with the current code and in 440ms
> with the patch applies. Is this OK to commit?
>
>
> 2015-05-02 21:31 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
>>
>> Following a hint from Duncan in http://llvm.org/pr23395, here is a revised
>> patch. Rather then introduce the raw_char_ostream adapter, this version
>> improves raw_svector_ostream.
>>
>> raw_svector_ostream is now a very lightweight adapter, running without a
>> buffer and fowarding almost anything to the SmallString. This solves both
>> the performance issue and the information duplication issue. The SmallString
>> is always up-to-date and can be used at any time. flush() does nothing and
>> is not required. resync() was kept in this patch as a no-op and will be
>> removed with its users in a follow-up patch. I'll also try to remove the
>> superfluous flush calls() along the way.
>>
>>
>>
>> 2015-05-02 7:41 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
>>>
>>> +update diff
>>>
>>>
>>> 2015-05-02 7:38 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
>>>>
>>>> I outlined (is that the right word?) raw_char_ostream::grow,
>>>> raw_char_ostream::write (both) into raw_ostream.cpp with less than 10%
>>>> difference in performance.
>>>>
>>>> Profiling reveals that the real culprit is the code line
>>>>
>>>>  OS.reserve(OS.size() + 64);
>>>>
>>>> called from raw_svector_ostream::write_impl called from
>>>> raw_svector_ostream::~raw_svector_ostream.
>>>> The issue is already documented:
>>>>
>>>> raw_svector_ostream::~raw_svector_ostream() {
>>>>   // FIXME: Prevent resizing during this flush().
>>>>   flush();
>>>> }
>>>>
>>>> And a solution is provided in raw_svector_ostream::init():
>>>>
>>>>   // Set up the initial external buffer. We make sure that the buffer
>>>> has at
>>>>   // least 128 bytes free; raw_ostream itself only requires 64, but we
>>>> want to
>>>>   // make sure that we don't grow the buffer unnecessarily on
>>>> destruction (when
>>>>   // the data is flushed). See the FIXME below.
>>>>   OS.reserve(OS.size() + 128);
>>>>
>>>> This solution may be worse than the problem. In total:
>>>>
>>>> * If the SmallString was less than 128 bytes init() will always(!) heap
>>>> allocate.
>>>> * If it's more or equal to 128 bytes but upon destruction less than 64
>>>> bytes are left unused it will heap allocate for no reason.
>>>>
>>>> A careful programmer who did size the SmallString to match its use +
>>>> small reserve will find either of these heap allocations surprising,
>>>> negating the SmallString advantage and then paying memcpy cost.
>>>>
>>>> I filed http://llvm.org/pr23395 about this. To temporarly avoid this
>>>> issue, in addition to the outline methods change I commented the OS.reserve
>>>> line in flush(). (This change of course requires further review.)
>>>>
>>>> With reserve being out of the way, the gap now is smaller but still
>>>> significant: raw_char_ostream is about 20% faster than raw_svector_ostream
>>>> in Release=optimized configuration.
>>>>
>>>>
>>>> 2015-05-02 4:39 GMT+03:00 Sean Silva <chisophugis at gmail.com>:
>>>>>
>>>>>
>>>>> Could you dig into why:
>>>>> +  raw_ostream &write(unsigned char C) override {
>>>>> +    grow(1);
>>>>> +    *OutBufCur++ = C;
>>>>> +    return *this;
>>>>> +  }
>>>>>
>>>>> Is 3 times as fast as raw_svector_ostream? I don't see a good reason
>>>>> why that should be any faster than:
>>>>>
>>>>>   raw_ostream &operator<<(char C) {
>>>>>     if (OutBufCur >= OutBufEnd)
>>>>>       return write(C);
>>>>>     *OutBufCur++ = C;
>>>>>     return *this;
>>>>>   }
>>>>>
>>>>>
>>>>> You might just be seeing the difference between having write inlined
>>>>> vs. non-inlined, in which case your patch might be a complicated way to pull
>>>>> the likely cases of some raw_ostream methods into the header.
>>>>>
>>>>>
>>>>> -- Sean Silva
>>>>
>>>>
>>>> On Thu, Apr 30, 2015 at 8:02 PM, Yaron Keren <yaron.keren at gmail.com>
>>>> wrote:
>>>>>
>>>>> Yes, we should do without virtual flush. The problem was
>>>>> raw_char_ostream should not be flushed - it's always current - so I
>>>>> overloaded flush to a no-op. A cleaner solution is attached, adding a
>>>>> DirectBuffer mode to the raw_ostream.
>>>>>
>>>>> Also added a simple performance comparison project between
>>>>> raw_char_ostream and raw_svector_ostream. On Window 7 x64 machine,
>>>>> raw_char_ostream was three times faster than raw_svector_ostream when the
>>>>> provided buffer size is large enough and two times as fast with one dynamic
>>>>> allocation resize.
>>>>>
>>>>>
>>>>> 2015-04-30 22:48 GMT+03:00 Rafael Espíndola
>>>>> <rafael.espindola at gmail.com>:
>>>>>>
>>>>>> I don't think we should make flush virtual. Why do you need to do it?
>>>>>> Can't you set up the base class to write to use the tail of the memory
>>>>>> region as the buffer?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 24 April 2015 at 06:46, Yaron Keren <yaron.keren at gmail.com> wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > Is this what you're thinking about?
>>>>>> > The code is not tested yet, I'd like to know if the overall
>>>>>> > direction and
>>>>>> > design are correct.
>>>>>> >
>>>>>> > Yaron
>>>>>> >
>>>>>> >
>>>>>> > 2015-04-21 0:10 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
>>>>>> >>
>>>>>> >> Sean, thanks for reminding this, Alp did commit a class derived
>>>>>> >> from
>>>>>> >> raw_svector_ostream conatining an internal SmallString he called
>>>>>> >> small_string_ostream. The commit was reverted after a day due to a
>>>>>> >> disagreement about the commit approval and apparently abandoned.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140623/223393.html
>>>>>> >>
>>>>>> >>
>>>>>> >> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140623/223557.html
>>>>>> >>
>>>>>> >>
>>>>>> >> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140630/223986.html
>>>>>> >>
>>>>>> >> The class Alp comitted did solve the possible mismatch between the
>>>>>> >> SmallString and the stream by making the SmallString private to the
>>>>>> >> class.
>>>>>> >> This however did not treat the root problem, the duplication of the
>>>>>> >> information about the buffer between SmallString and the stream.
>>>>>> >>
>>>>>> >> I can make performance measurements but even if the performance
>>>>>> >> difference
>>>>>> >> is neglible (and looking at all the code paths and conditionals of
>>>>>> >> non-inlined functions at raw_ostream.cpp that's hard to believe),
>>>>>> >> the
>>>>>> >> current SmallString-raw_svector_ostream is simply wrong.
>>>>>> >>
>>>>>> >> Personally, after the previous "Alp" discussion I found and fixed
>>>>>> >> several
>>>>>> >> such issues in my out-of-tree code only to make new similar new
>>>>>> >> error
>>>>>> >> earlier this year, which I caught only months after, when Rafael
>>>>>> >> committed
>>>>>> >> the pwrite which reminded me the issue. Due ot the information
>>>>>> >> duplication
>>>>>> >> Rafael commit also had a bug and Mehdi reports similar experience.
>>>>>> >> Back then
>>>>>> >> Alp reported similar problems he found in LLVM code which were
>>>>>> >> hopefully
>>>>>> >> fixed otherwise.
>>>>>> >>
>>>>>> >> In addition to the information duplication, the design is quite
>>>>>> >> confusing
>>>>>> >> to use
>>>>>> >>
>>>>>> >> - Should one use the stream.str() function or the SmallString
>>>>>> >> itself?
>>>>>> >> - flush or str?
>>>>>> >> - How do you properly clear the combination for reuse (that was my
>>>>>> >> mistake, I forgot to resync after OS.clear())?
>>>>>> >>
>>>>>> >> It's safe to say this design pattern is very easy to get wrong one
>>>>>> >> way or
>>>>>> >> another, we got burned by it multiple times and it should be
>>>>>> >> replaced.
>>>>>> >>
>>>>>> >> Alp suggested a derived class containing its own SmallString.
>>>>>> >> That's a
>>>>>> >> simple and effective approach to avoid the bugs mentioned but does
>>>>>> >> not fix
>>>>>> >> the memory or runtime issues. Small as they may be, why have them
>>>>>> >> at a
>>>>>> >> fundemental data structure?
>>>>>> >>
>>>>>> >> I was thinking about going further avoiding all duplication with a
>>>>>> >> templated class derived with its own internal buffer storage.
>>>>>> >>
>>>>>> >> Rafael suggested a more modular approach, a derived adpater class
>>>>>> >> adapter
>>>>>> >> to a *simple* buffer (or nullptr for fully-dynamic operation) which
>>>>>> >> also
>>>>>> >> won't duplicate any information the buffer is dumb and has no
>>>>>> >> state.
>>>>>> >>
>>>>>> >> That solution sounds simple, efficient and safe to use. The
>>>>>> >> implementation
>>>>>> >> would be actually simpler then raw_svector_ostream since all the
>>>>>> >> coordination logic is not required anymore.
>>>>>> >>
>>>>>> >>
>>>>>> >> 2015-04-20 22:17 GMT+03:00 Sean Silva <chisophugis at gmail.com>:
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Sun, Apr 19, 2015 at 7:40 AM, Yaron Keren
>>>>>> >>> <yaron.keren at gmail.com>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>> A very common code pattern in LLVM is
>>>>>> >>>>
>>>>>> >>>>  SmallString<128> S;
>>>>>> >>>>  raw_svector_ostream OS(S);
>>>>>> >>>>  OS<< ...
>>>>>> >>>>  Use OS.str()
>>>>>> >>>>
>>>>>> >>>> While raw_svector_ostream is smart to share the text buffer
>>>>>> >>>> itself, it's
>>>>>> >>>> inefficient keeping two sets of pointers to the same buffer:
>>>>>> >>>>
>>>>>> >>>>  In SmallString: void *BeginX, *EndX, *CapacityX
>>>>>> >>>>  In raw_ostream: char *OutBufStart, *OutBufEnd, *OutBufCur
>>>>>> >>>>
>>>>>> >>>> Moreover, at runtime the two sets of pointers need to be
>>>>>> >>>> coordinated
>>>>>> >>>> between the SmallString and raw_svector_ostream using
>>>>>> >>>> raw_svector_ostream::init, raw_svector_ostream::pwrite,
>>>>>> >>>> raw_svector_ostream::resync and raw_svector_ostream::write_impl.
>>>>>> >>>> All these functions have non-inlined implementations in
>>>>>> >>>> raw_ostream.cpp.
>>>>>> >>>>
>>>>>> >>>> Finally, this may cause subtle bugs if S is modified without
>>>>>> >>>> calling
>>>>>> >>>> OS::resync(). This is too easy to do by mistake.
>>>>>> >>>>
>>>>>> >>>> In this frequent case usage the client does not really care about
>>>>>> >>>> S
>>>>>> >>>> being a SmallString with its many useful string helper function.
>>>>>> >>>> It's just
>>>>>> >>>> boilerplate code for raw_svector_ostream. But it does cost three
>>>>>> >>>> extra
>>>>>> >>>> pointers, some runtime performance and possible bugs.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> I agree the bugs are real (Alp proposed something a while back
>>>>>> >>> regarding
>>>>>> >>> this?), but you will need to provide measurements to justify the
>>>>>> >>> cost in
>>>>>> >>> runtime performance. One technique I have used in the past to
>>>>>> >>> measure these
>>>>>> >>> sorts of things I call "stuffing": take the operation that you
>>>>>> >>> want to
>>>>>> >>> measure, then essentially change the logic so that you pay the
>>>>>> >>> cost 2 times,
>>>>>> >>> 3 times, etc. You can then look at the trend in performance as N
>>>>>> >>> varies and
>>>>>> >>> extrapolate back to the case where N = 0 (i.e. you don't pay the
>>>>>> >>> cost).
>>>>>> >>>
>>>>>> >>> For example, in one situation where I used this method it was to
>>>>>> >>> measure
>>>>>> >>> the cost of stat'ing files (sys::fs::status) across a holistic
>>>>>> >>> build, using
>>>>>> >>> only "time" on the command line (it was on Windows and I didn't
>>>>>> >>> have any
>>>>>> >>> tools like DTrace available that can directly measure this). In
>>>>>> >>> order to do
>>>>>> >>> this, I changed sys::fs::status to call stat N times instead of 1,
>>>>>> >>> and
>>>>>> >>> measured with N=1 N=2 N=3 etc. The result was that the difference
>>>>>> >>> between
>>>>>> >>> the N and N+1 versions was about 1-2% across N=1..10 (or whatever
>>>>>> >>> I
>>>>>> >>> measured). In order to negate caching and other confounding
>>>>>> >>> effects, it is
>>>>>> >>> important to try different distributions of stats; e.g. the extra
>>>>>> >>> stats are
>>>>>> >>> on the same file as the "real" stat vs. the extra stats are on
>>>>>> >>> nonexistent
>>>>>> >>> files in the same directory as the "real" file vs. parent
>>>>>> >>> directories of the
>>>>>> >>> "real" file; if these match up fairly well (they did), then you
>>>>>> >>> have some
>>>>>> >>> indication that the "stuffing" is measuring what you want to
>>>>>> >>> measure.
>>>>>> >>>
>>>>>> >>> So e.g. if you think the cost of 3 extra pointers is significant,
>>>>>> >>> then
>>>>>> >>> "stuff" the struct with 3, 6, 9, ... extra pointers and measure
>>>>>> >>> the
>>>>>> >>> difference in performance (e.g. measure the time building a real
>>>>>> >>> project).
>>>>>> >>>
>>>>>> >>> -- Sean Silva
>>>>>> >>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> To solve all three issues, would it make sense to have
>>>>>> >>>> raw_ostream-derived container with a its own SmallString like
>>>>>> >>>> templated-size
>>>>>> >>>> built-in buffer?
>>>>>> >>>>
>>>>>> >>>
>>>>>> >
>>>>>
>>>>>
>>>>
>