[llvm-dev] SmallString + raw_svector_ostream combination should be more efficient
Yaron Keren via llvm-dev
llvm-dev at lists.llvm.org
Thu Aug 13 01:17:39 PDT 2015
r244870 (with the change you requested), thanks!
I initially placed the virtual function in header since they were
one-liners.
The coding standards say that an anchor() function should be supplied,
which was indeed missing. Other than required anchor function, why should
the virtual functions go in the .cpp?
Should I move the empty raw_svector_ostream destructor to the .cpp file too
as well?
2015-08-13 2:59 GMT+03:00 Rafael Espíndola <rafael.espindola at gmail.com>:
> Given that it is faster, I am OK with it.
>
> I get 0.774371631s with master and 0.649787169s with your patch.
>
> Why move all virtual functions to the header? LGTM with them still in
> the .cpp file.
>
> On 12 August 2015 at 01:53, Yaron Keren <yaron.keren at gmail.com> wrote:
> > +llvm-dev at lists.llvm.org
> >
> > The impact should be small as all the other streamers usually write
> directly
> > to the memory buffer and only when out of buffer they call write().
> >
> > OTOH, raw_svector_ostream (without a buffer) goes though write for every
> > character or block it writes. It can work without virtual write() by
> > overriding the existing virtual write_impl() but this is a slower code
> path
> > for raw_svector_ostream.
> >
> > Even though the attached raw_svector_ostream patch without virtual
> write()
> > is still significantly faster then the current raw_svector_ostream since
> it
> > avoids the double buffering. With this patch, on my system the example
> runs
> > at 750ms compared with about 1000ms for the current raw_svector_ostream.
> >
> > Would you prefer to continue review on Phabricator?
> >
> > 2015-07-11 20:48 GMT+03:00 Rafael Espíndola <rafael.espindola at gmail.com
> >:
> >>
> >> This makes write virtual. What is the impact of that on the other
> >> streamers?
> >>
> >>
> >>
> >> On 22 May 2015 at 10:17, Yaron Keren <yaron.keren at gmail.com> wrote:
> >> > Here's a performance testcase for the raw_svector_ostream patch. On my
> >> > WIndows x64 machine it runs in 1010ms with the current code and in
> 440ms
> >> > with the patch applies. Is this OK to commit?
> >> >
> >> >
> >> > 2015-05-02 21:31 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
> >> >>
> >> >> Following a hint from Duncan in http://llvm.org/pr23395, here is a
> >> >> revised
> >> >> patch. Rather then introduce the raw_char_ostream adapter, this
> version
> >> >> improves raw_svector_ostream.
> >> >>
> >> >> raw_svector_ostream is now a very lightweight adapter, running
> without
> >> >> a
> >> >> buffer and fowarding almost anything to the SmallString. This solves
> >> >> both
> >> >> the performance issue and the information duplication issue. The
> >> >> SmallString
> >> >> is always up-to-date and can be used at any time. flush() does
> nothing
> >> >> and
> >> >> is not required. resync() was kept in this patch as a no-op and will
> be
> >> >> removed with its users in a follow-up patch. I'll also try to remove
> >> >> the
> >> >> superfluous flush calls() along the way.
> >> >>
> >> >>
> >> >>
> >> >> 2015-05-02 7:41 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
> >> >>>
> >> >>> +update diff
> >> >>>
> >> >>>
> >> >>> 2015-05-02 7:38 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
> >> >>>>
> >> >>>> I outlined (is that the right word?) raw_char_ostream::grow,
> >> >>>> raw_char_ostream::write (both) into raw_ostream.cpp with less than
> >> >>>> 10%
> >> >>>> difference in performance.
> >> >>>>
> >> >>>> Profiling reveals that the real culprit is the code line
> >> >>>>
> >> >>>> OS.reserve(OS.size() + 64);
> >> >>>>
> >> >>>> called from raw_svector_ostream::write_impl called from
> >> >>>> raw_svector_ostream::~raw_svector_ostream.
> >> >>>> The issue is already documented:
> >> >>>>
> >> >>>> raw_svector_ostream::~raw_svector_ostream() {
> >> >>>> // FIXME: Prevent resizing during this flush().
> >> >>>> flush();
> >> >>>> }
> >> >>>>
> >> >>>> And a solution is provided in raw_svector_ostream::init():
> >> >>>>
> >> >>>> // Set up the initial external buffer. We make sure that the
> buffer
> >> >>>> has at
> >> >>>> // least 128 bytes free; raw_ostream itself only requires 64, but
> >> >>>> we
> >> >>>> want to
> >> >>>> // make sure that we don't grow the buffer unnecessarily on
> >> >>>> destruction (when
> >> >>>> // the data is flushed). See the FIXME below.
> >> >>>> OS.reserve(OS.size() + 128);
> >> >>>>
> >> >>>> This solution may be worse than the problem. In total:
> >> >>>>
> >> >>>> * If the SmallString was less than 128 bytes init() will always(!)
> >> >>>> heap
> >> >>>> allocate.
> >> >>>> * If it's more or equal to 128 bytes but upon destruction less than
> >> >>>> 64
> >> >>>> bytes are left unused it will heap allocate for no reason.
> >> >>>>
> >> >>>> A careful programmer who did size the SmallString to match its use
> +
> >> >>>> small reserve will find either of these heap allocations
> surprising,
> >> >>>> negating the SmallString advantage and then paying memcpy cost.
> >> >>>>
> >> >>>> I filed http://llvm.org/pr23395 about this. To temporarly avoid
> this
> >> >>>> issue, in addition to the outline methods change I commented the
> >> >>>> OS.reserve
> >> >>>> line in flush(). (This change of course requires further review.)
> >> >>>>
> >> >>>> With reserve being out of the way, the gap now is smaller but still
> >> >>>> significant: raw_char_ostream is about 20% faster than
> >> >>>> raw_svector_ostream
> >> >>>> in Release=optimized configuration.
> >> >>>>
> >> >>>>
> >> >>>> 2015-05-02 4:39 GMT+03:00 Sean Silva <chisophugis at gmail.com>:
> >> >>>>>
> >> >>>>>
> >> >>>>> Could you dig into why:
> >> >>>>> + raw_ostream &write(unsigned char C) override {
> >> >>>>> + grow(1);
> >> >>>>> + *OutBufCur++ = C;
> >> >>>>> + return *this;
> >> >>>>> + }
> >> >>>>>
> >> >>>>> Is 3 times as fast as raw_svector_ostream? I don't see a good
> reason
> >> >>>>> why that should be any faster than:
> >> >>>>>
> >> >>>>> raw_ostream &operator<<(char C) {
> >> >>>>> if (OutBufCur >= OutBufEnd)
> >> >>>>> return write(C);
> >> >>>>> *OutBufCur++ = C;
> >> >>>>> return *this;
> >> >>>>> }
> >> >>>>>
> >> >>>>>
> >> >>>>> You might just be seeing the difference between having write
> inlined
> >> >>>>> vs. non-inlined, in which case your patch might be a complicated
> way
> >> >>>>> to pull
> >> >>>>> the likely cases of some raw_ostream methods into the header.
> >> >>>>>
> >> >>>>>
> >> >>>>> -- Sean Silva
> >> >>>>
> >> >>>>
> >> >>>> On Thu, Apr 30, 2015 at 8:02 PM, Yaron Keren <
> yaron.keren at gmail.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Yes, we should do without virtual flush. The problem was
> >> >>>>> raw_char_ostream should not be flushed - it's always current - so
> I
> >> >>>>> overloaded flush to a no-op. A cleaner solution is attached,
> adding
> >> >>>>> a
> >> >>>>> DirectBuffer mode to the raw_ostream.
> >> >>>>>
> >> >>>>> Also added a simple performance comparison project between
> >> >>>>> raw_char_ostream and raw_svector_ostream. On Window 7 x64 machine,
> >> >>>>> raw_char_ostream was three times faster than raw_svector_ostream
> >> >>>>> when the
> >> >>>>> provided buffer size is large enough and two times as fast with
> one
> >> >>>>> dynamic
> >> >>>>> allocation resize.
> >> >>>>>
> >> >>>>>
> >> >>>>> 2015-04-30 22:48 GMT+03:00 Rafael Espíndola
> >> >>>>> <rafael.espindola at gmail.com>:
> >> >>>>>>
> >> >>>>>> I don't think we should make flush virtual. Why do you need to do
> >> >>>>>> it?
> >> >>>>>> Can't you set up the base class to write to use the tail of the
> >> >>>>>> memory
> >> >>>>>> region as the buffer?
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 24 April 2015 at 06:46, Yaron Keren <yaron.keren at gmail.com>
> >> >>>>>> wrote:
> >> >>>>>> > Hi,
> >> >>>>>> >
> >> >>>>>> > Is this what you're thinking about?
> >> >>>>>> > The code is not tested yet, I'd like to know if the overall
> >> >>>>>> > direction and
> >> >>>>>> > design are correct.
> >> >>>>>> >
> >> >>>>>> > Yaron
> >> >>>>>> >
> >> >>>>>> >
> >> >>>>>> > 2015-04-21 0:10 GMT+03:00 Yaron Keren <yaron.keren at gmail.com>:
> >> >>>>>> >>
> >> >>>>>> >> Sean, thanks for reminding this, Alp did commit a class
> derived
> >> >>>>>> >> from
> >> >>>>>> >> raw_svector_ostream conatining an internal SmallString he
> called
> >> >>>>>> >> small_string_ostream. The commit was reverted after a day due
> to
> >> >>>>>> >> a
> >> >>>>>> >> disagreement about the commit approval and apparently
> abandoned.
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >>
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140623/223393.html
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >>
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140623/223557.html
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >>
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140630/223986.html
> >> >>>>>> >>
> >> >>>>>> >> The class Alp comitted did solve the possible mismatch between
> >> >>>>>> >> the
> >> >>>>>> >> SmallString and the stream by making the SmallString private
> to
> >> >>>>>> >> the
> >> >>>>>> >> class.
> >> >>>>>> >> This however did not treat the root problem, the duplication
> of
> >> >>>>>> >> the
> >> >>>>>> >> information about the buffer between SmallString and the
> stream.
> >> >>>>>> >>
> >> >>>>>> >> I can make performance measurements but even if the
> performance
> >> >>>>>> >> difference
> >> >>>>>> >> is neglible (and looking at all the code paths and
> conditionals
> >> >>>>>> >> of
> >> >>>>>> >> non-inlined functions at raw_ostream.cpp that's hard to
> >> >>>>>> >> believe),
> >> >>>>>> >> the
> >> >>>>>> >> current SmallString-raw_svector_ostream is simply wrong.
> >> >>>>>> >>
> >> >>>>>> >> Personally, after the previous "Alp" discussion I found and
> >> >>>>>> >> fixed
> >> >>>>>> >> several
> >> >>>>>> >> such issues in my out-of-tree code only to make new similar
> new
> >> >>>>>> >> error
> >> >>>>>> >> earlier this year, which I caught only months after, when
> Rafael
> >> >>>>>> >> committed
> >> >>>>>> >> the pwrite which reminded me the issue. Due ot the information
> >> >>>>>> >> duplication
> >> >>>>>> >> Rafael commit also had a bug and Mehdi reports similar
> >> >>>>>> >> experience.
> >> >>>>>> >> Back then
> >> >>>>>> >> Alp reported similar problems he found in LLVM code which were
> >> >>>>>> >> hopefully
> >> >>>>>> >> fixed otherwise.
> >> >>>>>> >>
> >> >>>>>> >> In addition to the information duplication, the design is
> quite
> >> >>>>>> >> confusing
> >> >>>>>> >> to use
> >> >>>>>> >>
> >> >>>>>> >> - Should one use the stream.str() function or the SmallString
> >> >>>>>> >> itself?
> >> >>>>>> >> - flush or str?
> >> >>>>>> >> - How do you properly clear the combination for reuse (that
> was
> >> >>>>>> >> my
> >> >>>>>> >> mistake, I forgot to resync after OS.clear())?
> >> >>>>>> >>
> >> >>>>>> >> It's safe to say this design pattern is very easy to get wrong
> >> >>>>>> >> one
> >> >>>>>> >> way or
> >> >>>>>> >> another, we got burned by it multiple times and it should be
> >> >>>>>> >> replaced.
> >> >>>>>> >>
> >> >>>>>> >> Alp suggested a derived class containing its own SmallString.
> >> >>>>>> >> That's a
> >> >>>>>> >> simple and effective approach to avoid the bugs mentioned but
> >> >>>>>> >> does
> >> >>>>>> >> not fix
> >> >>>>>> >> the memory or runtime issues. Small as they may be, why have
> >> >>>>>> >> them
> >> >>>>>> >> at a
> >> >>>>>> >> fundemental data structure?
> >> >>>>>> >>
> >> >>>>>> >> I was thinking about going further avoiding all duplication
> with
> >> >>>>>> >> a
> >> >>>>>> >> templated class derived with its own internal buffer storage.
> >> >>>>>> >>
> >> >>>>>> >> Rafael suggested a more modular approach, a derived adpater
> >> >>>>>> >> class
> >> >>>>>> >> adapter
> >> >>>>>> >> to a *simple* buffer (or nullptr for fully-dynamic operation)
> >> >>>>>> >> which
> >> >>>>>> >> also
> >> >>>>>> >> won't duplicate any information the buffer is dumb and has no
> >> >>>>>> >> state.
> >> >>>>>> >>
> >> >>>>>> >> That solution sounds simple, efficient and safe to use. The
> >> >>>>>> >> implementation
> >> >>>>>> >> would be actually simpler then raw_svector_ostream since all
> the
> >> >>>>>> >> coordination logic is not required anymore.
> >> >>>>>> >>
> >> >>>>>> >>
> >> >>>>>> >> 2015-04-20 22:17 GMT+03:00 Sean Silva <chisophugis at gmail.com
> >:
> >> >>>>>> >>>
> >> >>>>>> >>>
> >> >>>>>> >>>
> >> >>>>>> >>> On Sun, Apr 19, 2015 at 7:40 AM, Yaron Keren
> >> >>>>>> >>> <yaron.keren at gmail.com>
> >> >>>>>> >>> wrote:
> >> >>>>>> >>>>
> >> >>>>>> >>>> A very common code pattern in LLVM is
> >> >>>>>> >>>>
> >> >>>>>> >>>> SmallString<128> S;
> >> >>>>>> >>>> raw_svector_ostream OS(S);
> >> >>>>>> >>>> OS<< ...
> >> >>>>>> >>>> Use OS.str()
> >> >>>>>> >>>>
> >> >>>>>> >>>> While raw_svector_ostream is smart to share the text buffer
> >> >>>>>> >>>> itself, it's
> >> >>>>>> >>>> inefficient keeping two sets of pointers to the same buffer:
> >> >>>>>> >>>>
> >> >>>>>> >>>> In SmallString: void *BeginX, *EndX, *CapacityX
> >> >>>>>> >>>> In raw_ostream: char *OutBufStart, *OutBufEnd, *OutBufCur
> >> >>>>>> >>>>
> >> >>>>>> >>>> Moreover, at runtime the two sets of pointers need to be
> >> >>>>>> >>>> coordinated
> >> >>>>>> >>>> between the SmallString and raw_svector_ostream using
> >> >>>>>> >>>> raw_svector_ostream::init, raw_svector_ostream::pwrite,
> >> >>>>>> >>>> raw_svector_ostream::resync and
> >> >>>>>> >>>> raw_svector_ostream::write_impl.
> >> >>>>>> >>>> All these functions have non-inlined implementations in
> >> >>>>>> >>>> raw_ostream.cpp.
> >> >>>>>> >>>>
> >> >>>>>> >>>> Finally, this may cause subtle bugs if S is modified without
> >> >>>>>> >>>> calling
> >> >>>>>> >>>> OS::resync(). This is too easy to do by mistake.
> >> >>>>>> >>>>
> >> >>>>>> >>>> In this frequent case usage the client does not really care
> >> >>>>>> >>>> about
> >> >>>>>> >>>> S
> >> >>>>>> >>>> being a SmallString with its many useful string helper
> >> >>>>>> >>>> function.
> >> >>>>>> >>>> It's just
> >> >>>>>> >>>> boilerplate code for raw_svector_ostream. But it does cost
> >> >>>>>> >>>> three
> >> >>>>>> >>>> extra
> >> >>>>>> >>>> pointers, some runtime performance and possible bugs.
> >> >>>>>> >>>
> >> >>>>>> >>>
> >> >>>>>> >>> I agree the bugs are real (Alp proposed something a while
> back
> >> >>>>>> >>> regarding
> >> >>>>>> >>> this?), but you will need to provide measurements to justify
> >> >>>>>> >>> the
> >> >>>>>> >>> cost in
> >> >>>>>> >>> runtime performance. One technique I have used in the past to
> >> >>>>>> >>> measure these
> >> >>>>>> >>> sorts of things I call "stuffing": take the operation that
> you
> >> >>>>>> >>> want to
> >> >>>>>> >>> measure, then essentially change the logic so that you pay
> the
> >> >>>>>> >>> cost 2 times,
> >> >>>>>> >>> 3 times, etc. You can then look at the trend in performance
> as
> >> >>>>>> >>> N
> >> >>>>>> >>> varies and
> >> >>>>>> >>> extrapolate back to the case where N = 0 (i.e. you don't pay
> >> >>>>>> >>> the
> >> >>>>>> >>> cost).
> >> >>>>>> >>>
> >> >>>>>> >>> For example, in one situation where I used this method it was
> >> >>>>>> >>> to
> >> >>>>>> >>> measure
> >> >>>>>> >>> the cost of stat'ing files (sys::fs::status) across a
> holistic
> >> >>>>>> >>> build, using
> >> >>>>>> >>> only "time" on the command line (it was on Windows and I
> didn't
> >> >>>>>> >>> have any
> >> >>>>>> >>> tools like DTrace available that can directly measure this).
> In
> >> >>>>>> >>> order to do
> >> >>>>>> >>> this, I changed sys::fs::status to call stat N times instead
> of
> >> >>>>>> >>> 1,
> >> >>>>>> >>> and
> >> >>>>>> >>> measured with N=1 N=2 N=3 etc. The result was that the
> >> >>>>>> >>> difference
> >> >>>>>> >>> between
> >> >>>>>> >>> the N and N+1 versions was about 1-2% across N=1..10 (or
> >> >>>>>> >>> whatever
> >> >>>>>> >>> I
> >> >>>>>> >>> measured). In order to negate caching and other confounding
> >> >>>>>> >>> effects, it is
> >> >>>>>> >>> important to try different distributions of stats; e.g. the
> >> >>>>>> >>> extra
> >> >>>>>> >>> stats are
> >> >>>>>> >>> on the same file as the "real" stat vs. the extra stats are
> on
> >> >>>>>> >>> nonexistent
> >> >>>>>> >>> files in the same directory as the "real" file vs. parent
> >> >>>>>> >>> directories of the
> >> >>>>>> >>> "real" file; if these match up fairly well (they did), then
> you
> >> >>>>>> >>> have some
> >> >>>>>> >>> indication that the "stuffing" is measuring what you want to
> >> >>>>>> >>> measure.
> >> >>>>>> >>>
> >> >>>>>> >>> So e.g. if you think the cost of 3 extra pointers is
> >> >>>>>> >>> significant,
> >> >>>>>> >>> then
> >> >>>>>> >>> "stuff" the struct with 3, 6, 9, ... extra pointers and
> measure
> >> >>>>>> >>> the
> >> >>>>>> >>> difference in performance (e.g. measure the time building a
> >> >>>>>> >>> real
> >> >>>>>> >>> project).
> >> >>>>>> >>>
> >> >>>>>> >>> -- Sean Silva
> >> >>>>>> >>>
> >> >>>>>> >>>>
> >> >>>>>> >>>>
> >> >>>>>> >>>> To solve all three issues, would it make sense to have
> >> >>>>>> >>>> raw_ostream-derived container with a its own SmallString
> like
> >> >>>>>> >>>> templated-size
> >> >>>>>> >>>> built-in buffer?
> >> >>>>>> >>>>
> >> >>>>>> >>>
> >> >>>>>> >
> >> >>>>>
> >> >>>>>
> >> >>>>
> >> >
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150813/fddfda3b/attachment.html>
More information about the llvm-dev
mailing list