[llvm-commits] Speeding up RegAllocLinearScan on big test-cases
Roman Levenstein
romix.llvm at googlemail.com
Fri May 16 08:20:29 PDT 2008
Hi,
2008/5/7 Evan Cheng <evan.cheng at apple.com>:
> Can we hook up the llvm pool allocator to std::set and use it for the
> register allocator? It's simple and it made a huge difference on Mac
> OS X when we switched all LiveInterval VNInfo allocations to it.
>
> Evan
Yes, we can hook up the LLVM pool allocator to std::set. I have a
working implementation.
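To illustrate, here is a minimal, self-contained sketch of the kind of
adapter I mean (all names are hypothetical, and the pool is deliberately
simplified compared to LLVM's real pool allocator): an STL-compliant
allocator that lets std::set draw its nodes from a bump-pointer pool
which frees everything at once:

  #include <cstddef>
  #include <cstdint>
  #include <cstdlib>
  #include <functional>
  #include <new>
  #include <set>

  // Trivial bump-pointer pool: carves allocations out of one big block
  // and frees everything at once in the destructor. (A real pool grows
  // in slabs; a fixed capacity keeps the sketch short.)
  class BumpPool {
    char *Base, *Cur, *End;
  public:
    explicit BumpPool(std::size_t Bytes)
      : Base(static_cast<char*>(std::malloc(Bytes))),
        Cur(Base), End(Base + Bytes) {}
    ~BumpPool() { std::free(Base); }

    void *Allocate(std::size_t Size, std::size_t Align) {
      std::uintptr_t P = reinterpret_cast<std::uintptr_t>(Cur);
      P = (P + Align - 1) & ~std::uintptr_t(Align - 1); // round up
      char *Result = reinterpret_cast<char*>(P);
      if (Result + Size > End)
        throw std::bad_alloc();                 // sketch: no slab growth
      Cur = Result + Size;
      return Result;
    }
  };

  // Minimal STL allocator forwarding to the pool. Deallocate is a
  // no-op; all nodes die together when the pool is destroyed.
  template <typename T>
  struct PoolSTLAllocator {
    typedef T value_type;
    BumpPool *Pool;                             // not owned

    explicit PoolSTLAllocator(BumpPool &P) : Pool(&P) {}
    template <typename U>
    PoolSTLAllocator(const PoolSTLAllocator<U> &O) : Pool(O.Pool) {}

    T *allocate(std::size_t N) {
      return static_cast<T*>(Pool->Allocate(N * sizeof(T), alignof(T)));
    }
    void deallocate(T *, std::size_t) {}        // freed en masse
  };

  template <typename T, typename U>
  bool operator==(const PoolSTLAllocator<T> &A,
                  const PoolSTLAllocator<U> &B) { return A.Pool == B.Pool; }
  template <typename T, typename U>
  bool operator!=(const PoolSTLAllocator<T> &A,
                  const PoolSTLAllocator<U> &B) { return A.Pool != B.Pool; }

  // Usage: the set's nodes now come from the pool instead of malloc.
  void demo() {
    BumpPool Pool(1 << 20);
    std::set<int, std::less<int>, PoolSTLAllocator<int> >
        S((std::less<int>()), PoolSTLAllocator<int>(Pool));
    for (int I = 0; I != 1000; ++I)
      S.insert(I);
  } // pool destructor releases all node memory at once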
> On May 7, 2008, at 1:24 AM, Roman Levenstein wrote:
>
> > Hi,
> >
> > 2008/5/7 Bill Wendling <isanbard at gmail.com>:
> >> On Tue, May 6, 2008 at 3:31 PM, Evan Cheng <evan.cheng at apple.com>
> >> wrote:
> >>> On May 6, 2008, at 6:13 AM, Roman Levenstein
> >>
> >>
> > >>>> But one thing about std::set that could eventually be
> > >>>> interesting in many places is the following:
> > >>>> - in many situations we know the maximal size of the set in
> > >>>> advance. For example, in this patch the set can contain at most
> > >>>> all live intervals, and in the scheduler the availableQueue can
> > >>>> contain at most all SUnits. This means that if we were able to
> > >>>> allocate the memory for the maximum possible number of elements
> > >>>> in advance, there would be no need for any additional memory
> > >>>> allocation.
> >>>>
> > >>>> - I think a custom STL allocator could be written that does
> > >>>> exactly this. It would reserve memory for the maximum number of
> > >>>> elements (of equal size?) and maintain a free list of cells.
> > >>>> Then we could have very efficient allocation, and sets that do
> > >>>> not produce too much malloc/free pressure. The same idea could
> > >>>> also be used for some other STL containers.
> >>>>
> >>>> What do you think?
> >>>
> >>> I am not sure. I have little experience with custom STL allocators.
> >>> Perhaps others will chime in. For now, using std::set is fine.
> >>>
> >> I created custom allocators before. They aren't that bad. You just
> >> have to get the functionality correct (the allocators I created
> >> called
> >> a malloc function that already had this functionality). The major
> >> caveat is that if you don't have a well-tested memory manager
> >> available, this can lead to nasty bugs. I would stick with std::set
> >> unless it becomes a major albatross around our necks.
> >
> > I have a lot of experience with custom allocators. I used them
> > extensively in an optimizing C compiler that I wrote a few years ago
> > for an embedded target. Initially I was using simple malloc/free and
> > new/delete. Then I moved to GC, but in the end I switched to custom
> > memory allocators. I can only say that they had a very positive
> > impact on the performance of my compiler.
> >
> > Custom memory allocators are actually not such a black art as they
> > may seem at first glance, and there are quite well-proven allocators
> > around. Usually they are not that complex, and it is rather easy to
> > see whether they are correct. Normally they are used for node-based
> > STL containers or for the most typical nodes created by the compiler
> > (e.g. SDNode, SDOperand, SUnit, etc.). And they can really speed
> > things up, e.g. if they use pool allocation or segregated storage,
> > or if they free all objects at once.
> > For example, imagine that we use such an allocator for SUnit nodes.
> > It could reserve memory for N SUnit objects up front and allocate
> > them very quickly. Once scheduling is over, all such objects can be
> > freed at once, simply by clearing or deleting the custom allocator.
> >
> > I'm currently experimenting with 4-5 different custom STL allocators
> > that I have found on the Internet. Once I have representative
> > figures comparing them against STL's standard allocator, and after
> > I have cleaned up the code, I'll report about it to this mailing
> > list.
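As a concrete illustration of the "maximum size known in advance" idea
quoted above, here is a sketch (names hypothetical) of a fixed-capacity
pool that pre-allocates N equal-sized cells and serves them from a free
list, so both allocation and deallocation are O(1) pointer operations:

  #include <cstddef>
  #include <cstdlib>
  #include <new>

  template <std::size_t CellSize>
  class FixedPool {
    // A cell holds either an object or, while free, a link to the next
    // free cell; sizeof(Cell) >= sizeof(void*) by construction.
    union Cell { Cell *Next; char Raw[CellSize]; };
    Cell *Cells;
    Cell *FreeList;
  public:
    explicit FixedPool(std::size_t N)            // assumes N >= 1
      : Cells(static_cast<Cell*>(std::malloc(N * sizeof(Cell)))),
        FreeList(Cells) {
      for (std::size_t I = 0; I + 1 < N; ++I)    // thread the free list
        Cells[I].Next = &Cells[I + 1];
      Cells[N - 1].Next = 0;
    }
    ~FixedPool() { std::free(Cells); }           // everything dies at once

    void *allocate() {
      if (!FreeList)
        throw std::bad_alloc();                  // capacity fixed up front
      Cell *C = FreeList;
      FreeList = C->Next;
      return C;
    }
    void deallocate(void *P) {                   // O(1) push on free list
      Cell *C = static_cast<Cell*>(P);
      C->Next = FreeList;
      FreeList = C;
    }
  };

An STL adapter would rebind this to the container's node size; for the
scheduler, one could size the pool at the number of SUnits and drop the
whole pool after scheduling.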
OK. So, I've tested 6 allocators. One of them is the standard STL
allocator; the others were found by me on the Internet and made
STL-compliant via a special templatized adapter class. I don't want to
go into detail at this moment, since the code is not quite polished
yet, but I'd mention that bump_allocator is an STL-compliant version of
LLVM's pool allocator.
My test checks their performance when allocating node-based containers,
i.e. std::list and std::set. I insert 1000000 nodes into the list and
the set using each of the allocators. While doing this, I observe the
following picture:
***Tests with <list>***
Sort (ss_allocator):0.517465
Sort (fsb_allocator):0.693605
Sort (bump_allocator):0.5398639999
Sort (fastalloc_allocator):0.5254200001
Sort (boost_fastalloc_allocator):0.520713
Sort (default allocator):0.631207
***Tests with <set>***
Insertion (ss_allocator):0.8642740001
Insertion (fsb_allocator):0.932031
Insertion (bump_allocator):0.9571639999
Insertion (fast_allocator):0.950616
Insertion (boost_fast_allocator):0.9666030001
Insertion (default_allocator):1.210076
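(For reference, each figure above comes from a loop of essentially this
shape; the harness below is just a sketch with placeholder names, not
the exact test code:)

  #include <cstdio>
  #include <ctime>
  #include <functional>
  #include <memory>
  #include <set>

  // Insert 1000000 nodes into a std::set parameterized by the
  // allocator under test; report elapsed CPU time, teardown included.
  template <typename Alloc>
  double timeSetInsertion(const Alloc &A) {
    std::clock_t Start = std::clock();
    {
      std::set<int, std::less<int>, Alloc> S((std::less<int>()), A);
      for (int I = 0; I != 1000000; ++I)
        S.insert(I);
    } // all nodes are deallocated here
    return double(std::clock() - Start) / CLOCKS_PER_SEC;
  }

  int main() {
    std::printf("Insertion (default_allocator):%f\n",
                timeSetInsertion(std::allocator<int>()));
  }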
So we can see that, performance-wise, the difference is not that huge.
But if we look at the number of new/delete calls, the picture is quite
different:
1) without the STL standard allocator - a total of only 847(!!!)
mallocs across all of the custom allocators together, while adding
1000000 nodes with each of them.
2) with the STL standard allocator included - a total of 2000628
mallocs across all of the allocators together, while adding 1000000
nodes with each of them.
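(One simple way to obtain such counts, sketched here, is to replace the
global operator new/delete with counting wrappers around malloc/free;
the array forms are omitted for brevity:)

  #include <cstddef>
  #include <cstdio>
  #include <cstdlib>
  #include <new>

  static std::size_t NumAllocs = 0;   // bumped on every global new

  void *operator new(std::size_t Size) {
    ++NumAllocs;
    if (void *P = std::malloc(Size ? Size : 1))
      return P;
    throw std::bad_alloc();
  }

  void operator delete(void *P) noexcept {
    std::free(P);
  }

  void reportAllocs() {
    std::printf("Total mallocs: %lu\n", (unsigned long)NumAllocs);
  }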
So, the standard STL allocator produces a huge number of new/delete
calls, and the custom allocators reduce that number by more than three
orders of magnitude. But, as mentioned before, this DOES NOT result in
a big performance difference on my Ubuntu/Linux/x86 machine, which
indicates that mallocs are very efficient here. But for you it seems to
be very different...
So the question is: why does the standard STL allocator perform so
poorly on the PowerPC systems that you seem to use? Could it be that
the malloc or STL implementation of your OS/compiler is particularly
bad? Would it be possible for you to try a custom malloc implementation
in place of the system malloc? E.g., could you try linking LLVM with
Doug Lea's malloc, which is the standard implementation on Linux/x86
systems? Or do you have other explanations for this?
BTW, Boost seems to provide a nice pool allocator, and it is
"production grade" and "bullet-proof" compared to many others. Would it
be too bad if LLVM used it? This does not mean that the whole of Boost
would have to be available: there is a special tool provided with Boost
(bcp) that extracts only those subsets of it that are required. And
Boost has a BSD-like license.
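If we went down that road, usage would be as simple as the sketch below
(Boost.Pool's fast_pool_allocator is the variant intended for node-based
containers such as std::set and std::list):

  #include <set>
  #include <boost/pool/pool_alloc.hpp>

  // Nodes of this set are served from a Boost singleton pool instead
  // of going through malloc for every insertion.
  typedef std::set<int, std::less<int>,
                   boost::fast_pool_allocator<int> > PooledIntSet;

  void demo() {
    PooledIntSet S;
    for (int I = 0; I != 1000; ++I)
      S.insert(I);
    // The underlying singleton pool can also be released explicitly;
    // see boost::singleton_pool in the Boost.Pool documentation.
  }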
-Roman