[llvm-commits] Speeding up RegAllocLinearScan on big test-cases
Roman Levenstein
romix.llvm at googlemail.com
Fri May 16 08:20:29 PDT 2008
Hi,
2008/5/7 Evan Cheng <evan.cheng at apple.com>:
> Can we hook up the llvm pool allocator to std::set and use it for the
> register allocator? It's simple and it made a huge difference on Mac
> OS X when we switched all LiveInterval VNInfo allocations to it.
>
> Evan
Yes, we can hook up the LLVM pool allocator to std::set. I have a
working implementation.
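To illustrate, here is a minimal, self-contained sketch of the kind of
adapter I mean (all names are hypothetical, and the pool is deliberately
simplified compared to LLVM's real pool allocator): an STL-compliant
allocator that lets std::set draw its nodes from a bump-pointer pool
which frees everything at once:

  #include <cstddef>
  #include <cstdint>
  #include <cstdlib>
  #include <functional>
  #include <new>
  #include <set>

  // Trivial bump-pointer pool: carves allocations out of one big block
  // and frees everything at once in the destructor. (A real pool grows
  // in slabs; a fixed capacity keeps the sketch short.)
  class BumpPool {
    char *Base, *Cur, *End;
  public:
    explicit BumpPool(std::size_t Bytes)
      : Base(static_cast<char*>(std::malloc(Bytes))),
        Cur(Base), End(Base + Bytes) {}
    ~BumpPool() { std::free(Base); }

    void *Allocate(std::size_t Size, std::size_t Align) {
      std::uintptr_t P = reinterpret_cast<std::uintptr_t>(Cur);
      P = (P + Align - 1) & ~std::uintptr_t(Align - 1); // round up
      char *Result = reinterpret_cast<char*>(P);
      if (Result + Size > End)
        throw std::bad_alloc();                 // sketch: no slab growth
      Cur = Result + Size;
      return Result;
    }
  };

  // Minimal STL allocator forwarding to the pool. Deallocate is a
  // no-op; all nodes die together when the pool is destroyed.
  template <typename T>
  struct PoolSTLAllocator {
    typedef T value_type;
    BumpPool *Pool;                             // not owned

    explicit PoolSTLAllocator(BumpPool &P) : Pool(&P) {}
    template <typename U>
    PoolSTLAllocator(const PoolSTLAllocator<U> &O) : Pool(O.Pool) {}

    T *allocate(std::size_t N) {
      return static_cast<T*>(Pool->Allocate(N * sizeof(T), alignof(T)));
    }
    void deallocate(T *, std::size_t) {}        // freed en masse
  };

  template <typename T, typename U>
  bool operator==(const PoolSTLAllocator<T> &A,
                  const PoolSTLAllocator<U> &B) { return A.Pool == B.Pool; }
  template <typename T, typename U>
  bool operator!=(const PoolSTLAllocator<T> &A,
                  const PoolSTLAllocator<U> &B) { return A.Pool != B.Pool; }

  // Usage: the set's nodes now come from the pool instead of malloc.
  void demo() {
    BumpPool Pool(1 << 20);
    std::set<int, std::less<int>, PoolSTLAllocator<int> >
        S((std::less<int>()), PoolSTLAllocator<int>(Pool));
    for (int I = 0; I != 1000; ++I)
      S.insert(I);
  } // pool destructor releases all node memory at once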
> On May 7, 2008, at 1:24 AM, Roman Levenstein wrote:
>
> > Hi,
> >
> > 2008/5/7 Bill Wendling <isanbard at gmail.com>:
> >> On Tue, May 6, 2008 at 3:31 PM, Evan Cheng <evan.cheng at apple.com>
> >> wrote:
> >>> On May 6, 2008, at 6:13 AM, Roman Levenstein
> >>
> >>
> > >>>> But one thing about std::set that could eventually be
> > >>>> interesting in many places is the following:
> > >>>> - in many situations we know the maximal size of the set in
> > >>>> advance. For example, in this patch the set can contain at most
> > >>>> all live intervals, and in the scheduler the availableQueue can
> > >>>> contain at most all SUnits. This means that if we were able to
> > >>>> allocate the memory for the maximum possible number of elements
> > >>>> in advance, there would be no need for any additional memory
> > >>>> allocation.
> >>>>
> > >>>> - I think a custom STL allocator could be written that does
> > >>>> exactly this. It would reserve memory for the maximum number of
> > >>>> elements (of equal size?) and maintain a free list of cells.
> > >>>> Then we could have very efficient allocation, and sets that do
> > >>>> not produce too much malloc/free pressure. The same idea could
> > >>>> also be used for some other STL containers.
> >>>>
> >>>> What do you think?
> >>>
> >>> I am not sure. I have little experience with custom STL allocators.
> >>> Perhaps others will chime in. For now, using std::set is fine.
> >>>
> >> I created custom allocators before. They aren't that bad. You just
> >> have to get the functionality correct (the allocators I created
> >> called
> >> a malloc function that already had this functionality). The major
> >> caveat is that if you don't have a well-tested memory manager
> >> available, this can lead to nasty bugs. I would stick with std::set
> >> unless it becomes a major albatross around our necks.
> >
> > I have a lot of experience with custom allocators. I used them
> > extensively in an optimizing C compiler that I wrote a few years ago
> > for an embedded target. Initially I was using simple malloc/free and
> > new/delete. Then I moved to GC, but in the end I switched to custom
> > memory allocators. I can only say that they had a very positive
> > impact on the performance of my compiler.
> >
> > Custom memory allocators are actually not such a black art as they
> > may seem at first glance, and there are quite well-proven allocators
> > around. Usually they are not that complex, and it is rather easy to
> > see whether they are correct. Normally they are used for node-based
> > STL containers or for the most typical nodes created by the compiler
> > (e.g. SDNode, SDOperand, SUnit, etc.). And they can really speed
> > things up, e.g. if they use pool allocation or segregated storage,
> > or if they free all objects at once.
> > For example, imagine that we use such an allocator for SUnit nodes.
> > It could reserve memory for N SUnit objects up front and allocate
> > them very quickly. Once scheduling is over, all such objects can be
> > freed at once, simply by clearing or deleting the custom allocator.
> >
> > I'm currently experimenting with 4-5 different custom STL allocators
> > that I have found on the Internet. Once I have representative
> > figures comparing them against STL's standard allocator, and after
> > I have cleaned up the code, I'll report about it to this mailing
> > list.
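As a concrete illustration of the "maximum size known in advance" idea
quoted above, here is a sketch (names hypothetical) of a fixed-capacity
pool that pre-allocates N equal-sized cells and serves them from a free
list, so both allocation and deallocation are O(1) pointer operations:

  #include <cstddef>
  #include <cstdlib>
  #include <new>

  template <std::size_t CellSize>
  class FixedPool {
    // A cell holds either an object or, while free, a link to the next
    // free cell; sizeof(Cell) >= sizeof(void*) by construction.
    union Cell { Cell *Next; char Raw[CellSize]; };
    Cell *Cells;
    Cell *FreeList;
  public:
    explicit FixedPool(std::size_t N)            // assumes N >= 1
      : Cells(static_cast<Cell*>(std::malloc(N * sizeof(Cell)))),
        FreeList(Cells) {
      for (std::size_t I = 0; I + 1 < N; ++I)    // thread the free list
        Cells[I].Next = &Cells[I + 1];
      Cells[N - 1].Next = 0;
    }
    ~FixedPool() { std::free(Cells); }           // everything dies at once

    void *allocate() {
      if (!FreeList)
        throw std::bad_alloc();                  // capacity fixed up front
      Cell *C = FreeList;
      FreeList = C->Next;
      return C;
    }
    void deallocate(void *P) {                   // O(1) push on free list
      Cell *C = static_cast<Cell*>(P);
      C->Next = FreeList;
      FreeList = C;
    }
  };

An STL adapter would rebind this to the container's node size; for the
scheduler, one could size the pool at the number of SUnits and drop the
whole pool after scheduling.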
OK. So, I've tested 6 allocators. One of them is the standard STL
allocator; the others were found by me on the Internet and made
STL-compliant via a special templatized adapter class. I don't want to
go into detail at this moment, since the code is not quite polished
yet, but I'd mention that bump_allocator is an STL-compliant version of
LLVM's pool allocator.
My test checks their performance when allocating node-based containers,
i.e. std::list and std::set. I insert 1000000 nodes into the list and
the set using each of the allocators. While doing this, I observe the
following picture:
***Tests with <list>***
Sort (ss_allocator):0.517465
Sort (fsb_allocator):0.693605
Sort (bump_allocator):0.5398639999
Sort (fastalloc_allocator):0.5254200001
Sort (boost_fastalloc_allocator):0.520713
Sort (default allocator):0.631207
***Tests with <set>***
Insertion (ss_allocator):0.8642740001
Insertion (fsb_allocator):0.932031
Insertion (bump_allocator):0.9571639999
Insertion (fast_allocator):0.950616
Insertion (boost_fast_allocator):0.9666030001
Insertion (default_allocator):1.210076
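(For reference, each figure above comes from a loop of essentially this
shape; the harness below is just a sketch with placeholder names, not
the exact test code:)

  #include <cstdio>
  #include <ctime>
  #include <functional>
  #include <memory>
  #include <set>

  // Insert 1000000 nodes into a std::set parameterized by the
  // allocator under test; report elapsed CPU time, teardown included.
  template <typename Alloc>
  double timeSetInsertion(const Alloc &A) {
    std::clock_t Start = std::clock();
    {
      std::set<int, std::less<int>, Alloc> S((std::less<int>()), A);
      for (int I = 0; I != 1000000; ++I)
        S.insert(I);
    } // all nodes are deallocated here
    return double(std::clock() - Start) / CLOCKS_PER_SEC;
  }

  int main() {
    std::printf("Insertion (default_allocator):%f\n",
                timeSetInsertion(std::allocator<int>()));
  }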
So we can see that, performance-wise, the difference is not that huge.
But if we look at the number of new/delete calls, the picture is quite
different:
1) without the STL standard allocator - a total of only 847(!!!)
mallocs across all of the custom allocators together, while adding
1000000 nodes with each of them.
2) with the STL standard allocator included - a total of 2000628
mallocs across all of the allocators together, while adding 1000000
nodes with each of them.
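(One simple way to obtain such counts, sketched here, is to replace the
global operator new/delete with counting wrappers around malloc/free;
the array forms are omitted for brevity:)

  #include <cstddef>
  #include <cstdio>
  #include <cstdlib>
  #include <new>

  static std::size_t NumAllocs = 0;   // bumped on every global new

  void *operator new(std::size_t Size) {
    ++NumAllocs;
    if (void *P = std::malloc(Size ? Size : 1))
      return P;
    throw std::bad_alloc();
  }

  void operator delete(void *P) noexcept {
    std::free(P);
  }

  void reportAllocs() {
    std::printf("Total mallocs: %lu\n", (unsigned long)NumAllocs);
  }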
So, the standard STL allocator produces a huge number of new/delete
calls, and the custom allocators reduce that number by more than three
orders of magnitude. But, as mentioned before, this DOES NOT result in
a big performance difference on my Ubuntu/Linux/x86 machine, which
indicates that mallocs are very efficient here. But for you it seems to
be very different...
So the question is: why does the standard STL allocator perform so
poorly on the PowerPC systems that you seem to use? Could it be that
the malloc or STL implementation of your OS/compiler is particularly
bad? Would it be possible for you to try a custom malloc implementation
in place of the system malloc? E.g., could you try linking LLVM with
Doug Lea's malloc, which is the standard implementation on Linux/x86
systems? Or do you have other explanations for this?
BTW, Boost seems to provide a nice pool allocator, and it is
"production grade" and "bullet-proof" compared to many others. Would it
be too bad if LLVM used it? This does not mean that the whole of Boost
would have to be available: there is a special tool provided with Boost
(bcp) that extracts only those subsets of it that are required. And
Boost has a BSD-like license.
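If we went down that road, usage would be as simple as the sketch below
(Boost.Pool's fast_pool_allocator is the variant intended for node-based
containers such as std::set and std::list):

  #include <set>
  #include <boost/pool/pool_alloc.hpp>

  // Nodes of this set are served from a Boost singleton pool instead
  // of going through malloc for every insertion.
  typedef std::set<int, std::less<int>,
                   boost::fast_pool_allocator<int> > PooledIntSet;

  void demo() {
    PooledIntSet S;
    for (int I = 0; I != 1000; ++I)
      S.insert(I);
    // The underlying singleton pool can also be released explicitly;
    // see boost::singleton_pool in the Boost.Pool documentation.
  }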
-Roman