[PATCH] [PATCH/RFC] Use a BumpAllocator for DIEs
Duncan P. N. Exon Smith
dexonsmith at apple.com
Tue Jan 20 16:49:15 PST 2015
> On 2015 Jan 20, at 16:44, Frédéric Riss <friss at apple.com> wrote:
>
>>
>> On Jan 20, 2015, at 3:33 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>>
>>>
>>> On 2015 Jan 20, at 12:59, Frédéric Riss <friss at apple.com> wrote:
>>>
>>>
>>>> On Jan 20, 2015, at 12:19 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>>>>
>>>>>
>>>>> On 2015 Jan 20, at 11:28, Frederic Riss <friss at apple.com> wrote:
>>>>>
>>>>> Hi dblaikie, echristo,
>>>>>
>>>>> First of all, although this patch passes the testsuite, I don't
>>>>> think it is ready to go in, at least not without a thorough
>>>>> discussion of its downsides.
>>>>>
>>>>> The patch moves away from using unique_ptr<>s to track DIE
>>>>> object memory and instead allocates most DIEs with a
>>>>> BumpAllocator, which makes releasing the memory for the DIE
>>>>> tree basically free. (DIEBlocks and DIELocs are still allocated
>>>>> with the standard allocator, but they already used separate
>>>>> bookkeeping, so this doesn't complicate the patch.)
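>>>>>
>>>>> To make the trade-off concrete, here is a minimal sketch of the
>>>>> general allocation pattern (a stand-in Node type, not the real DIE
>>>>> class or the actual patch): objects are placement-new'd into a
>>>>> BumpPtrAllocator and the whole arena is dropped at once, so no
>>>>> per-object destructor ever runs.
>>>>>
>>>>> #include "llvm/Support/Allocator.h"
>>>>>
>>>>> struct Node {            // stand-in for a DIE-like object
>>>>>   unsigned Tag;
>>>>>   Node *FirstChild = nullptr;
>>>>>   Node *Sibling = nullptr;
>>>>>   explicit Node(unsigned T) : Tag(T) {}
>>>>> };
>>>>>
>>>>> int main() {
>>>>>   llvm::BumpPtrAllocator Alloc;
>>>>>
>>>>>   // Instead of make_unique<Node>(...), carve each object out of the
>>>>>   // bump allocator; no individual delete is ever issued.
>>>>>   Node *Root = new (Alloc.Allocate<Node>()) Node(0x11 /*DW_TAG_compile_unit*/);
>>>>>   Node *Child = new (Alloc.Allocate<Node>()) Node(0x2e /*DW_TAG_subprogram*/);
>>>>>   Root->FirstChild = Child;
>>>>>
>>>>>   // Teardown: Alloc's destructor frees all slabs in one shot.
>>>>>   // (Only safe because Node has a trivial destructor we can skip.)
>>>>>   return 0;
>>>>> }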
>>>>>
>>>>> I'm going down this road because the teardown of the DIE tree
>>>>> uses a lot of CPU time, especially in dsymutil's use case. In
>>>>> my prototype, the DIE destructors take ~40 seconds to run for
>>>>> DIE trees worth 700MB of debug information. That's quite a
>>>>> lot, especially considering that dsymutil's runtime on this kind
>>>>> of data should be around 2 minutes. I really need to find an
>>>>> acceptable solution for dsymutil's use case.
>>>>>
>>>>> This patch achieves the performance gains that I expect, but there
>>>>> is one big downside: it limits the number of attributes a DIE can
>>>>> accept to a fixed maximum. (Technically, if you remove the assert
>>>>> in the code it will still work, but it will leak the attribute
>>>>> vectors that grow past the limit.)
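>>>>>
>>>>> To illustrate the shape of that constraint (this is not the actual
>>>>> patch, just a hypothetical mock-up with made-up names): attribute
>>>>> storage becomes a fixed-capacity array, and overflowing it trips an
>>>>> assert instead of reallocating.
>>>>>
>>>>> #include <cassert>
>>>>>
>>>>> struct AttrValue {};                   // stand-in for a DIE value
>>>>>
>>>>> struct FixedAttrDIE {
>>>>>   static const unsigned MaxAttrs = 16; // the hard limit (16 is arbitrary here)
>>>>>   AttrValue *Attrs[MaxAttrs];
>>>>>   unsigned NumAttrs = 0;
>>>>>
>>>>>   void addValue(AttrValue *V) {
>>>>>     // As noted above: without the assert the fallback would be
>>>>>     // heap-backed vectors whose memory is never reclaimed; this
>>>>>     // mock-up simply refuses to go past the limit.
>>>>>     assert(NumAttrs < MaxAttrs && "too many attributes for fixed storage");
>>>>>     Attrs[NumAttrs++] = V;
>>>>>   }
>>>>> };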
>>>>
>>>> Is there a way to explicitly walk through and clear these vectors?
>>>> Or is that already too expensive (in most cases it'll be a no-op,
>>>> but you'll still need to do the walk)?
>>>>
>>>> (IIRC, `SmallVector<>::clear()` deallocates.)
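>>>>
>>>> (One way the walk could look, with made-up names rather than the
>>>> real DIE API: recurse through the bump-allocated tree and run each
>>>> node's destructor so any heap-backed attribute storage is given
>>>> back, which is nearly free for nodes that stayed inline.)
>>>>
>>>> #include "llvm/ADT/SmallVector.h"
>>>>
>>>> struct FakeDIE {
>>>>   llvm::SmallVector<int, 8> Values;          // stand-in for the value list
>>>>   llvm::SmallVector<FakeDIE *, 4> Children;  // bump-allocated children
>>>> };
>>>>
>>>> static void destroyTree(FakeDIE &D) {
>>>>   for (FakeDIE *Child : D.Children)
>>>>     destroyTree(*Child);
>>>>   // Run the destructor explicitly; the node's memory itself is
>>>>   // reclaimed in bulk when the BumpPtrAllocator goes away.
>>>>   D.~FakeDIE();
>>>> }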
>>>
>>> I’d need to try it to be sure, but I have the feeling it would be expensive.
>>> I seem to remember something about the average size of (encoded) DIEs
>>> being 17 bytes (I haven’t checked that fact, but the order of magnitude
>>> feels right). So for 700MB of encoded debug information that would
>>> represent 43 million DIEs. That’s a huge tree to walk.
>>>
>>> It might, however, be possible to track these separately. Instead of
>>> the assert, introduce some bookkeeping code that would remember the
>>> DIEs that need their attributes freed. This is ugly in the sense that it needs
>>> access to the bookkeeping structures from the addAttribute method.
>>> Maybe the ugliness can be alleviated by having all the DIE operations go
>>> through some new DIEAllocator object. Instead of
>>> DIE->addValue(dwarf::DW_AT_something, Val);
>>> you’d need to write
>>> DIEAlloc.addValue(DIE, dwarf::DW_AT_something, Val);
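>>>
>>> Roughly, such a DIEAllocator could look like this (all names are
>>> hypothetical, not the real DIE interface): it owns the bump allocator
>>> and only remembers the nodes whose attribute storage spilled past the
>>> inline capacity, so teardown visits a short list instead of the whole
>>> tree.
>>>
>>> #include "llvm/ADT/SmallVector.h"
>>> #include "llvm/Support/Allocator.h"
>>> #include <vector>
>>>
>>> struct FakeDIE {
>>>   llvm::SmallVector<int, 8> Values;   // stand-in for attribute/value storage
>>> };
>>>
>>> class DIEAllocator {
>>>   llvm::BumpPtrAllocator Alloc;
>>>   std::vector<FakeDIE *> SpilledDIEs; // DIEs that outgrew the inline buffer
>>>
>>> public:
>>>   FakeDIE *createDIE() { return new (Alloc.Allocate<FakeDIE>()) FakeDIE(); }
>>>
>>>   void addValue(FakeDIE &D, int Val) {
>>>     bool WasInline = D.Values.capacity() <= 8; // still in the inline buffer?
>>>     D.Values.push_back(Val);
>>>     if (WasInline && D.Values.capacity() > 8)  // just spilled to the heap
>>>       SpilledDIEs.push_back(&D);
>>>   }
>>>
>>>   ~DIEAllocator() {
>>>     // Only the spilled DIEs own heap memory; everything else is freed
>>>     // in bulk when Alloc's slabs are released.
>>>     for (FakeDIE *D : SpilledDIEs)
>>>       D->~FakeDIE();
>>>   }
>>> };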
>>>
>>> Thoughts?
>>
>> This seems fine to me.
>>
>> Might be worth trying the walking approach (just in case your intuition
>> is off) since it seems simpler, but tracking it directly seems fine too.
>>
>> I have another idea, but first: there's no `removeValue()` after
>> `addValue()`, right? (It's a one-way operation?)
>
> AFAIK this is true.
>
>> If I'm right, and assuming you don't need random access or anything (you
>> just need to walk through the values), you could eschew SmallVector
>> entirely, and instead have a linked list of 8-element (or whatever)
>> allocations, all tracked with the same BumpPtrAllocator.
>
> Or maybe just use BumpPtrAllocated ilist nodes like I do for the DIEs
> themselves.
>
I was trying to figure out if an intrusive list was valid here. It
looked like some of the `DIEValue`s are reused across different `DIE`s,
but maybe I read the code wrong?
> Thanks,
> Fred
>
>> Something like this:
>>
>> class DIEValueSequence {
>>   static constexpr unsigned NumPerAlloc = 7;
>>   struct List {
>>     List *Next = nullptr;
>>     DIEValue *Vals[NumPerAlloc] = {nullptr};
>>   };
>>   List Head;
>>   unsigned Size = 0;
>>   unsigned TailSize = 0;
>>   List *Tail = &Head;
>>
>> public:
>>   void push(BumpPtrAllocator &Alloc, DIEValue *Val) {
>>     if (TailSize == NumPerAlloc) {
>>       // Tail block is full; chain a fresh one off the bump allocator.
>>       Tail->Next = new (Alloc.Allocate<List>()) List();
>>       Tail = Tail->Next;
>>       TailSize = 0;
>>     }
>>     Tail->Vals[TailSize++] = Val;
>>     ++Size;
>>   }
>>
>>   struct iterator {
>>     List *L;
>>     unsigned I;
>>
>>     void operator++() {
>>       if (++I < NumPerAlloc)
>>         return;
>>       L = L->Next;
>>       I = 0;
>>     }
>>     DIEValue *operator*() const { return L->Vals[I]; }
>>     bool operator==(const iterator &X) const {
>>       return L == X.L && I == X.I;
>>     }
>>     bool operator!=(const iterator &X) const { return !(*this == X); }
>>   };
>>
>>   iterator begin() { return iterator{&Head, 0}; }
>>   iterator end() {
>>     return TailSize == NumPerAlloc ? iterator{nullptr, 0}
>>                                    : iterator{Tail, TailSize};
>>   }
>> };
>>
>> ^ would get you what you need without too much code. You could
>> probably merge the attributes into the same sequence as the
>> values so that there's no need to template the data structure or
>> use zip-iterators or whatever.
>> But I don't know, maybe you need random access?
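>>
>> For example (hypothetical, building on the sketch above), a combined
>> entry would let one sequence carry everything:
>>
>> #include <cstdint>
>>
>> class DIEValue;       // as above
>>
>> struct DIEEntry {
>>   uint16_t Attribute; // dwarf::Attribute
>>   uint16_t Form;      // dwarf::Form
>>   DIEValue *Value;
>> };
>> // ... and DIEValueSequence would store DIEEntry instead of DIEValue *.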
>>
>> David, do you have any better ideas?
>>
>>>
>>> Fred
>>>
>>>
>>>>>
>>>>> Imposing this limit on users seems like too strong a constraint,
>>>>> especially considering that not all frontends are in-tree. I can
>>>>> imagine more flexible solutions involving complicated allocation
>>>>> strategies, but I wanted to ask for opinions before I work on
>>>>> something that I fear would end up much more complicated than this
>>>>> patch. Maybe there's an easier solution that I haven't seen; I'm
>>>>> happy to hear suggestions.
>>>>>
>>>>> BTW, the runtime difference is definitely measurable on a clang
>>>>> debug build as well, although the seconds gained don't represent
>>>>> 40% of the runtime as they do in dsymutil's case. On my machine, a clang
>>>>> debug build goes from:
>>>>>
>>>>> real 14m38.308s
>>>>> user 45m26.856s
>>>>> sys 3m53.519s
>>>>>
>>>>> without the patch, to:
>>>>>
>>>>> real 14m19.762s
>>>>> user 44m14.438s
>>>>> sys 3m45.520s
>>>>>
>>>>> That shaves more than a minute of user time and 15 to 20 seconds
>>>>> off the wall-clock time.
>>>>>
>>>>> http://reviews.llvm.org/D7072
>>>>>
>>>>> Files:
>>>>> include/llvm/CodeGen/DIE.h
>>>>> lib/CodeGen/AsmPrinter/DIE.cpp
>>>>> lib/CodeGen/AsmPrinter/DIEHash.cpp
>>>>> lib/CodeGen/AsmPrinter/DwarfCompileUnit.cpp
>>>>> lib/CodeGen/AsmPrinter/DwarfCompileUnit.h
>>>>> lib/CodeGen/AsmPrinter/DwarfDebug.cpp
>>>>> lib/CodeGen/AsmPrinter/DwarfFile.cpp
>>>>> lib/CodeGen/AsmPrinter/DwarfUnit.cpp
>>>>> lib/CodeGen/AsmPrinter/DwarfUnit.h
>>>>> unittests/CodeGen/DIEHashTest.cpp
>>>>>
>>>>> <D7072.18447.patch>