[LLVMdev] Testcases where GVN uses too much memory?
Owen Anderson
resistor at mac.com
Sat May 3 13:22:32 PDT 2014
On May 3, 2014, at 7:59 AM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
> On 2014 May 2, at 21:22, Owen Anderson <resistor at mac.com> wrote:
>
>>> Each element in `LeaderTable` was an intrusively linked list of
>>> `LeaderTableEntry`, with each element of the list costing 24B (assuming
>>> 64-bit pointers).
>>>
>>> - Remove the link from `LeaderTableEntry`, slimming it down to 16B.
>>>
>>> - Replace the linked list with a hand-rolled vector, with size and
>>> capacity each represented using `uint32_t`.
>>>
>>> If the list has a single entry, it still costs 24B and does not require
>>> a dereference. If it has more than one element, it mallocs an array and
>>> costs 24B plus 16B per entry.
>>>
>>> Since the entries will be next to each other in memory, we get a
>>> locality savings here.
>>
>> I don’t follow the logic here. My recollection is that the common case is a single element in the list, for which the data structure is already optimized: the head element (not a pointer, but the actual element) is stored in-line with the LeaderTable itself, requiring no additional allocation at all.
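(For reference, the structure under discussion looks roughly like this today; simplified from GVN.cpp, so details may be off:)

  // Current layout, simplified: the head entry lives in-line in the
  // DenseMap value; additional leaders are chained through separately
  // allocated nodes, 24B each on a 64-bit target.
  struct LeaderTableEntry {
    Value *Val;              // 8B
    const BasicBlock *BB;    // 8B
    LeaderTableEntry *Next;  // 8B
  };
  DenseMap<uint32_t, LeaderTableEntry> LeaderTable;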
>
> Right, it's already optimized for the common case. My goal was to leave
> the common case about the same, but improve the rare case. When GVN
> "blows up", is it just a really big common case, or are entries in
> `LeaderTable` with 3+ elements dominating the memory footprint?
>
> After this patch:
>
> - The memory footprint of the common case is unchanged: if there's
> only one element, then it's still 24B stored in-line with no
> additional allocation. However, runtime speed is probably slower
> because of the branch to check where the storage is. (How much
> slower?)
>
> - In the rare two element case, there's one allocation of 32B instead
> of one allocation of 24B. Depending on the malloc, this might be
> equivalent, or might be more expensive. Runtime speed to visit them
> should be dominated by the single dereference in both cases.
>
> - In the rare 3+ element case, there's 1 x N*16B allocation instead of
> (N-1) x 24B allocations. This is where the patch might add value
> (both memory footprint and runtime speed).
>
> But it's speculative enough that it needs supporting data.
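To make the alternative concrete, one plausible shape for what Duncan describes above is the following; the names here are invented for illustration and are not from an actual patch:

  // Hypothetical replacement: drop the Next link (16B per entry) and keep
  // a uint32_t size/capacity pair in the in-line 24B slot.
  struct LeaderEntry {         // 16B
    Value *Val;
    const BasicBlock *BB;
  };
  struct LeaderList {          // 24B, stored in-line in the DenseMap value
    union {
      LeaderEntry Single;      // Count == 1: no allocation, no dereference
      LeaderEntry *Entries;    // Count >= 2: one malloc'd array, 16B/entry
    };
    uint32_t Count;
    uint32_t Capacity;
  };
  DenseMap<uint32_t, LeaderList> LeaderTable;

That reproduces the arithmetic above: 24B in-line for a single leader, and 24B plus N x 16B in one contiguous allocation otherwise.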
I agree that it needs supporting data. It seems like it has the potential to be worthwhile, but at the very least we need a histogram of the frequencies of the various list lengths. Ideally, we want to measure both the memory footprint and the runtime impact of this change.
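Something along these lines, dumped at the end of a GVN run, should be enough to collect that histogram (hypothetical instrumentation, written against the current intrusive-list layout):

  // Count how many leader lists have each length. Needs <map> and
  // llvm/Support/raw_ostream.h; walks the existing LeaderTable.
  std::map<unsigned, unsigned> LengthHistogram;
  for (const auto &KV : LeaderTable) {
    if (!KV.second.Val)
      continue;               // skip empty slots
    unsigned Len = 0;
    for (const LeaderTableEntry *E = &KV.second; E; E = E->Next)
      ++Len;
    ++LengthHistogram[Len];
  }
  for (const auto &Bucket : LengthHistogram)
    errs() << "length " << Bucket.first << ": " << Bucket.second << "\n";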
—Owen