Attempts at speeding up StringTableBuilder

Rafael Avila de Espindola via llvm-commits llvm-commits at lists.llvm.org
Sun Oct 16 07:03:10 PDT 2016


Thanks. For 1, I need to pass a size_t*, no?

I did try sorting by hash. It avoided the misses in the hash insertion, but was not faster overall.

I don't expect any temporal locality.

I might give the large page idea a try.

Cheers,
Rafael

On October 16, 2016 1:06:55 AM EDT, Sean Silva <chisophugis at gmail.com> wrote:
>3 suggestions:
>
>1. If this is constantly stalling on dcache misses (which it sounds
>like it is), try making the methods able to operate on many strings at
>once; then, for the critical cache-missing memory accesses, issue them
>all very near each other in the instruction stream so that the CPU can
>pipeline all the memory requests[1]. E.g. Haswell's L2 can keep 16
>simultaneous outstanding misses.
>
>I suspect that the hardware page table walker can't do multiple
>parallel requests, so this may not help if you are bottlenecked on
>that. E.g. Haswell's L1 DTLB can hold 64 4K pages, so about 256KB of
>working set. The numbers you quoted above for the number of unique
>strings in the firefox test case suggest about a 16MB table size,
>assuming 8 bytes for each entry (which is conservative; the keys are
>larger), so this may be an issue, considering the highly random nature
>of hash lookups. If that is a problem, try allocating the hash table in
>2MB pages if it isn't already (a sketch of such an allocation appears
>after these suggestions); Haswell's L1 DTLB can hold 32 of those, which
>is plenty of working set for the table in this case. Even the 1024
>entries in the L2 DTLB will only be enough to cover 4MB with 4K pages
>(and this is optimistic, as it assumes no way conflicts).
>
>2. If there is temporal locality in the string lookups, try putting a
>small, bounded-size "cache" in front of the larger table (e.g. even a
>two-element cache for the offset lookup in
>https://reviews.llvm.org/D20645#440638 made LLD -O1 8% faster); a
>sketch appears below. For StringTableBuilder, a small hash table may be
>enough. This may also be beneficial for avoiding cost from duplicates
>in the "pass in a size_t* to add" approach.
>
>3. For the "pass in a size_t* to add" approach, did you try
>parallelizing the final merge step? Parallelizing (or in general,
>optimizing) a single large final operation should be much easier than
>trying to get parallelism in other ways from this code. Applying 1.
>will probably be easiest in the context of a large final merge
>operation as well. To mitigate TLB costs (if that is a problem), doing
>a pseudo-sort (e.g. a quicksort with a bounded recursion depth) of the
>hash values mod the table size may be effective, especially if this can
>be leveraged to approximately partition the hash table bucket array
>across cores (each of N cores then only has TableSize/N bytes of
>working set in the table, which should give a superlinear speedup from
>parallelization due to improved cache utilization). The
>equidistribution of hash values has some nice properties that could be
>leveraged for fast approximate partitioning / pseudo-sorting (at the
>very least, it makes pivot selection easy: just HASH_T_MAX/2). In fact,
>you may not even need to do any pseudo-sorting. Just assign each thread
>part of the [0, TableSize) space and have all threads walk the entire
>array of values to be inserted, skipping any that don't fall into their
>partition of the table space (also sketched below).
>
>
>[1] Something like this might be useful (it is critical that as many of
>the potentially cache-missing operations fit into the reorder buffer as
>possible, so that they can be simultaneously pending):
>
>template <typename T, int N>
>__attribute__((noinline))
>void loadFromPointers(T **PtrArr, T *__restrict__ OutVals) {
>  // The compiler should hopefully fully unroll these loops and fully
>  // SROA all the fixed-size arrays.
>  T *Ptrs[N];
>  for (int i = 0; i < N; i++)
>    Ptrs[i] = PtrArr[i];
>  // If necessary, to prevent the compiler from moving instructions
>  // into the critical loop:
>  // asm("");
>
>  T Vals[N];
>  // This is the critical part. As many loads as possible must fit into
>  // the reorder buffer. This should hopefully compile into a bunch of
>  // sequential load instructions with nothing in between.
>  for (int i = 0; i < N; i++)
>    Vals[i] = *Ptrs[i];
>
>  // If necessary, to prevent the compiler from moving instructions
>  // into the critical loop:
>  // asm("");
>  for (int i = 0; i < N; i++)
>    OutVals[i] = Vals[i];
>}
>
>-- Sean Silva
>
>
>
>On Fri, Oct 14, 2016 at 11:23 AM, Rafael Espíndola via llvm-commits <
>llvm-commits at lists.llvm.org> wrote:
>
>> I have put some effort into trying to speed up StringTableBuilder.
>> The last thing that worked was committed as r284249.
>>
>> The main difficulty in optimizing it is the large number of strings
>> it has to handle. In the case of xul, one of the string tables has
>> 14_375_801 strings added, out of which only 1_726_762 are unique.
>>
>> The things I tried:
>>
>> * Instead of returning size_t from add, pass in a size_t* to add.
>> This allows us to remember all StringRefs and only do the merging in
>> finalize (a sketch of the interface appears after this list). This
>> would be extra helpful for the -O2 case by not needing an extra hash
>> lookup. The idea for -O1 was to avoid hash resizing and improve the
>> cache hit rate by calling StringIndexMap.reserve. Unfortunately,
>> given how many strings are duplicated, this is not profitable.
>>
>> * Using a DenseSet with just an unsigned in it and a side std::vector
>> with the rest of the information (sketched after this list). The idea
>> is that the vector doesn't contain empty keys, so the set should be
>> denser. This reduced the cache misses when accessing the set, but the
>> misses on the extra vector compensated for it.
>>
>> * Creating the string buffer incrementally in add. The idea is that
>> then we don't need to store the pointer to the string: we can find
>> out what the string is with just the offset. This was probably the
>> most promising. It reduced the total number of cache misses reported
>> by perf stat, but the overall time didn't improve significantly. This
>> also makes -O2 substantially harder to implement. I have attached the
>> patch that implements this (note that -O2 is not implemented).
>>
>> * Not merging constants, to see if special casing them would make a
>> difference. No test sped up by even 1%.
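>>
>> A sketch of what the size_t* interface might look like (hypothetical
>> names; PendingStrings and getOrCreateOffset are made up for
>> illustration, not taken from the actual patch):
>>
>> #include "llvm/ADT/StringRef.h"
>> #include <cstddef>
>> #include <vector>
>>
>> size_t getOrCreateOffset(llvm::StringRef S); // Hypothetical dedupe step.
>>
>> struct PendingString { llvm::StringRef Str; size_t *OffsetOut; };
>> std::vector<PendingString> PendingStrings;
>>
>> // add() only records the request; no hash lookup happens here.
>> void add(llvm::StringRef S, size_t *OffsetOut) {
>>   PendingStrings.push_back({S, OffsetOut});
>> }
>>
>> // finalize() dedupes everything in one pass and patches the recorded
>> // offsets, so each string is hashed exactly once.
>> void finalize() {
>>   for (PendingString &P : PendingStrings)
>>     *P.OffsetOut = getOrCreateOffset(P.Str);
>> }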
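>>
>> And roughly what the DenseSet-plus-side-vector layout means (again a
>> hypothetical sketch; it uses std::unordered_set with stateful functors
>> for brevity, where the actual experiment used DenseSet):
>>
>> #include "llvm/ADT/Hashing.h"
>> #include "llvm/ADT/StringRef.h"
>> #include <cstddef>
>> #include <unordered_set>
>> #include <vector>
>>
>> // The set holds only 4-byte indices; strings and offsets live in a
>> // dense side vector, keeping the set's bucket array small.
>> struct SideEntry {
>>   llvm::StringRef Str;
>>   size_t Offset;
>> };
>>
>> struct ByIndex {
>>   const std::vector<SideEntry> *Side;
>>   size_t operator()(unsigned I) const {
>>     return llvm::hash_value((*Side)[I].Str);
>>   }
>>   bool operator()(unsigned A, unsigned B) const {
>>     return (*Side)[A].Str == (*Side)[B].Str;
>>   }
>> };
>>
>> // To insert: push the new entry onto Side, try inserting its index,
>> // and pop the entry back off if the string was already present.
>> std::vector<SideEntry> Side;
>> std::unordered_set<unsigned, ByIndex, ByIndex>
>>     Set(/*BucketCount=*/16, ByIndex{&Side}, ByIndex{&Side});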
>>
>> In summary, it seems the string merging at -O1 is as fast as it gets,
>> short of someone knowing a crazy algorithm. At -O2 it should be
>> possible to avoid the second hash lookup by passing a size_t* to add,
>> but it is not clear whether that would be worth the code complexity.
>>
>> Cheers,
>> Rafael
>>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

