Attempts at speeding up StringTableBuilder

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Sat Oct 15 22:06:55 PDT 2016


3 suggestions:

1. If this is constantly stalling on dcache misses (which it sounds like
it is), try making the methods able to operate on many strings at once;
then, for the critical cache-missing memory accesses, issue them all very
near each other in the instruction stream so that the CPU can pipeline
all the memory requests[1]. E.g. Haswell's L2 can keep 16 outstanding
misses in flight simultaneously.

I suspect that the hardware page table walker can't do multiple parallel
requests, so this may not help if you are bottlenecked on that. E.g.
Haswell's L1 DTLB can hold 64 4K pages, so about 256KB of working set. The
numbers you quoted above for the number of unique strings in the firefox
test case suggest about a 16MB table size, assuming 8 bytes per entry
(which is conservative; the keys are larger), so this may be an issue,
considering the highly random nature of hash lookups. If that is a
problem, try allocating the hash table in 2MB pages if it isn't already;
Haswell's L1 DTLB can hold 32 of those, which is plenty of working set for
the table in this case. Even the 1024 entries in the L2 DTLB will only be
enough to cover 4MB with 4K pages (and this is optimistic, as it assumes
no way conflicts).
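For the 2MB-page idea, a minimal Linux-only sketch of what the allocation
could look like (the function name and rounding logic are my own;
madvise(MADV_HUGEPAGE) is only a hint to the kernel, so this degrades
gracefully where transparent huge pages are unavailable):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve Size bytes (rounded up to a 2MB multiple) with mmap and hint
// the kernel to back the range with transparent huge pages.
static void *allocHugeBacked(std::size_t Size) {
  constexpr std::size_t HugePage = 2 * 1024 * 1024;
  std::size_t Rounded = (Size + HugePage - 1) & ~(HugePage - 1);
  void *Mem = mmap(nullptr, Rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (Mem == MAP_FAILED)
    return nullptr;
#ifdef MADV_HUGEPAGE
  // Hint only; ignore failure. For reliable promotion the range should
  // also be 2MB-aligned; production code would over-allocate and align.
  madvise(Mem, Rounded, MADV_HUGEPAGE);
#endif
  return Mem;
}
```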

2. If there is temporal locality in the string lookups, try putting a
small, bounded-size "cache" in front of the larger table. (e.g. even a
two-element cache for the offset lookup in
https://reviews.llvm.org/D20645#440638 made LLD -O1 8% faster). For
StringTableBuilder, a small hash table may be enough. This may also be
beneficial for avoiding cost from duplicates in the "pass in a size_t* to
add" approach.
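As an illustration of the kind of tiny front cache meant here (all names
hypothetical, with std::unordered_map as a stand-in for the real table;
a cache hit skips the hash computation and probe entirely, at the cost of
at most two string compares):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// A 2-entry cache in front of a larger string -> offset map, with
// simple alternating (FIFO) replacement on a miss.
class CachedOffsetLookup {
  const std::string *Keys[2] = {nullptr, nullptr};
  std::size_t Offsets[2] = {0, 0};
  int Next = 0;
  std::unordered_map<std::string, std::size_t> &Table;

public:
  explicit CachedOffsetLookup(
      std::unordered_map<std::string, std::size_t> &T)
      : Table(T) {}

  std::size_t lookup(const std::string &S) {
    for (int i = 0; i < 2; i++)
      if (Keys[i] && *Keys[i] == S)
        return Offsets[i]; // cache hit: no hash lookup
    auto It = Table.find(S); // full lookup on miss
    std::size_t Off = It == Table.end() ? ~std::size_t(0) : It->second;
    // Key storage in unordered_map is node-based, so the pointer to the
    // key stays valid across later insertions.
    Keys[Next] = It == Table.end() ? nullptr : &It->first;
    Offsets[Next] = Off;
    Next ^= 1;
    return Off;
  }
};
```

In the D20645 case the cache could be keyed on the incoming pointer
itself, making a hit a pure pointer compare.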

3. For the "pass in a size_t* to add" approach, did you try parallelizing
the final merge step? Parallelizing (or in general optimizing) a single
large final operation should be much easier than trying to get parallelism
in other ways from this code. Applying 1. will probably be easiest in the
context of a large final merge operation as well. To mitigate TLB costs (if
that is a problem), doing a pseudo-sort (e.g. a quicksort with a bounded
recursion) of the hash values mod the table size may be effective,
especially if this can be leveraged to approximately partition the hash
table bucket array across cores (each of N cores then only has TableSize/N
bytes of working set in the table, which should give a superlinear speedup
from parallelization due to improved cache utilization). The
equidistribution of hash values has some nice properties that could be
leveraged for fast approximate partitioning / pseudo-sorting (at the very
least, it makes pivot selection easy, just HASH_T_MAX/2). In fact, you may
not even need to do any pseudo-sorting. Just assign each thread part of the
[0, TableSize) space and have all threads walk the entire array of values
to be inserted, but skip any that don't fall into its partition of the
table space.
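The no-pseudo-sort variant at the end could look roughly like this (a
sketch with made-up names, counting bucket occupancy rather than doing
real insertion, just to show the partitioning scheme; each thread owns
its slice of the table exclusively, so no locks or atomics are needed):

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Each of NumThreads threads scans the whole Hashes array but only
// touches buckets in its own slice of [0, TableSize), so its working
// set in the table is only TableSize/NumThreads entries.
void partitionedBucketCounts(const std::vector<uint64_t> &Hashes,
                             std::vector<uint32_t> &Counts,
                             unsigned NumThreads) {
  std::size_t TableSize = Counts.size();
  std::vector<std::thread> Workers;
  for (unsigned T = 0; T < NumThreads; T++) {
    std::size_t Lo = TableSize * T / NumThreads;
    std::size_t Hi = TableSize * (T + 1) / NumThreads;
    Workers.emplace_back([&, Lo, Hi] {
      for (uint64_t H : Hashes) {
        std::size_t B = H % TableSize;
        if (B >= Lo && B < Hi) // skip values outside our partition
          Counts[B]++;         // exclusive ownership: no atomics needed
      }
    });
  }
  for (auto &W : Workers)
    W.join();
}
```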


[1] Something like this might be useful (it is critical that as many of the
potentially cache-missing operations fit into the reorder buffer as
possible, so that they can be simultaneously pending):

template <typename T, int N>
__attribute__((noinline))
void loadFromPointers(T **PtrArr, T *__restrict__ OutVals) {
  // The compiler should hopefully fully unroll these loops and fully
  // SROA all the fixed-size arrays.

  T *Ptrs[N];
  for (int i = 0; i < N; i++)
    Ptrs[i] = PtrArr[i];
  // If necessary, to prevent the compiler from moving instructions into
  // the critical loop:
  //asm("");

  T Vals[N];
  // This is the critical part. As many loads as possible must fit into
  // the reorder buffer. This should hopefully compile into a bunch of
  // sequential load instructions with nothing in between.
  for (int i = 0; i < N; i++)
    Vals[i] = *Ptrs[i];

  // If necessary, to prevent the compiler from moving instructions into
  // the critical loop:
  //asm("");
  for (int i = 0; i < N; i++)
    OutVals[i] = Vals[i];
}
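A hypothetical caller might gather a batch of bucket pointers first and
then issue all the loads together; the template is repeated here so the
snippet stands alone (the driver and its names are illustrative only):

```cpp
#include <cstddef>

// Same shape as the template above, repeated for self-containment.
template <typename T, int N>
__attribute__((noinline))
void loadFromPointers(T **PtrArr, T *__restrict__ OutVals) {
  T *Ptrs[N];
  for (int i = 0; i < N; i++)
    Ptrs[i] = PtrArr[i];
  T Vals[N];
  for (int i = 0; i < N; i++)
    Vals[i] = *Ptrs[i]; // the back-to-back, possibly-missing loads
  for (int i = 0; i < N; i++)
    OutVals[i] = Vals[i];
}

// Hypothetical batched consumer: load 16 bucket-head values at once,
// letting the cache misses overlap, then process them.
int sumBucketHeads(int **Buckets) {
  int Heads[16];
  loadFromPointers<int, 16>(Buckets, Heads);
  int Sum = 0;
  for (int H : Heads)
    Sum += H;
  return Sum;
}
```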

-- Sean Silva



On Fri, Oct 14, 2016 at 11:23 AM, Rafael Espíndola via llvm-commits <
llvm-commits at lists.llvm.org> wrote:

> I have put some effort to try to speed up StringTableBuilder. The last
> thing that worked was committed as r284249.
>
> The main difficulty in optimizing it is the large number of strings it
> has to handle. In the case of xul, one of the string tables has
> 14_375_801 strings added, out of which only 1_726_762 are unique.
>
> The things I tried:
>
> * Instead of returning size_t from add, pass in a size_t* to add. This
> allows us to remember all StringRefs and only do the merging in
> finalize. This would be extra helpful for the -O2 case by not needing
> an extra hash lookup. The idea for -O1 was to avoid hash resizing and
> improve the cache hit rate by calling StringIndexMap.reserve.
> Unfortunately, given how many strings are duplicated, this is not
> profitable.
>
> * Using a DenseSet with just an unsigned in it and a side std::vector
> with the rest of the information. The idea is that the vector doesn't
> contain empty keys, so it should be denser. This reduced the cache
> misses when accessing the set, but the extra misses from the vector
> cancelled out the gain.
>
> * Creating the string buffer incrementally in add. The idea is that
> then we don't need to store the pointer to the string. We can find out
> what the string is with just the offset. This was probably the most
> promising. It reduced the total number of cache misses reported by
> perf stat, but the overall time didn't improve significantly. This
> also makes -O2 substantially harder to implement. I have attached the
> patch that implements this (note that -O2 is not implemented).
>
> * Not merging constants, to see if special-casing them would make a
> difference. No test sped up by even 1%.
>
> In summary, it seems the string merging at -O1 is as fast as it gets
> short of someone knowing a crazy algorithm. At -O2 it should be
> possible to avoid the second hash lookup by passing a size_t* to add,
> but it is not clear if that would be worth the code complexity.
>
> Cheers,
> Rafael
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>
>

