Attempts at speeding up StringTableBuilder

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Sun Oct 16 16:52:55 PDT 2016


On Sun, Oct 16, 2016 at 7:03 AM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:

> Thanks. For 1 I need to pass a size_t*, no?
>

In theory, clients could pass in multiple keys simultaneously, but I
suspect that it is better to encapsulate the batching of insertions inside
StringTableBuilder. Passing in a size_t* makes that possible. (Note,
however, that when writing the offsets back out through the size_t*'s, the
access pattern is likely to be fairly random, so you probably also want to
make sure that the cache misses for all those stores get pipelined too; the
same goes for memcpying all the strings into the output, though that will
likely be quite TLB-bottlenecked.)
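
As a rough sketch of the shape of that interface (the class and member
names here are hypothetical, not the existing StringTableBuilder API; real
code would keep StringRefs rather than copying strings, and would still need
to handle the -O2 case):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch only -- not the existing StringTableBuilder interface.
// Callers hand in a location to write the final offset to; the builder
// batches all insertions and resolves every offset at once in finalize().
class BatchedStringTableSketch {
  std::vector<std::pair<std::string, size_t *>> Pending;

public:
  // Record the string and where its final offset should eventually be stored.
  void add(std::string S, size_t *OffsetOut) {
    Pending.emplace_back(std::move(S), OffsetOut);
  }

  // Unique all pending strings in one pass, then write every offset back.
  // The write-back stores land at fairly random addresses, so doing them in
  // one tight loop lets the CPU overlap the resulting cache misses.
  void finalize() {
    std::unordered_map<std::string, size_t> Offsets;
    size_t CurOffset = 1; // keep offset 0 for the leading NUL / empty string
    for (auto &P : Pending) {
      auto Ins = Offsets.insert({P.first, CurOffset});
      if (Ins.second)
        CurOffset += P.first.size() + 1; // string data + NUL terminator
      *P.second = Ins.first->second;
    }
  }
};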


>
> I did try sorting by hash. It avoided the misses in the hash insertion,
> but was not faster overall.
>

Well, if you have already sorted by hash, then you don't need to insert
into the hash table since all values with the same hash are already
adjacent.
I think that one useful way to look at the problem is that there is a
spectrum of options between:
1. Inserting everything into a hash table to do the uniquing.
2. Doing a full sort + unique

Hash table insertion has bad spatial access patterns. Sorting has very good
spatial access patterns, but does an extra log(N) factor of work overall
(for some definition of work that doesn't take spatial locality into
account). The bad spatial access patterns of the hash table, however, are
not too problematic if the working set fits in the various data caches and
DTLBs.
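
(For reference, the pure option-2 end of the spectrum is just something
like the following sketch, assuming each pending element carries a
precomputed hash and its string; once sorted, duplicates are adjacent and
no hash table is needed to unique them.)

#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Sketch of the "full sort + unique" end of the spectrum. A real
// implementation would also track where each string's offset must go.
struct Entry {
  uint64_t Hash;
  std::string Str;
};

void sortAndUnique(std::vector<Entry> &Entries) {
  // Sorting by (Hash, Str) has good spatial locality but costs an extra
  // log(N) factor compared to a single hash-table pass.
  std::sort(Entries.begin(), Entries.end(),
            [](const Entry &A, const Entry &B) {
              return std::tie(A.Hash, A.Str) < std::tie(B.Hash, B.Str);
            });
  // After sorting, duplicates are adjacent, so uniquing is a linear scan.
  Entries.erase(std::unique(Entries.begin(), Entries.end(),
                            [](const Entry &A, const Entry &B) {
                              return A.Hash == B.Hash && A.Str == B.Str;
                            }),
                Entries.end());
}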

So the net result is that for large enough tables (and I think that the
firefox one you showed is definitely large enough), the best approach will
probably do some amount of sorting to increase locality for the hash table
insertion. There's at least an order of magnitude difference in cost
between:

- random access into a table whose working set is bigger than even the L3
cache and the DTLB reach (assuming 4K pages). Each memory access has a TLB
miss with decent probability, requiring a page table walk (which may itself
miss in cache), then, serially dependent on that, a DRAM lookup for the
data itself. We are easily talking many hundreds of cycles. This is also
not very parallelizable: even a Xeon only has about 4 memory controllers
(and the L3 is partitioned on a ring bus which can get congested), so you
can't have too many cores pounding on that with a bunch of random access
requests.

- random access into each core's L2. IIRC this is only about 10-15 cycles of
latency on Haswell, and the size means that if you are constantly hitting
in L2 then you are hitting in the L1 DTLB just fine. And each core's L2 is
private (although shared across hyperthreads). The net result is that: a)
you can scale to as many cores as you want and there won't be contention on
the L3 or memory controllers (or with other processes running during the
build!); b) with appropriate ILP exposing the parallelism of the load
requests to the processor, you can get close to 1 load per cycle of
throughput.

So doing some amount of partitioning to increase locality can pay off
hugely. If you still have the code around that did the full sort by hash,
one quick experiment might be to replace the sort call with:
- no sort
- partition by Hash % TableSize around TableSize/2
- partition by Hash % TableSize around TableSize/2, then recurse another
level with partition values TableSize/4 and 3*(TableSize/4)
- ... other levels maybe ...
- full sort

I.e., using something like the following and trying NumLevelsToPartition =
0, 1, 2 for starters.

#include <algorithm>
#include <cassert>
#include <cstddef>

// T is assumed to have a `.Hash` member.
template <typename T>
void bounded_recursion_partition_by_hash_impl(T *B, T *E, size_t TableSize,
                                              size_t PartitionAround,
                                              size_t HalfWidth,
                                              int NumLevelsToPartition) {
  if (NumLevelsToPartition == 0)
    return;
  // Since the actual hash table lookup masks off the high bits, just doing
  // e.g. `Elem.Hash < HASH_T_MAX/2` at the first level doesn't actually
  // increase locality.
  T *Mid = std::partition(B, E, [=](const T &Elem) {
    return (Elem.Hash & (TableSize - 1)) < PartitionAround;
  });
  // Recurse on each half, splitting each around the midpoint of its range
  // of bucket indices (TableSize/4 and 3*(TableSize/4) at the second level).
  bounded_recursion_partition_by_hash_impl(B, Mid, TableSize,
                                           PartitionAround - HalfWidth / 2,
                                           HalfWidth / 2,
                                           NumLevelsToPartition - 1);
  bounded_recursion_partition_by_hash_impl(Mid, E, TableSize,
                                           PartitionAround + HalfWidth / 2,
                                           HalfWidth / 2,
                                           NumLevelsToPartition - 1);
}

// T is assumed to have a `.Hash` member.
template <typename T>
void bounded_recursion_partition_by_hash(T *B, T *E, size_t TableSize,
                                         int NumLevelsToPartition) {
  // TableSize must be a power of 2 so that `Hash & (TableSize - 1)` is the
  // bucket index (i.e. Hash % TableSize).
  assert((TableSize & (TableSize - 1)) == 0 &&
         "TableSize is not a power of 2!");
  bounded_recursion_partition_by_hash_impl(B, E, TableSize, TableSize / 2,
                                           TableSize / 2,
                                           NumLevelsToPartition);
}
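
(A hypothetical driver, just to show the intended call shape, reusing the
Entry struct from the sort sketch above:)

// Hypothetical usage sketch: partition the pending entries into 4 chunks of
// the bucket index space before doing the real hash table insertions.
void insertWithPartitioning(std::vector<Entry> &Entries, size_t TableSize) {
  bounded_recursion_partition_by_hash(Entries.data(),
                                      Entries.data() + Entries.size(),
                                      TableSize,
                                      /*NumLevelsToPartition=*/2);
  // ... insert Entries in order here; consecutive elements now hit nearby
  // buckets, so each chunk's table working set is roughly TableSize/4.
}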





>
> I don't expect any temporal locality.
>

> I might give the large page idea a try.
>

That might be the biggest improvement for the least effort for this code
(and probably for other parts of LLD too).
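
On Linux, one low-effort way to experiment with that is to allocate the
table's backing memory 2MB-aligned and hint the kernel to use transparent
huge pages via madvise. A minimal sketch (Linux-specific; error handling
omitted, and whether 2MB pages are actually used depends on the system's
transparent_hugepage settings):

#include <stdlib.h>
#include <sys/mman.h>

// Sketch: allocate Size bytes aligned to 2MB and ask for transparent huge
// pages for the range. Explicit hugetlbfs / MAP_HUGETLB is another option.
void *allocateHugePageBacked(size_t Size) {
  const size_t HugePageSize = 2 * 1024 * 1024;
  // Round up to a whole number of huge pages.
  Size = (Size + HugePageSize - 1) & ~(HugePageSize - 1);
  void *Ptr = nullptr;
  if (posix_memalign(&Ptr, HugePageSize, Size) != 0)
    return nullptr;
  madvise(Ptr, Size, MADV_HUGEPAGE); // best-effort hint; ignore failure
  return Ptr;
}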

-- Sean Silva

>
> Cheers,
> Rafael
>
>
> On October 16, 2016 1:06:55 AM EDT, Sean Silva <chisophugis at gmail.com>
> wrote:
>>
>> 3 suggestions:
>>
>> 1.  If this is constantly stalling on dcache misses (which it sounds like
>> it is), try making the methods able to operate on many strings at once,
>> then, for the critical cache-missing memory accesses, issue them all very
>> near each other in the instruction stream so that the CPU can pipeline
>> all the memory requests[1]. E.g. Haswell's L2 can keep 16 simultaneous
>> outstanding misses.
>>
>> I suspect that the hardware page table walker can't do multiple parallel
>> requests, so this may not help if you are bottlenecked on that. E.g.
>> Haswell's L1 DTLB can hold 64 4K pages, so about 256KB of working set. The
>> numbers you quoted above for the number of unique strings in the firefox
>> test case suggest a table size of about 16MB, assuming 8 bytes per entry
>> (which is conservative; the keys are larger), so this may be an issue,
>> considering the highly random nature of hash lookups. If that is a problem,
>> try allocating the hash table in 2MB pages if it isn't already; Haswell's
>> L1 DTLB can hold 32 of those which is plenty working set for the table in
>> this case. Even the 1024 entries in the L2 DTLB will only be enough to
>> cover 4MB with 4K pages (this is optimistic as it assumes no way conflicts).
>>
>> 2. If there is temporal locality in the string lookups, try putting a
>> small, bounded-size "cache" in front of the larger table. (e.g. even a
>> two-element cache for the offset lookup in
>> https://reviews.llvm.org/D20645#440638 made LLD -O1 8% faster). For
>> StringTableBuilder, a small
>> hash table may be enough. This may also be beneficial for avoiding cost
>> from duplicates in the "pass in a size_t* to add" approach.
>>
>> 3. For the "pass in a size_t* to add" approach, did you try parallelizing
>> the final merge step? Parallelizing (or in general optimizing) a single
>> large final operation should be much easier than trying to get parallelism
>> in other ways from this code. Applying 1. will probably be easiest in the
>> context of a large final merge operation as well. To mitigate TLB costs (if
>> that is a problem), doing a pseudo-sort (e.g. a quicksort with a bounded
>> recursion) of the hash values mod the table size may be effective,
>> especially if this can be leveraged to approximately partition the hash
>> table bucket array across cores (each of N cores then only has TableSize/N
>> bytes of working set in the table, which should give a superlinear speedup
>> from parallelization due to improved cache utilization). The
>> equidistribution of hash values has some nice properties that could be
>> leveraged for fast approximate partitioning / pseudo-sorting (at the very
>> least, it makes pivot selection easy, just HASH_T_MAX/2). In fact, you may
>> not even need to do any pseudo-sorting. Just assign each thread part of the
>> [0, TableSize) space and have all threads walk the entire array of values
>> to be inserted, but skip any that don't fall into its partition of the
>> table space.
>>
>>
>> [1] Something like this might be useful (it is critical that as many of
>> the potentially cache-missing operations fit into the reorder buffer as
>> possible, so that they can be simultaneously pending):
>>
>> template <typename T, int N>
>> __attribute__((noinline))
>> void loadFromPointers(T **PtrArr, T *__restrict__ OutVals) {
>>   // The compiler should hopefully fully unroll these loops and fully
>>   // SROA all the fixed-size arrays.
>>
>>   T *Ptrs[N];
>>   for (int i = 0; i < N; i++)
>>     Ptrs[i] = PtrArr[i];
>>   // If necessary, to prevent the compiler from moving instructions into
>>   // the critical loop.
>>   //asm("");
>>
>>   T Vals[N];
>>   // This is the critical part. As many loads as possible must fit into
>>   // the reorder buffer. This should hopefully compile into a bunch of
>>   // sequential load instructions with nothing in between.
>>   for (int i = 0; i < N; i++)
>>     Vals[i] = *Ptrs[i];
>>
>>   // If necessary, to prevent the compiler from moving instructions into
>>   // the critical loop.
>>   //asm("");
>>   for (int i = 0; i < N; i++)
>>     OutVals[i] = Vals[i];
>> }
>>
>> -- Sean Silva
>>
>>
>>
>> On Fri, Oct 14, 2016 at 11:23 AM, Rafael EspĂ­ndola via llvm-commits <
>> llvm-commits at lists.llvm.org> wrote:
>>
>>> I have put some effort to try to speed up StringTableBuilder. The last
>>> thing that worked was committed as r284249.
>>>
>>> The main difficulty in optimizing it is the large number of strings it
>>> has to handle. In the case of xul, one of the string tables has
>>> 14_375_801 strings added, out of which only 1_726_762 are unique.
>>>
>>> The things I tried:
>>>
>>> * Instead of returning size_t from add, pass in a size_t* to add. This
>>> allows us to remember all StringRefs and only do the merging in
>>> finalize. This would be extra helpful for the -O2 case by not needing
>>> an extra hash lookup. The idea for -O1 was to avoid hash resizing and
>>> improve the cache hit rate by calling StringIndexMap.reserve. Unfortunately
>>> given how many strings are duplicated this is not profitable.
>>>
>>> * Using a DenseSet with just an unsigned in it and a side std::vector
>>> with the rest of the information. The idea is that the vector doesn't
>>> contain empty keys, so it should be denser. This reduced the cache
>>> misses accessing the set, but the extra vector compensated for it.
>>>
>>> * Creating the string buffer incrementally in add. The idea is that
>>> then we don't need to store the pointer to the string. We can find out
>>> what the string is with just the offset. This was probably the most
>>> promising. It reduced the total number of cache misses reported by
>>> perf stat, but the overall time didn't improve significantly. This
>>> also makes -O2 substantially harder to implement. I have attached the
>>> patch that implements this (note that -O2 is not implemented).
>>>
>>> * Not merging constants to see if special casing them would make a
>>> difference. No test sped up by even 1%.
>>>
>>> In summary, it seems the string merging at -O1 is as fast as it gets
>>> short of someone knowing a crazy algorithm. At -O2 it should be
>>> possible to avoid the second hash lookup by passing a size_t* to add,
>>> but it is not clear if that would be worth the code complexity.
>>>
>>> Cheers,
>>> Rafael
>>>
>>>
>>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>