<div dir="ltr">3 suggestions:<div><br><div>1.  If this is constantly stalling on dcache misses (which it sounds like it is), try making the methods be able to do operation on many strings at once, then for the critical cache-missing memory accesses, issue them all very near each other in the instruction stream so that the CPU can pipeline all the memory requests[1]. E.g. Haswell's L2 can keep 16 simultaneous outstanding misses.</div><div><br></div><div>I suspect that the hardware page table walker can't do multiple parallel requests, so this may not help if you are bottlenecked on that. E.g. Haswell's L1 DTLB can hold 64 4K pages, so about 256KB of working set. The numbers you quoted above for the number of unique strings in the firefox test case suggests about 16MB table size assuming 8 bytes for each entry (which is conservative; the keys are larger), so this may be an issue, considering the highly random nature of hash lookups. If that is a problem, try allocating the hash table in 2MB pages if it isn't already; Haswell's L1 DTLB can hold 32 of those which is plenty working set for the table in this case. Even the 1024 entries in the L2 DTLB will only be enough to cover 4MB with 4K pages (this is optimistic as it assumes no way conflicts).</div><div><br></div><div>2. If there is temporal locality in the string lookups, try putting a small, bounded-size "cache" in front of the larger table. (e.g. even a two-element cache for the offset lookup in <a href="https://reviews.llvm.org/D20645#440638">https://reviews.llvm.org/D20645#440638</a> made LLD -O1 8% faster). For StringTableBuilder, a small hash table may be enough. This may also be beneficial for avoiding cost from duplicates in the "pass in a size_t* to add" approach.</div><div><br></div><div>3. For the "pass in a size_t* to add" approach, did you try parallelizing the final merge step? Parallelizing (or in general optimizing) a single large final operation should be much easier than trying to get parallelism in other ways from this code. Applying 1. will probably be easiest in the context of a large final merge operation as well. To mitigate TLB costs (if that is a problem), doing a pseudo-sort (e.g. a quicksort with a bounded recursion) of the hash values mod the table size may be effective, especially if this can be leveraged to approximately partition the hash table bucket array across cores (each of N cores then only has TableSize/N bytes of working set in the table, which should give a superlinear speedup from parallelization due to improved cache utilization). The equidistribution of hash values has some nice properties that could be leveraged for fast approximate partitioning / pseudo-sorting (at the very least, it makes pivot selection easy, just HASH_T_MAX/2). In fact, you may not even need to do any pseudo-sorting. Just assign each thread part of the [0, TableSize) space and have all threads walk the entire array of values to be inserted, but skip any that don't fall into its partition of the table space.</div><div><br></div><div><br></div><div>[1] Something like this might be useful (it is critical that as many of the potentially cache-missing operations fit into the reorder buffer as possible, so that they can be simultaneously pending):</div><div><br></div><div><div>template <typename T, int N></div><div>__attribute__((noinline))</div><div>void loadFromPointers(T **PtrArr, T *__restrict__ OutVals) {</div><div>  // The compiler should hopefully fully unroll these loops and fully SROA all the fixed-size arrays.</div><div><br></div><div>  T *Ptrs[N];</div><div>  for (int i = 0; i < N; i++)</div><div>    Ptrs[i] = PtrArr[i];</div><div>  // If necessary, to avoid the compiler from moving instructions into the critical loop.</div><div>  //asm("");</div><div><br></div><div>  T Vals[N];</div><div>  // This is the critical part. As many loads as possible must fit into the reorder buffer.</div><div>  // This should hopefully compile into a bunch of sequential load instructions with nothing in between.</div><div>  for (int i = 0; i < N; i++)</div><div>    Vals[i] = *Ptrs[i];</div><div><br></div><div>  // If necessary, to avoid the compiler from moving instructions into the critical loop.</div><div>  //asm("");</div><div>  for (int i = 0; i < N; i++)</div><div>    OutVals[i] = Vals[i];</div><div>}</div></div><div><br></div><div>-- Sean Silva</div><div><br></div><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 14, 2016 at 11:23 AM, Rafael Espíndola via llvm-commits <span dir="ltr"><<a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I have put some effort to try to speed up StringTableBuilder. The last<br>

thing that worked was committed as r284249.<br>

<br>

The main difficulty in optimizing it is the large number of strings it<br>

has to handle. In the case of xul, one of the string tables has<br>

14_375_801 strings added, out of which only 1_726_762 are unique.<br>

<br>

The things I tried:<br>

<br>

* Instead of returning size_t from add, pass in a size_t* to add. This<br>

allows us to remember all StringRefs and only do the merging in<br>

finalize. This would be extra helpful for the -O2 case by not needing<br>

an extra hash lookup. The idea for -O1 was to avoid hash resizing and<br>

improve cache hit by calling StringIndexMap.reserve. Unfortunately<br>

given how many strings are duplicated this is not profitable.<br>

<br>

* Using a DenseSet with just an unsigned in it and a side std::vector<br>

with the rest of the information. The idea is that the vector doesn't<br>

contain empty keys, so it should be denser. This reduced the cache<br>

misses accessing the set, but the extra vector compensated for it.<br>

<br>

* Creating the string buffer incrementally in add. The idea is that<br>

then we don't need to store the pointer to the string. We can find out<br>

what the string is with just the offset. This was probably the most<br>

promising. It reduced the total number of cache misses reported by<br>

perf stat, but the overall time didn't improve significantly. This<br>

also makes -O2 substantially harder to implement. I have attached the<br>

patch that implements this (note that -O2 is not implemented).<br>

<br>

* Not merging constants to see if special casing them would make a<br>

difference. No test speeds ups by even 1%.<br>

<br>

In summary, it seems the string merging at -O1 is as fast as it gets<br>

short of someone knowing a crazy algorithm. At -O2 it should be<br>

possible to avoid the second hash lookup by passing a size_t* to add,<br>

but it is not clear if that would be worth the code complexity.<br>

<br>

Cheers,<br>

Rafael<br>

<br>______________________________<wbr>_________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a><br>

<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-commits</a><br>

<br></blockquote></div><br></div>