<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Aug 4, 2014 at 4:36 PM, Duncan P. N. Exon Smith <span dir="ltr"><<a href="mailto:dexonsmith@apple.com" target="_blank">dexonsmith@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">> On 2014-Jul-29, at 10:36, Duncan P. N. Exon Smith <<a href="mailto:dexonsmith@apple.com">dexonsmith@apple.com</a>> wrote:<br>


><br>

>><br>

>> On 2014-Jul-29, at 10:19, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>> wrote:<br>

>><br>

>> First, I agree with Chandler about not worrying unless this is in the profile.<br>

>><br>

>> However, if this really does need to be optimized....<br>

>><br>

>> Crazy idea: would it be possible to store just a single int's worth of RNG seed for each use list?<br>

>><br>

>> A less crazy idea: a vector of indices is an extremely memory-inefficient way to store permutations. For example, there are 12! permutations of 12 elements, and 12! is less than 2^32. Similarly, there are 20! permutations of 20 elements and 20! < 2^64. Therefore your "small" case could theoretically be just a single `unsigned` from a storage perspective.<br>


>> A slightly memory-suboptimal but simple and cpu-friendly way to store the permutations in an integer would be to bit-pack the indices, using just as many bits for the indices as necessary. For example, suppose you were just allowed a single uint64. You could use the following arrangement to store permutations of up to 15 elements:<br>


>> Low 4 bits: number of elements (the "size")<br>

>> Each 4 bits after that: an index. Since we use 4 bits to store it, size() is at most 15, thus each index fits in 4 bits. 4 * 15 = 60, so that is just enough room for up to 15 elements.<br>

>> (there is actually room for quite a bit of out-of-band data; if size() < 15, then you have entire unused indices at the top and so you have 4*(15 - size()) bits available)<br>

>><br>

><br>

> This makes a lot of sense to me.  I'm still trying to shake out some test<br>

> failures, but once I get to looking at memory overhead I think this is a good<br>

> direction.<br>

<br>

</div></div>I took some time today to run `llvm-as` and `llvm-dis` on the LTO'ed IR<br>

for tablegen.  The bitcode is about 7.2MB on-disk.  This isn't really<br>

big, but at least it's not tiny.  For all of these, I have the<br>

`global_ctors` patch applied.<br>

<br>

First, I collected stats on three different data structures for<br>

`UseListShuffleVector`.<br>

<br>

 1. `times`: Currently committed "small vector", with a 6-element small<br>

    array of `unsigned`.<br>

<br>

 2. `times-packed`: Modified version of `times` that uses a 24-element<br>

    small array of `unsigned char` (the big array is still `unsigned`).<br>

    Same `sizeof()` as the 6-element array above, but more often small,<br>

    and also slightly more complex.<br>

<br>

 3. `times-stdvec`: `std::vector<unsigned>`.<br>

<br>

Here is the average user time and resident memory of each of these<br>

(average of 10 runs) for `llvm-as -preserve-bc-use-list-order`:<br>

<br>

    1.6211 151357849 times/preserve-as-*.profile<br>

    1.6278 151314841 times-packed/preserve-as-*.profile<br>

    1.6347 151341465 times-stdvec/preserve-as-*.profile<br>

<br>

The difference between these three versions is pretty noisy, but the<br>

`std::vector<>` version looks slightly slower.  I think it can be left<br>

as-is in the tree, but let me know if you think differently and/or want<br>

data from a bigger bitcode file.<br></blockquote><div><br></div><div>These differences are well within the noise, so optimizing this code path seems pointless. Just go with whichever version is simplest (`std::vector`?).</div>
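<div>In case the packed representation ever does matter, here is a minimal sketch of the 4-bit layout I described earlier: the size in the low 4 bits, then one 4-bit index per element. The helper names (`packPermutation`, `unpackPermutation`, etc.) are made up for illustration and are not anything in the tree.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers (not LLVM APIs) for the layout described above:
//   bits 0..3          : number of elements (at most 15)
//   bits 4*(I+1)..+3   : the I-th index of the permutation

// Pack a permutation of at most 15 elements into a single 64-bit word.
inline uint64_t packPermutation(const std::vector<unsigned> &Indices) {
  assert(Indices.size() <= 15 && "at most 15 elements fit in 64 bits");
  uint64_t Packed = Indices.size();
  for (size_t I = 0; I != Indices.size(); ++I) {
    assert(Indices[I] < Indices.size() && "index out of range");
    Packed |= uint64_t(Indices[I]) << (4 * (I + 1));
  }
  return Packed;
}

// Number of elements stored in the packed word.
inline unsigned packedSize(uint64_t Packed) { return Packed & 0xF; }

// The I-th index stored in the packed word.
inline unsigned packedIndex(uint64_t Packed, unsigned I) {
  assert(I < packedSize(Packed) && "position out of range");
  return (Packed >> (4 * (I + 1))) & 0xF;
}

// Expand the packed word back into a vector of indices.
inline std::vector<unsigned> unpackPermutation(uint64_t Packed) {
  std::vector<unsigned> Indices(packedSize(Packed));
  for (unsigned I = 0; I != Indices.size(); ++I)
    Indices[I] = packedIndex(Packed, I);
  return Indices;
}
```

<div>With this layout the 3-element order {2, 0, 1} packs to 0x1023, and the 4*(15 - size()) bits above the last index stay zero, so they remain available for out-of-band data.</div>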

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

----<br>

<br>

Note that the difference in memory overhead between these data<br>

structures doesn't seem important.  I ran `opt` on a bitcode file that<br>

had preserved a "shuffled" use-list order, running no passes but using<br>

`-preserve-bc-use-list-order`.<br>

<br>

    0.1189  14545715 shuffled.profile<br>

    0.1191  14537523 shuffled-packed.profile<br>

    0.1197  14701363 shuffled-stdvec.profile<br>

<br>

For reference, here are two more versions.  `packed8` has an 8-element<br>

array instead of 24-element, so `sizeof(UseListShuffleVector)` actually<br>

drops.  `nopreserve` is the current data structure but without<br>

`-preserve-bc-use-list-order`.<br>

<br>

    0.1184  14544486 shuffled-packed8.profile<br>

    0.0941  12574310 shuffled-nopreserve.profile<br>

<br>

Calculating the use-lists takes extra memory -- but none of these data<br>

structures has much effect on how much.<br></blockquote><div><br></div><div>Do you have stats on the size distribution of use lists, e.g. a histogram (probably on a log scale)?</div><div>It would also be worth comparing those distributions across bitcode modules from different sorts of code (e.g. chromium, firefox, clang, various cases in the test-suite, etc.).</div>
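<div>To be concrete about the log-scale histogram, something like the following sketch would do: bucket B counts the use lists whose size is between 2^B and 2^(B+1) - 1. It assumes the use-list sizes have already been collected into a vector, one entry per value; `log2Bucket` and `log2Histogram` are hypothetical names, not LLVM APIs.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: floor(log2(N)) for N > 0, i.e. the bucket index.
inline unsigned log2Bucket(size_t N) {
  assert(N > 0 && "empty use lists have no bucket");
  unsigned B = 0;
  while (N >>= 1)
    ++B;
  return B;
}

// Hypothetical helper: count use lists per power-of-two size bucket.
// Buckets[B] is the number of sizes in [2^B, 2^(B+1) - 1].
// Empty use lists (size 0) are skipped.
inline std::vector<size_t> log2Histogram(const std::vector<size_t> &Sizes) {
  std::vector<size_t> Buckets;
  for (size_t N : Sizes) {
    if (N == 0)
      continue;
    unsigned B = log2Bucket(N);
    if (B >= Buckets.size())
      Buckets.resize(B + 1, 0);
    ++Buckets[B];
  }
  return Buckets;
}
```

<div>Printing one line per bucket for each module would be enough to see how many use lists fall under a given small-size cutoff.</div>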

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

----<br>

<br>

Since I was collecting data anyway, I have some stats on overhead.<br>

<br>

`llvm-as`, but without `-preserve-bc-use-list-order`:<br>

<br>

    1.4869 151392256 times/nopreserve-as-*.profile<br>

    1.4979 151357440 times-packed/nopreserve-as-*.profile<br>

    1.4945 151396761 times-stdvec/nopreserve-as-*.profile<br>

<br>

For `llvm-dis`, here's reading a file generated by `llvm-as` without<br>

`-preserve-bc-use-list-order`, then with it, and then generated by<br>

`verify-uselistorder -save-temps` after shuffling use-lists.<br>

<br>

    0.7472 119365222 times/nopreserve-dis-*.profile<br>

    0.7958 121296896 times/preserve-dis-*.profile<br>

    0.8792 123158118 times-shuffled/preserve-dis-*.profile<br>

<br>

File size of the input bitcode for each of these (in the same order):<br>

<br>

    7.2M nopreserve.bc<br>

    7.6M preserve.bc<br>

    8.3M shuffled.bc<br>

<br></blockquote><div><br></div><div>Why does shuffling increase the file size? Is there some "default" order that is cheaper to store? How does the choice of permutation affect the amount stored?</div><div><br>

</div><div>-- Sean Silva</div></div><br></div></div>