<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Aug 4, 2014 at 4:36 PM, Duncan P. N. Exon Smith <span dir="ltr"><<a href="mailto:dexonsmith@apple.com" target="_blank">dexonsmith@apple.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">> On 2014-Jul-29, at 10:36, Duncan P. N. Exon Smith <<a href="mailto:dexonsmith@apple.com">dexonsmith@apple.com</a>> wrote:<br>
><br>
>><br>
>> On 2014-Jul-29, at 10:19, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>> wrote:<br>
>><br>
>> First, I agree with Chandler about not worrying unless this is in the profile.<br>
>><br>
>> However, if this really does need to be optimized....<br>
>><br>
>> Crazy idea: would it be possible to store just a single int's worth of RNG seed for each use list?<br>
>><br>
>> A less crazy idea: a vector of indices is an extremely memory-inefficient way to store permutations. For example, there are 12! permutations of 12 elements, and 12! is less than 2^32. Similarly, there are 20! permutations of 20 elements and 20! < 2^64. Therefore your "small" case could theoretically be just a single `unsigned` from a storage perspective.<br>
>> A slightly memory-suboptimal but simple and cpu-friendly way to store the permutations in an integer would be to bit-pack the indices, using just as many bits for the indices as necessary. For example, suppose you were just allowed a single uint64. You could use the following arrangement to store permutations of up to 15 elements:<br>
>> Low 4 bits: number of elements (the "size")<br>
>> Each 4 bits after that: an index. Since we use 4 bits to store it, size() is at most 15, thus each index fits in 4 bits. 4 * 15 = 60, so that is just enough room for up to 15 elements.<br>
>> (there is actually room for quite a bit of out-of-band data; if size() < 15, then you have entire unused indices at the top and so you have 4*(15 - size()) bits available)<br>
>><br>
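A minimal sketch of the bit-packed layout described above, assuming a single 64-bit word with the size in the low 4 bits and one 4-bit index per element after that.<br>
The helper names `packPermutation` and `unpackPermutation` are hypothetical and are not part of LLVM; this only illustrates the storage scheme, not the committed `UseListShuffleVector`.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers (not LLVM API): store a permutation of at most 15
// elements in one 64-bit word.
//   bits [0, 4)               : number of elements (the "size")
//   bits [4*(I+1), 4*(I+2))   : the I-th index
inline std::uint64_t packPermutation(const std::vector<unsigned> &Indices) {
  assert(Indices.size() <= 15 && "only 15 four-bit indices fit in 60 bits");
  std::uint64_t Packed = Indices.size();
  for (std::size_t I = 0; I < Indices.size(); ++I) {
    assert(Indices[I] < Indices.size() && "index must be less than size");
    Packed |= static_cast<std::uint64_t>(Indices[I]) << (4 * (I + 1));
  }
  return Packed;
}

// Inverse of packPermutation: read the size, then each 4-bit index.
inline std::vector<unsigned> unpackPermutation(std::uint64_t Packed) {
  std::size_t Size = Packed & 0xF;
  std::vector<unsigned> Indices(Size);
  for (std::size_t I = 0; I < Size; ++I)
    Indices[I] = static_cast<unsigned>((Packed >> (4 * (I + 1))) & 0xF);
  return Indices;
}
```

For example, packing the permutation {2, 0, 3, 1} gives a word whose low nibble is 4, and unpacking that word returns the same four indices. Any nibbles above the last index stay zero, which is the spare out-of-band room mentioned above.<br>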
><br>
> This makes a lot of sense to me. I'm still trying to shake out some test<br>
> failures, but once I get to looking at memory overhead I think this is a good<br>
> direction.<br>
<br>
</div></div>I took some time today to run `llvm-as` and `llvm-dis` on the LTO'ed IR<br>
for tablegen. The bitcode is about 7.2MB on-disk. This isn't really<br>
big, but at least it's not tiny. For all of these, I have the<br>
`global_ctors` patch applied.<br>
<br>
First, I collected stats on three different data structures for<br>
`UseListShuffleVector`.<br>
<br>
1. `times`: Currently committed "small vector", with a 6-element small<br>
array of `unsigned`.<br>
<br>
2. `times-packed`: Modified version of `times` that uses a 24-element<br>
small array of `unsigned char` (the big array is still `unsigned`).<br>
Same `sizeof()` as the 6-element array above, but more often small,<br>
and also slightly more complex.<br>
<br>
3. `times-stdvec`: `std::vector<unsigned>`.<br>
<br>
Here are the average user time and resident memory (first and second<br>
columns, averaged over 10 runs) for `llvm-as -preserve-bc-use-list-order`:<br>
<br>
1.6211 151357849 times/preserve-as-*.profile<br>
1.6278 151314841 times-packed/preserve-as-*.profile<br>
1.6347 151341465 times-stdvec/preserve-as-*.profile<br>
<br>
The difference between these three versions is pretty noisy, but the<br>
`std::vector<>` version looks slightly slower. I think it can be left<br>
as-is in the tree, but let me know if you think differently and/or want<br>
data from a bigger bitcode file.<br></blockquote><div><br></div><div>This difference is definitely in the noise, so optimizing this code path seems pointless. Just keep whichever version is simplest (`std::vector`?).</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
----<br>
<br>
Note that the difference in memory overhead between these data<br>
structures doesn't seem important. I ran `opt` on a bitcode file that<br>
had preserved a "shuffled" use-list order, running no passes but using<br>
`-preserve-bc-use-list-order`.<br>
<br>
0.1189 14545715 shuffled.profile<br>
0.1191 14537523 shuffled-packed.profile<br>
0.1197 14701363 shuffled-stdvec.profile<br>
<br>
For reference, here are two more versions. `packed8` has an 8-element<br>
array instead of 24-element, so `sizeof(UseListShuffleVector)` actually<br>
drops. `nopreserve` is the current data structure but without<br>
`-preserve-bc-use-list-order`.<br>
<br>
0.1184 14544486 shuffled-packed8.profile<br>
0.0941 12574310 shuffled-nopreserve.profile<br>
<br>
Calculating the use-lists takes extra memory -- but none of these data<br>
structures has much effect on how much.<br></blockquote><div><br></div><div>Do you have stats on the size distribution of use lists, e.g. a histogram (probably on a log scale)?</div><div>It would also be worth comparing those distributions across bitcode modules from different kinds of code (e.g. chromium, firefox, clang, various cases in the test-suite, etc.).</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
----<br>
<br>
Since I was collecting data anyway, I have some stats on overhead.<br>
<br>
`llvm-as`, but without `-preserve-bc-use-list-order`:<br>
<br>
1.4869 151392256 times/nopreserve-as-*.profile<br>
1.4979 151357440 times-packed/nopreserve-as-*.profile<br>
1.4945 151396761 times-stdvec/nopreserve-as-*.profile<br>
<br>
For `llvm-dis`, here are the numbers for reading a file generated by<br>
`llvm-as` without `-preserve-bc-use-list-order`, then one generated with<br>
it, and then one generated by `verify-uselistorder -save-temps` after<br>
shuffling use-lists.<br>
<br>
0.7472 119365222 times/nopreserve-dis-*.profile<br>
0.7958 121296896 times/preserve-dis-*.profile<br>
0.8792 123158118 times-shuffled/preserve-dis-*.profile<br>
<br>
File sizes of the input bitcode for the same three cases (in the same order):<br>
<br>
7.2M nopreserve.bc<br>
7.6M preserve.bc<br>
8.3M shuffled.bc<br>
<br></blockquote><div><br></div><div>Why does shuffling increase the file size? Is there some "default" order that is cheaper to store? How does the choice of permutation affect the amount stored?</div><div><br>
</div><div>-- Sean Silva</div></div><br></div></div>