[llvm] r214135 - IR: Optimize size of use-list order shuffle vectors

Sean Silva chisophugis at gmail.com
Tue Aug 5 17:47:18 PDT 2014


On Mon, Aug 4, 2014 at 4:36 PM, Duncan P. N. Exon Smith <
dexonsmith at apple.com> wrote:

> > On 2014-Jul-29, at 10:36, Duncan P. N. Exon Smith <dexonsmith at apple.com>
> wrote:
> >
> >>
> >> On 2014-Jul-29, at 10:19, Sean Silva <chisophugis at gmail.com> wrote:
> >>
> >> First, I agree with Chandler about not worrying unless this is in the
> profile.
> >>
> >> However, if this really does need to be optimized....
> >>
> >> Crazy idea: would it be possible to store just a single int's worth of
> RNG seed for each use list?
> >>
> >> A less crazy idea: a vector of indices is an extremely
> memory-inefficient way to store permutations. For example, there are 12!
> permutations of 12 elements, and 12! is less than 2^32. Similarly, there
> are 20! permutations of 20 elements and 20! < 2^64. Therefore your "small"
> case could theoretically be just a single `unsigned` from a storage
> perspective.
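The single-integer bound described above is the classic factorial number system (Lehmer code): each permutation of N elements maps to a unique integer below N!, so any permutation of up to 20 elements fits in a `uint64_t`. A minimal sketch of the idea (function names are illustrative, not from the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode a permutation of N <= 20 elements as a single integer via the
// Lehmer code: digit I is the rank of Perm[I] among the values that have
// not appeared yet, accumulated in mixed radix N, N-1, ..., 1.
uint64_t encodePermutation(const std::vector<unsigned> &Perm) {
  unsigned N = Perm.size();
  assert(N <= 20 && "20! is the largest factorial below 2^64");
  uint64_t Code = 0;
  for (unsigned I = 0; I < N; ++I) {
    unsigned Rank = 0;
    for (unsigned J = I + 1; J < N; ++J)
      if (Perm[J] < Perm[I])
        ++Rank;
    Code = Code * (N - I) + Rank;
  }
  return Code;
}

// Invert the encoding: peel off the mixed-radix digits, then pick the
// Digits[I]-th smallest remaining value for each position.
std::vector<unsigned> decodePermutation(uint64_t Code, unsigned N) {
  std::vector<unsigned> Digits(N);
  for (unsigned I = N; I-- > 0;) {
    Digits[I] = Code % (N - I);
    Code /= N - I;
  }
  std::vector<unsigned> Avail;
  for (unsigned I = 0; I < N; ++I)
    Avail.push_back(I);
  std::vector<unsigned> Perm(N);
  for (unsigned I = 0; I < N; ++I) {
    Perm[I] = Avail[Digits[I]];
    Avail.erase(Avail.begin() + Digits[I]);
  }
  return Perm;
}
```

The identity permutation encodes to 0, and encoding/decoding round-trip for any permutation size up to 20.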
> >> A slightly memory-suboptimal but simple and CPU-friendly way to store
> the permutations in an integer would be to bit-pack the indices, using just
> as many bits per index as necessary. For example, suppose you were
> just allowed a single uint64. You could use the following arrangement to
> store permutations of up to 15 elements:
> >> Low 4 bits: number of elements (the "size")
> >> Each 4 bits after that: an index. Since the size field is 4 bits,
> size() is at most 15, so each index also fits in 4 bits. 4 * 15 = 60 bits,
> which is just enough room for up to 15 elements.
> >> (there is actually room for quite a bit of out-of-band data; if size()
> < 15, then you have entire unused indices at the top and so you have 4*(15
> - size()) bits available)
> >>
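The nibble-packing layout just described can be sketched directly (names are illustrative; the committed `UseListShuffleVector` does not use this scheme):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pack a permutation of up to 15 elements into one uint64_t:
// bits [0,4) hold size(); nibble I+1 holds the I-th index.
uint64_t packPermutation(const std::vector<unsigned> &Perm) {
  assert(Perm.size() <= 15 && "only 15 4-bit index slots available");
  uint64_t Bits = Perm.size();
  for (unsigned I = 0; I < Perm.size(); ++I) {
    assert(Perm[I] < Perm.size() && "indices are a permutation of 0..size-1");
    Bits |= uint64_t(Perm[I]) << (4 * (I + 1));
  }
  return Bits;
}

std::vector<unsigned> unpackPermutation(uint64_t Bits) {
  unsigned Size = Bits & 0xF;
  std::vector<unsigned> Perm(Size);
  for (unsigned I = 0; I < Size; ++I)
    Perm[I] = (Bits >> (4 * (I + 1))) & 0xF;
  return Perm;
}
```

For example, the permutation {2, 0, 1} packs to 0x1023: size 3 in the low nibble, then indices 2, 0, 1 in the next three nibbles.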
> >
> > This makes a lot of sense to me.  I'm still trying to shake out some test
> > failures, but once I get to looking at memory overhead I think this is a
> good
> > direction.
>
> I took some time today to run `llvm-as` and `llvm-dis` on the LTO'ed IR
> for tablegen.  The bitcode is about 7.2MB on-disk.  This isn't really
> big, but at least it's not tiny.  For all of these, I have the
> `global_ctors` patch applied.
>
> First, I collected stats on three different data structures for
> `UseListShuffleVector`.
>
>  1. `times`: Currently committed "small vector", with a 6-element small
>     array of `unsigned`.
>
>  2. `times-packed`: Modified version of `times` that uses a 24-element
>     small array of `unsigned char` (the big array is still `unsigned`).
>     Same `sizeof()` as the 6-element array above, but more often small,
>     and also slightly more complex.
>
>  3. `times-stdvec`: `std::vector<unsigned>`.
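For reference, a small-size optimization along the lines being compared can be sketched like this (a generic illustration of the technique, not the committed `UseListShuffleVector`):

```cpp
#include <cassert>
#include <cstdlib>

// A fixed-size shuffle vector with a small-size optimization: permutations
// of up to SmallCap elements live inline; larger ones go to the heap.
class ShuffleVector {
  static const unsigned SmallCap = 6;
  unsigned NumElements;
  union {
    unsigned Small[SmallCap];
    unsigned *Large;
  };
  bool isSmall() const { return NumElements <= SmallCap; }

public:
  explicit ShuffleVector(unsigned N) : NumElements(N) {
    if (!isSmall())
      Large = static_cast<unsigned *>(std::malloc(N * sizeof(unsigned)));
  }
  ~ShuffleVector() {
    if (!isSmall())
      std::free(Large);
  }
  ShuffleVector(const ShuffleVector &) = delete;
  ShuffleVector &operator=(const ShuffleVector &) = delete;

  unsigned size() const { return NumElements; }
  unsigned &operator[](unsigned I) { return isSmall() ? Small[I] : Large[I]; }
};
```

The `times-packed` variant trades element width for count: a 24-element array of `unsigned char` occupies the same inline space as 6 `unsigned`, so more use lists stay in the small case.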
>
> Here is the average user time and resident memory of each of these
> (average of 10 runs) for `llvm-as -preserve-bc-use-list-order`:
>
>     1.6211 151357849 times/preserve-as-*.profile
>     1.6278 151314841 times-packed/preserve-as-*.profile
>     1.6347 151341465 times-stdvec/preserve-as-*.profile
>
> The difference between these three versions is pretty noisy, but the
> `std::vector<>` version looks slightly slower.  I think it can be left
> as-is in the tree, but let me know if you think differently and/or want
> data from a bigger bitcode file.
>

This difference is definitely in the noise. Seems like optimizing this code
path is pointless. Just leave it with whichever one is simplest
(std::vector?).


>
> ----
>
> Note that the difference in memory overhead between these data
> structures doesn't seem important.  I ran `opt` on a bitcode file that
> had preserved a "shuffled" use-list order, running no passes but using
> `-preserve-bc-use-list-order`.
>
>     0.1189  14545715 shuffled.profile
>     0.1191  14537523 shuffled-packed.profile
>     0.1197  14701363 shuffled-stdvec.profile
>
> For reference, here are two more versions.  `packed8` has an 8-element
> array instead of 24-element, so `sizeof(UseListShuffleVector)` actually
> drops.  `nopreserve` is the current data structure but without
> `-preserve-bc-use-list-order`.
>
>     0.1184  14544486 shuffled-packed8.profile
>     0.0941  12574310 shuffled-nopreserve.profile
>
> Calculating the use-lists takes extra memory -- but none of these data
> structures has much effect on how much.
>

Do you have stats on the size distribution of use lists? E.g. a
histogram? (You might want to use a log scale.)
It would also be worth comparing those distributions across many different
bitcode modules from different sorts of code (e.g. Chromium, Firefox,
Clang, various cases in the test-suite, etc.).
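A log-scale histogram here just means bucketing use-list sizes by powers of two; a sketch of the bucketing (illustrative, not existing tooling):

```cpp
#include <cstdint>
#include <vector>

// Bucket use-list sizes into log2 bins for a log-scale histogram:
// sizes 0-1 land in bin 0, 2-3 in bin 1, 4-7 in bin 2, and so on.
std::vector<uint64_t> buildHistogram(const std::vector<unsigned> &Sizes) {
  std::vector<uint64_t> Bins;
  for (unsigned Size : Sizes) {
    unsigned Bin = 0;
    for (unsigned S = Size; S >>= 1;)
      ++Bin;
    if (Bin >= Bins.size())
      Bins.resize(Bin + 1);
    ++Bins[Bin];
  }
  return Bins;
}
```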


>
> ----
>
> Since I was collecting data anyway, I have some stats on overhead.
>
> `llvm-as`, but without `-preserve-bc-use-list-order`:
>
>     1.4869 151392256 times/nopreserve-as-*.profile
>     1.4979 151357440 times-packed/nopreserve-as-*.profile
>     1.4945 151396761 times-stdvec/nopreserve-as-*.profile
>
> For `llvm-dis`, here's reading a file generated by `llvm-as` without
> `-preserve-bc-use-list-order`, then with it, and then generated by
> `verify-uselistorder -save-temps` after shuffling use-lists.
>
>     0.7472 119365222 times/nopreserve-dis-*.profile
>     0.7958 121296896 times/preserve-dis-*.profile
>     0.8792 123158118 times-shuffled/preserve-dis-*.profile
>
> File-size of the input bitcode for same (in the same order):
>
>     7.2M nopreserve.bc
>     7.6M preserve.bc
>     8.3M shuffled.bc
>
>
Why does shuffling increase the filesize? Is there some "default" order
that is cheaper to store? How does the choice of permutation affect the
amount stored?

-- Sean Silva


More information about the llvm-commits mailing list