[llvm] r214135 - IR: Optimize size of use-list order shuffle vectors

Tue Jul 29 10:21:02 PDT 2014

On Tue, Jul 29, 2014 at 10:15 AM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:
>
>> On 2014-Jul-29, at 09:23, David Blaikie <dblaikie at gmail.com> wrote:
>>
>> On Mon, Jul 28, 2014 at 3:41 PM, Duncan P. N. Exon Smith
>> <dexonsmith at apple.com> wrote:
>>> Author: dexonsmith
>>> Date: Mon Jul 28 17:41:50 2014
>>> New Revision: 214135
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=214135&view=rev
>>> Log:
>>> IR: Optimize size of use-list order shuffle vectors
>>>
>>> Since we're storing lots of these, save two-pointers per vector
>>
>> It saves two pointers but costs one integer, right? At the cost of
>> never having a capacity that differs from the size, which means more
>> reallocation overhead? And I see there is no reallocation.. so that
>> explains it then.
>
> Yup.  These are calculated once and then processed later.
>
> Also, it saves 3 pointers but costs one integer (2 pointers overall).
>
>   - sizeof(ShuffleVector) == 32
>   - sizeof(SmallVector<unsigned, 6>) == 48

Yep yep - eventually I saw the union you were using. At the cost of a
conditional for every access (to decide which side of the union to
look at), which might not be great - but again, I have no intuition
about the relative tradeoff there.

>
>> +Chandler: This is where I want std::dynamic_array.
>>
>>> with a
>>> custom type rather than using the relatively heavy `SmallVector`.
>>>
>>> Part of PR5680.
>>>
>>> Modified:
>>>    llvm/trunk/include/llvm/IR/UseListOrder.h
>>>    llvm/trunk/lib/Bitcode/Writer/ValueEnumerator.cpp
>>>
>>> Modified: llvm/trunk/include/llvm/IR/UseListOrder.h
>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/IR/UseListOrder.h?rev=214135&r1=214134&r2=214135&view=diff
>>> ==============================================================================
>>> --- llvm/trunk/include/llvm/IR/UseListOrder.h (original)
>>> +++ llvm/trunk/include/llvm/IR/UseListOrder.h Mon Jul 28 17:41:50 2014
>>> @@ -25,11 +25,58 @@ class Module;
>>> class Function;
>>> class Value;
>>>
>>> +/// \brief Structure to hold a use-list shuffle vector.
>>> +///
>>> +/// Stores most use-lists locally, but large use-lists use an extra heap entry.
>>> +/// Costs two fewer pointers than the equivalent \a SmallVector.
>>> +class UseListShuffleVector {
>>> +  unsigned Size;
>>> +  union {
>>> +    unsigned *Ptr;
>>> +    unsigned Array[6];
>>> +  } Storage;
>>> +
>>> +  bool isSmall() const { return Size <= 6; }
>>> +  unsigned *data() { return isSmall() ? Storage.Array : Storage.Ptr; }
>>> +  const unsigned *data() const {
>>> +    return isSmall() ? Storage.Array : Storage.Ptr;
>>> +  }
>>> +
>>> +public:
>>> +  UseListShuffleVector() : Size(0) {}
>>> +  UseListShuffleVector(UseListShuffleVector &&X) {
>>> +    std::memcpy(this, &X, sizeof(UseListShuffleVector));
>>> +    X.Size = 0;
>>> +  }
>>> +  explicit UseListShuffleVector(size_t Size) : Size(Size) {
>>> +    if (!isSmall())
>>> +      Storage.Ptr = new unsigned[Size];
>>> +  }
>>> +  ~UseListShuffleVector() {
>>> +    if (!isSmall())
>>> +      delete Storage.Ptr;
>>> +  }
>>> +
>>> +  typedef unsigned *iterator;
>>> +  typedef const unsigned *const_iterator;
>>> +
>>> +  size_t size() const { return Size; }
>>> +  iterator begin() { return data(); }
>>> +  iterator end() { return begin() + size(); }
>>> +  const_iterator begin() const { return data(); }
>>> +  const_iterator end() const { return begin() + size(); }
>>> +  unsigned &operator[](size_t I) { return data()[I]; }
>>> +  unsigned operator[](size_t I) const { return data()[I]; }
>>> +};
>>> +
>>> /// \brief Structure to hold a use-list order.
>>> struct UseListOrder {
>>> -  const Function *F;
>>>   const Value *V;
>>> -  SmallVector<unsigned, 8> Shuffle;
>>> +  const Function *F;
>>> +  UseListShuffleVector Shuffle;
>>> +
>>> +  UseListOrder(const Value *V, const Function *F, size_t ShuffleSize)
>>> +      : V(V), F(F), Shuffle(ShuffleSize) {}
>>> };
>>>
>>> typedef std::vector<UseListOrder> UseListOrderStack;
>>>
>>> Modified: llvm/trunk/lib/Bitcode/Writer/ValueEnumerator.cpp
>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Bitcode/Writer/ValueEnumerator.cpp?rev=214135&r1=214134&r2=214135&view=diff
>>> ==============================================================================
>>> --- llvm/trunk/lib/Bitcode/Writer/ValueEnumerator.cpp (original)
>>> +++ llvm/trunk/lib/Bitcode/Writer/ValueEnumerator.cpp Mon Jul 28 17:41:50 2014
>>> @@ -133,12 +133,11 @@ static void predictValueUseListOrderImpl
>>>     return;
>>>
>>>   // Store the shuffle.
>>> -  UseListOrder O;
>>> -  O.V = V;
>>> -  O.F = F;
>>> -  for (auto &I : List)
>>> -    O.Shuffle.push_back(I.second);
>>> -  Stack.push_back(O);
>>> +  UseListOrder O(V, F, List.size());
>>> +  assert(List.size() == O.Shuffle.size() && "Wrong size");
>>> +  for (size_t I = 0, E = List.size(); I != E; ++I)
>>> +    O.Shuffle[I] = List[I].second;
>>> +  Stack.emplace_back(std::move(O));
>>
>> Generally I'd suggest skipping emplace_back where push_back will have
>> the same performance characteristics (since you don't need to call
>> anything other than the move ctor, which push_back can do just fine)
>>
>> But given the use of the small-size optimization, pushing back the
>> result isn't ideal either, it's probably better to allocate the
>> element in the Stack then initialize it there so you don't have to
>> copy the small portion over:
>>
>>  // Is there an easy way to make a sequence one larger and return a
>> reference to that new element? It seems such a common operation...
>>  Stack.resize(Stack.size() + 1);
>>  UseListOrder &O = Stack.back();
>>  ...
>>
>
> Fixed in r214184.
>
>> In addition/alternatively, I'm not sure how valuable the small-size
>> optimization is if all of them are going in another container anyway.
>> Usually the small optimization is most valuable when it avoids malloc
>> allocation entirely (by using stack space) but here each element in
>> Stack will have the small footprint in use - so wasted small buffers
>> will be costing heap space...
>
> Every use list shuffle will be at least size 2 (otherwise, there's
> nothing to shuffle).  If mallocs had 0B memory overhead,
> `std::vector<unsigned>` of size 2 would cost 32B.  Since mallocs
> actually do have overhead (usually at least 16B), the common case (2-6
> items) will take less memory.

Right - but for every element that's bigger than the small space, you
still pay the small in the Stack, so it does become a bit of a matter
of size statistics as to where the sweet spot is.

>
>> (Chandler, again - thoughts on this tradeoff? I'm perhaps too readily
>> open to avoiding premature optimization and thus avoid small-size
>> optimizations until there's a really good reason to add them, but I
>> haven't done as much profiling/performance work to have an intuition
>> about when they're not premature and just good practice)
>>
>>
>>> }
>>>
>>> static void predictValueUseListOrder(const Value *V, const Function *F,
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>