[RFC] SortedVector ADT

Justin Bogner mail at justinbogner.com
Fri Jul 31 09:34:02 PDT 2015


Puyan Lotfi <puyan at puyan.org> writes:
> On Fri, Jul 31, 2015 at 6:27 AM, Aaron Ballman <aaron at aaronballman.com>
> wrote:
>
>> On Wed, Jul 29, 2015 at 8:40 PM, Puyan Lotfi <plotfi at apple.com> wrote:
>> > Justin, Aaron:
>> >
>> > These are the revised patches.
>> > I have tried to address all of your feedback with the exception of the
>> naming of the ADT.
>> > I was thinking, if I did rename it I could go with SortedVectorSet or
>> SortedSet.
>> >
>> > Anyways, I have added:
>> >
>> > - Better documentation changes that try to be more clear on how and when
>> to use SortedVector.
>> > - Better comments.
>> > - Removed erase().
>> > - Replaced reset with clear.
>> > - Renamed insert to push_back
>> > - Added unit testing.
>> >
>> > Hope this is enough to put my patch back in; please let me know if I
>> missed anything.
>>
>> Comments below. Btw, I think this discussion should be on its own
>> thread instead of attached to a commit revert message. I suspect there
>> are others in the community who have opinions on this ADT, but are
>> unaware of the discussion due to context. I've taken the liberty of
>> starting a new email thread for this, and included a few usual
>> suspects who review ADTs.
>>
>> For anyone joining in the middle, this is a discussion of an ADT Puyan
>> proposed in:
>>
>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150720/288930.html
>>
>> That was commit (and eventually reverted) in:
>>
>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150727/289945.html
>>
>> > diff --git a/include/llvm/ADT/SortedVector.h
>> b/include/llvm/ADT/SortedVector.h
>> > new file mode 100644
>> > index 0000000..ddafd6b
>> > --- /dev/null
>> > +++ b/include/llvm/ADT/SortedVector.h
>> > @@ -0,0 +1,137 @@
>> > +//===-- llvm/ADT/SortedVector.h ---------------------------------*- C++
>> -*-===//
>> > +//
>> > +//                     The LLVM Compiler Infrastructure
>> > +//
>> > +// This file is distributed under the University of Illinois Open Source
>> > +// License. See LICENSE.TXT for details.
>> > +//
>> >
>> +//===----------------------------------------------------------------------===//
>> > +
>> > +#ifndef LLVM_ADT_SORTEDVECTOR_H
>> > +#define LLVM_ADT_SORTEDVECTOR_H
>> > +
>> > +#include <vector>
>> > +#include <cassert>
>> > +#include <functional>
>> > +
>> > +namespace llvm {
>> > +
>> > +/// \brief A vector with minimal append overhead that enforces a sorted
>> > +/// ordering of unique elements prior to access. The most common use
>> case
>> > +/// is to push_back multiple Entries and call sortUnique afterward.
>>
>> I am wondering why this should be an ADT instead of one-off
>> functionality. I see that you're going to use it for
>> MachineBasicBlock, but are there other use cases as well? The kind of
>> functionality that this class provides is covered by basic STL
>> mechanisms, and so I am hesitant to support it as an ADT for all of
>> our users unless this is something we expect will be heavily used (or
>> is heavily customized to support toolchain needs).
>>
>>
> Originally I had added a patch to improve compile time in VirtRegRewriter,
> and me and Quentin had noticed that these LiveIns set in MachineBasicBlock
> we not actually in sorted order in all cases. We thought this could be a
> good way to maintain the ordering for the LiveIns set with low overhead
> since they were mostly sorted as a result of how they were being inserted
> into. Does a new ADT really need more than one client out the gate?
>
>
>> Also, the comments do not specify what ordering it is sorted by (we
>> can presume from the std::less, but it would be good to be explicit
>> since this is user-facing documentation).
>>
>
> Noted. I can add that comment.
>
>
>>
>> > +template<typename T, typename CMP = std::less<T>>
>> > +class SortedVector {
>> > +private:
>> > +  typedef T value_type;
>> > +  typedef typename std::vector<value_type> vector_type;
>> > +  typedef typename vector_type::const_iterator const_iterator;
>> > +  typedef typename vector_type::iterator iterator;
>> > +  typedef typename vector_type::size_type size_type;
>>
>> These typedefs should be public, aside from vector_type.
>>
>>
> I must have mistaken the previous feedback, and thought you wanted them all
> private. Sorry.
>
>
>> Missing typedefs for difference_type, reference and const_reference.
>>
>>
> I didn't see any of this in say, UniqueVector. But I see it they are in
> SetVector. Yeah, I can certainly go add those.
>
>
>> > +
>> > +  vector_type Vector;
>> > +  bool IsSorted = true;
>> > +
>> > +  void AssertIsSorted() const {
>> > +    assert(IsSorted && "Unsorted SortedVector access; call sortUnique
>> prior.");
>> > +  }
>> > +
>> > +public:
>> > +  /// \brief Appends Entry to the end of Vector; avoids appending
>> duplicate
>> > +  /// entries. If appending Entry causes Vector to be unsorted, then
>> sortUnique
>> > +  /// must be called prior to access.
>>
>> The public comments are still discussing private implementation
>> details like Vector.
>>
>>
> So I shouldn't mention at all that internally this ADT contains a
> std::vector?
>
>
>> How is the user of this function supposed to know when appending the
>> entry causes the vector to be unsorted?
>>
>>
> The only way they know they are keeping the vector sorted is if they can
> guarantee that they had inserted every element in whatever sorted order
> that they specified in the CMP template parameter. Otherwise they should
> know to call sortUnique.
>
>
>> > +  void push_back(const value_type &Entry) {
>>
>> Should be using the typedef for const_reference.
>>
>> > +    if (!Vector.size())
>>
>> Vector.empty() is more clear.
>>
>>
> Both noted.
>
>
>> > +      Vector.push_back(Entry);
>>
>> After pushing back in an empty vector, why are you relying on the
>> std::equal_to to perform the return?
>>
>>
> If the Vector is empty I add the Entry no matter what.
> Since it is the first Entry, it is sorted and I can return.
> Entry is equal to itself, so it just returns.
>
>
>> > +
>> > +    // If Entry is a duplicate of the previous Entry we skip.
>> > +    if (std::equal_to<value_type>()(Vector.back(), Entry))
>> > +      return;
>>
>> Why is this forcing use of std::equal_to, but we can specialize on
>> something other than std::less? If someone is implementing a custom
>> std::less implementation, it's very likely they would want a custom
>> std::equal_to used as well.
>>
>>
> Make sense, I think could be added as a template parameter.
>
>
>> > +
>> > +    IsSorted &= (CMP()(Vector.back(), Entry));
>> > +    Vector.push_back(Entry);
>> > +  }
>> > +
>> > +  // \brief Sorts and uniques Vector.
>> > +  void sortUnique() {
>> > +    if (IsSorted)
>> > +      return;
>>
>> This still isn't correct. Consider code like:
>>
>> SortedVector<int> V;
>> V.push_back(1);
>> V.push_back(2);
>> V.push_back(3);
>> *V.begin() = 3;
>>
>> Not only is the vector no longer sorted (but it will claim that it is
>> because IsSorted will be true), it's also not unique. We then fail to
>> sort or unique the backing store.
>>
>> In order for this ADT to be viable, the underlying vector elements
>> must be const so that you can never mutate their values through the
>> ADT interface.
>
>
>
>> +
>> > +    std::sort(Vector.begin(), Vector.end());
>> > +    Vector.erase(std::unique(Vector.begin(), Vector.end()),
>> Vector.end());
>>
>> Neither of these STL functions are using the custom comparator.
>> Instead they use operator< on the object type, which is inconsistent.
>> Should pass in CMP() to std::sort(), and should pass in the template
>> parameter for std::equal_to for std::unique().
>>
>> > +    IsSorted = true;
>> > +  }
>> > +
>> > +  /// \brief Tells if Entry is in Vector without relying on
>> sorted-uniqueness.
>> > +  bool has(value_type Entry) const {
>>
>> This should be using const_reference instead of forcing a by-value copy.
>>
>> > +    if (IsSorted)
>> > +      return std::binary_search(Vector.begin(), Vector.end(), Entry);
>>
>> This function is using operator< to perform the comparisons instead of
>> using CMP().
>>
>> > +
>> > +    return std::find(Vector.begin(), Vector.end(), Entry) !=
>> Vector.end();
>>
>> This function is using operator== to perform the comparisons instead
>> of using a template parameter for std::equal_to.
>>
>>
> I can provide another revised patch with these feedback in mind.
>
>
>> > +  }
>>
>> I find it really strange that so many member functions call
>> AssertIsSorted(), but the only truly interesting part of the interface
>> does not. Why is it valid for you to call has() without using a sorted
>> vector? This means the complexity of calling this function is either
>> O(N) or O(log N), and you have no way of knowing which you'll get
>> unless you call sortUnique() first.
>>
>> > +
>>
>
> There are places that call MachineBasicBlock::isLiveIn. They just need to
> know if something is in the set or not.
> I thought this would be a way to migrate existing code that checked if the
> entry was in the set before appending to it without changing a lot of the
> client code just yet.
>
> Me and Quentin actually discussed this in the original feedback for the
> patch.
> We were planning to add a heuristic to decide after a certain number of
> calls to has whether to sort.
>
>
>> > +  /// \brief Returns a reference to the Entry with the specified index.
>> > +  const value_type &operator[](unsigned index) const {
>>
>> This should be returning a const_reference and taking a size_type.
>>
>> > +    assert(index < size() && "SortedVector index is out of range!");
>> > +    AssertIsSorted();
>> > +    return Vector[index];
>> > +  }
>> > +
>> > +  /// \brief returns an iterator to the start of vector. Asserts if
>> vector is
>> > +  /// not sorted and uniqued.
>> > +  iterator begin() {
>> > +    AssertIsSorted();
>> > +    return Vector.begin();
>> > +  }
>> > +
>> > +  /// \brief Returns a const_iterator to the start of Vector. Asserts if
>> > +  /// Vector is not sorted and uniqued.
>> > +  const_iterator begin() const {
>> > +    AssertIsSorted();
>> > +    return Vector.begin();
>> > +  }
>> > +
>> > +  /// \brief returns an iterator to the end of vector. Asserts if
>> vector is
>> > +  /// not sorted and uniqued.
>> > +  iterator end() {
>> > +    AssertIsSorted();
>> > +    return Vector.end();
>> > +  }
>> > +
>> > +  /// \brief Returns a const_iterator to the end of Vector. Asserts if
>> > +  /// Vector is not sorted and uniqued.
>> > +  const_iterator end() const {
>> > +    AssertIsSorted();
>> > +    return Vector.end();
>> > +  }
>> > +
>> > +  /// \brief Erases Vector at position. Asserts if Vector not sorted and
>> > +  /// uniqued.
>> > +  iterator erase(iterator position) {
>> > +    AssertIsSorted();
>>
>> A vector iterator is a direct index into the vector; why is this
>> assertion required for erasing an arbitrary element?
>>
>
> Ah yeah, this should be taken out.
>
>
>>
>> > +    return Vector.erase(position);
>> > +  }
>> > +
>> > +  /// \brief Returns the total number of entries in Vector. Asserts if
>> Vector
>> > +  /// not sorted and uniqued.
>> > +  size_type size() const {
>> > +    AssertIsSorted();
>>
>> Why is this assertion required for testing the size?
>>
>>
> My thinking was, that the size of the number of meaningful entries in the
> set would change when it gets uniqued so calling size before calling
> sortUnique not correct.
>
>
>> > +    return Vector.size();
>> > +  }
>> > +
>> > +  /// \brief Returns true if Vector is empty.
>> > +  bool empty() const {
>> > +    return Vector.empty();
>> > +  }
>> > +
>> > +  /// \brief Clears all Entries in Vector.
>> > +  void clear() {
>> > +    IsSorted = true;
>> > +    Vector.clear();
>> > +  }
>> > +};
>>
>> There is a lot of functionality missing from this interface that are
>> concerns. For instance, there's no rvalue references, no constructors,
>> pop_back, rbegin()/rend(), etc. This seems like a very limited
>> interface in some ways, while way too permissive of an interface in
>> other ways.
>>
>
> That can be said for other ADTs as well, and I didn't know these were
> requirements.
> Do I need to have everyone of these standard functions out the gate just to
> add a new ADT?
>
>
>>
>> Even if you expanded the interface to cover more of the useful cases
>> and fix the issues from the review, I am not certain this meets the
>> bar to be an ADT. This, ultimately, is the stock STL vector that needs
>> to remain ordered and unique. Why is an existing ordered, unique list
>> (like std::set) insufficient?
>>
>>
> Because I think the implementation would be too messy in the code that used
> it, because I thought it would be better than having the same 3 lines of
> boilerplate sort-unique code copied and pasted, and also because I really
> thought this was a common enough pattern and that with the compile time
> benefit for the use case it was trying to address would be sufficient
> reason to add this one ADT to llvm.

Well, there's a trade off. To add something to ADT it needs to be useful
enough to justify its complexity. If instead we make this type private
to the implementation of MachineBasicBlock, a lot of the abstraction is
unnecessary: We can use operator< rather than allowing users to provide
std::less, we don't need the typedefs for concepts we don't plan to use,
etc.

OTOH, if this simplifies a common pattern, then the extra complexity to
implement a more generic ADT is worthwhile.

> You mentioned unique list (I assume You meant UniqueVector?), I don't think
> UniqueVector would pass some of the requirements stated here because I
> don't think the interface has all the functionality you mentioned was
> missing in SortedVector.
>
> Also, I don't think the compile time impact of using std::set or
> UniqueVector makes them a good candidate for replacing std::vector in
> MachineBasicBlock::LiveIns (both use a red black tree if I am not
> mistaken).

So this is the interesting part. The use case where a sorted/uniqued
std::vector is obviously better than a set type is where there's a phase
of filling the vector, some convenient place to sort, and then a phase
of look ups. If you need to sort and unique repeatedly it's less obvious
which data type will be better.

I get the impression from the implementation that the point of sorted
vector is generally to make it easier to catch mistakes when doing the
fill-sort-use kind of use case. Is that right?

> I can go and change MachineBasicBlock::LiveIns' iterators to just call sort
> on the std::vector it was using but I think it would really make that code
> less readable, and I think there are probably plenty of other places in
> LLVM that could use this (albeit very simple) ADT.

You could also create a private SortedVector type as a helper for
MachineBasicBlock's LiveIns implementation - the discussion on whether
the type belongs in ADT is orthogonal to whether the type is useful for
your use case.

> -PL
>
>
>
>> ~Aaron
>>
>> > +
>> > +} // End of namespace llvm
>> > +
>> > +#endif // LLVM_ADT_SORTEDVECTOR_H
>> > +
>> >
>>



More information about the llvm-commits mailing list