[RFC] SortedVector ADT

Puyan Lotfi puyan at puyan.org
Fri Jul 31 09:02:44 PDT 2015


On Fri, Jul 31, 2015 at 6:27 AM, Aaron Ballman <aaron at aaronballman.com>
wrote:

> On Wed, Jul 29, 2015 at 8:40 PM, Puyan Lotfi <plotfi at apple.com> wrote:
> > Justin, Aaron:
> >
> > These are the revised patches.
> > I have tried to address all of your feedback with the exception of the
> naming of the ADT.
> > I was thinking, if I did rename it I could go with SortedVectorSet or
> SortedSet.
> >
> > Anyways, I have added:
> >
> > - Better documentation changes that try to be more clear on how and when
> to use SortedVector.
> > - Better comments.
> > - Removed erase().
> > - Replaced reset with clear.
> > - Renamed insert to push_back
> > - Added unit testing.
> >
> > Hope this is enough to put my patch back in; please let me know if I
> missed anything.
>
> Comments below. Btw, I think this discussion should be on its own
> thread instead of attached to a commit revert message. I suspect there
> are others in the community who have opinions on this ADT, but are
> unaware of the discussion due to context. I've taken the liberty of
> starting a new email thread for this, and included a few usual
> suspects who review ADTs.
>
> For anyone joining in the middle, this is a discussion of an ADT Puyan
> proposed in:
>
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150720/288930.html
>
> That was commit (and eventually reverted) in:
>
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150727/289945.html
>
> > diff --git a/include/llvm/ADT/SortedVector.h
> b/include/llvm/ADT/SortedVector.h
> > new file mode 100644
> > index 0000000..ddafd6b
> > --- /dev/null
> > +++ b/include/llvm/ADT/SortedVector.h
> > @@ -0,0 +1,137 @@
> > +//===-- llvm/ADT/SortedVector.h ---------------------------------*- C++
> -*-===//
> > +//
> > +//                     The LLVM Compiler Infrastructure
> > +//
> > +// This file is distributed under the University of Illinois Open Source
> > +// License. See LICENSE.TXT for details.
> > +//
> >
> +//===----------------------------------------------------------------------===//
> > +
> > +#ifndef LLVM_ADT_SORTEDVECTOR_H
> > +#define LLVM_ADT_SORTEDVECTOR_H
> > +
> > +#include <vector>
> > +#include <cassert>
> > +#include <functional>
> > +
> > +namespace llvm {
> > +
> > +/// \brief A vector with minimal append overhead that enforces a sorted
> > +/// ordering of unique elements prior to access. The most common use
> case
> > +/// is to push_back multiple Entries and call sortUnique afterward.
>
> I am wondering why this should be an ADT instead of one-off
> functionality. I see that you're going to use it for
> MachineBasicBlock, but are there other use cases as well? The kind of
> functionality that this class provides is covered by basic STL
> mechanisms, and so I am hesitant to support it as an ADT for all of
> our users unless this is something we expect will be heavily used (or
> is heavily customized to support toolchain needs).
>
>
Originally I had added a patch to improve compile time in VirtRegRewriter,
and me and Quentin had noticed that these LiveIns set in MachineBasicBlock
we not actually in sorted order in all cases. We thought this could be a
good way to maintain the ordering for the LiveIns set with low overhead
since they were mostly sorted as a result of how they were being inserted
into. Does a new ADT really need more than one client out the gate?


> Also, the comments do not specify what ordering it is sorted by (we
> can presume from the std::less, but it would be good to be explicit
> since this is user-facing documentation).
>

Noted. I can add that comment.


>
> > +template<typename T, typename CMP = std::less<T>>
> > +class SortedVector {
> > +private:
> > +  typedef T value_type;
> > +  typedef typename std::vector<value_type> vector_type;
> > +  typedef typename vector_type::const_iterator const_iterator;
> > +  typedef typename vector_type::iterator iterator;
> > +  typedef typename vector_type::size_type size_type;
>
> These typedefs should be public, aside from vector_type.
>
>
I must have mistaken the previous feedback, and thought you wanted them all
private. Sorry.


> Missing typedefs for difference_type, reference and const_reference.
>
>
I didn't see any of this in say, UniqueVector. But I see it they are in
SetVector. Yeah, I can certainly go add those.


> > +
> > +  vector_type Vector;
> > +  bool IsSorted = true;
> > +
> > +  void AssertIsSorted() const {
> > +    assert(IsSorted && "Unsorted SortedVector access; call sortUnique
> prior.");
> > +  }
> > +
> > +public:
> > +  /// \brief Appends Entry to the end of Vector; avoids appending
> duplicate
> > +  /// entries. If appending Entry causes Vector to be unsorted, then
> sortUnique
> > +  /// must be called prior to access.
>
> The public comments are still discussing private implementation
> details like Vector.
>
>
So I shouldn't mention at all that internally this ADT contains a
std::vector?


> How is the user of this function supposed to know when appending the
> entry causes the vector to be unsorted?
>
>
The only way they know they are keeping the vector sorted is if they can
guarantee that they had inserted every element in whatever sorted order
that they specified in the CMP template parameter. Otherwise they should
know to call sortUnique.


> > +  void push_back(const value_type &Entry) {
>
> Should be using the typedef for const_reference.
>
> > +    if (!Vector.size())
>
> Vector.empty() is more clear.
>
>
Both noted.


> > +      Vector.push_back(Entry);
>
> After pushing back in an empty vector, why are you relying on the
> std::equal_to to perform the return?
>
>
If the Vector is empty I add the Entry no matter what.
Since it is the first Entry, it is sorted and I can return.
Entry is equal to itself, so it just returns.


> > +
> > +    // If Entry is a duplicate of the previous Entry we skip.
> > +    if (std::equal_to<value_type>()(Vector.back(), Entry))
> > +      return;
>
> Why is this forcing use of std::equal_to, but we can specialize on
> something other than std::less? If someone is implementing a custom
> std::less implementation, it's very likely they would want a custom
> std::equal_to used as well.
>
>
Make sense, I think could be added as a template parameter.


> > +
> > +    IsSorted &= (CMP()(Vector.back(), Entry));
> > +    Vector.push_back(Entry);
> > +  }
> > +
> > +  // \brief Sorts and uniques Vector.
> > +  void sortUnique() {
> > +    if (IsSorted)
> > +      return;
>
> This still isn't correct. Consider code like:
>
> SortedVector<int> V;
> V.push_back(1);
> V.push_back(2);
> V.push_back(3);
> *V.begin() = 3;
>
> Not only is the vector no longer sorted (but it will claim that it is
> because IsSorted will be true), it's also not unique. We then fail to
> sort or unique the backing store.
>
> In order for this ADT to be viable, the underlying vector elements
> must be const so that you can never mutate their values through the
> ADT interface.



> +
> > +    std::sort(Vector.begin(), Vector.end());
> > +    Vector.erase(std::unique(Vector.begin(), Vector.end()),
> Vector.end());
>
> Neither of these STL functions are using the custom comparator.
> Instead they use operator< on the object type, which is inconsistent.
> Should pass in CMP() to std::sort(), and should pass in the template
> parameter for std::equal_to for std::unique().
>
> > +    IsSorted = true;
> > +  }
> > +
> > +  /// \brief Tells if Entry is in Vector without relying on
> sorted-uniqueness.
> > +  bool has(value_type Entry) const {
>
> This should be using const_reference instead of forcing a by-value copy.
>
> > +    if (IsSorted)
> > +      return std::binary_search(Vector.begin(), Vector.end(), Entry);
>
> This function is using operator< to perform the comparisons instead of
> using CMP().
>
> > +
> > +    return std::find(Vector.begin(), Vector.end(), Entry) !=
> Vector.end();
>
> This function is using operator== to perform the comparisons instead
> of using a template parameter for std::equal_to.
>
>
I can provide another revised patch with these feedback in mind.


> > +  }
>
> I find it really strange that so many member functions call
> AssertIsSorted(), but the only truly interesting part of the interface
> does not. Why is it valid for you to call has() without using a sorted
> vector? This means the complexity of calling this function is either
> O(N) or O(log N), and you have no way of knowing which you'll get
> unless you call sortUnique() first.
>
> > +
>

There are places that call MachineBasicBlock::isLiveIn. They just need to
know if something is in the set or not.
I thought this would be a way to migrate existing code that checked if the
entry was in the set before appending to it without changing a lot of the
client code just yet.

Me and Quentin actually discussed this in the original feedback for the
patch.
We were planning to add a heuristic to decide after a certain number of
calls to has whether to sort.


> > +  /// \brief Returns a reference to the Entry with the specified index.
> > +  const value_type &operator[](unsigned index) const {
>
> This should be returning a const_reference and taking a size_type.
>
> > +    assert(index < size() && "SortedVector index is out of range!");
> > +    AssertIsSorted();
> > +    return Vector[index];
> > +  }
> > +
> > +  /// \brief returns an iterator to the start of vector. Asserts if
> vector is
> > +  /// not sorted and uniqued.
> > +  iterator begin() {
> > +    AssertIsSorted();
> > +    return Vector.begin();
> > +  }
> > +
> > +  /// \brief Returns a const_iterator to the start of Vector. Asserts if
> > +  /// Vector is not sorted and uniqued.
> > +  const_iterator begin() const {
> > +    AssertIsSorted();
> > +    return Vector.begin();
> > +  }
> > +
> > +  /// \brief returns an iterator to the end of vector. Asserts if
> vector is
> > +  /// not sorted and uniqued.
> > +  iterator end() {
> > +    AssertIsSorted();
> > +    return Vector.end();
> > +  }
> > +
> > +  /// \brief Returns a const_iterator to the end of Vector. Asserts if
> > +  /// Vector is not sorted and uniqued.
> > +  const_iterator end() const {
> > +    AssertIsSorted();
> > +    return Vector.end();
> > +  }
> > +
> > +  /// \brief Erases Vector at position. Asserts if Vector not sorted and
> > +  /// uniqued.
> > +  iterator erase(iterator position) {
> > +    AssertIsSorted();
>
> A vector iterator is a direct index into the vector; why is this
> assertion required for erasing an arbitrary element?
>

Ah yeah, this should be taken out.


>
> > +    return Vector.erase(position);
> > +  }
> > +
> > +  /// \brief Returns the total number of entries in Vector. Asserts if
> Vector
> > +  /// not sorted and uniqued.
> > +  size_type size() const {
> > +    AssertIsSorted();
>
> Why is this assertion required for testing the size?
>
>
My thinking was, that the size of the number of meaningful entries in the
set would change when it gets uniqued so calling size before calling
sortUnique not correct.


> > +    return Vector.size();
> > +  }
> > +
> > +  /// \brief Returns true if Vector is empty.
> > +  bool empty() const {
> > +    return Vector.empty();
> > +  }
> > +
> > +  /// \brief Clears all Entries in Vector.
> > +  void clear() {
> > +    IsSorted = true;
> > +    Vector.clear();
> > +  }
> > +};
>
> There is a lot of functionality missing from this interface that are
> concerns. For instance, there's no rvalue references, no constructors,
> pop_back, rbegin()/rend(), etc. This seems like a very limited
> interface in some ways, while way too permissive of an interface in
> other ways.
>

That can be said for other ADTs as well, and I didn't know these were
requirements.
Do I need to have everyone of these standard functions out the gate just to
add a new ADT?


>
> Even if you expanded the interface to cover more of the useful cases
> and fix the issues from the review, I am not certain this meets the
> bar to be an ADT. This, ultimately, is the stock STL vector that needs
> to remain ordered and unique. Why is an existing ordered, unique list
> (like std::set) insufficient?
>
>
Because I think the implementation would be too messy in the code that used
it, because I thought it would be better than having the same 3 lines of
boilerplate sort-unique code copied and pasted, and also because I really
thought this was a common enough pattern and that with the compile time
benefit for the use case it was trying to address would be sufficient
reason to add this one ADT to llvm.

You mentioned unique list (I assume You meant UniqueVector?), I don't think
UniqueVector would pass some of the requirements stated here because I
don't think the interface has all the functionality you mentioned was
missing in SortedVector.

Also, I don't think the compile time impact of using std::set or
UniqueVector makes them a good candidate for replacing std::vector in
MachineBasicBlock::LiveIns (both use a red black tree if I am not
mistaken).

I can go and change MachineBasicBlock::LiveIns' iterators to just call sort
on the std::vector it was using but I think it would really make that code
less readable, and I think there are probably plenty of other places in
LLVM that could use this (albeit very simple) ADT.

-PL



> ~Aaron
>
> > +
> > +} // End of namespace llvm
> > +
> > +#endif // LLVM_ADT_SORTEDVECTOR_H
> > +
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150731/35136cfe/attachment.html>


More information about the llvm-commits mailing list