[RFC] SortedVector ADT

David Blaikie dblaikie at gmail.com
Fri Jul 31 10:09:54 PDT 2015


On Fri, Jul 31, 2015 at 9:44 AM, Puyan Lotfi <puyan at puyan.org> wrote:

>
>
> On Fri, Jul 31, 2015 at 9:34 AM, Justin Bogner <mail at justinbogner.com>
> wrote:
>
>> Puyan Lotfi <puyan at puyan.org> writes:
>> > On Fri, Jul 31, 2015 at 6:27 AM, Aaron Ballman <aaron at aaronballman.com>
>> > wrote:
>> >
>> >> On Wed, Jul 29, 2015 at 8:40 PM, Puyan Lotfi <plotfi at apple.com> wrote:
>> >> > Justin, Aaron:
>> >> >
>> >> > These are the revised patches.
>> >> > I have tried to address all of your feedback with the exception of
>> the
>> >> naming of the ADT.
>> >> > I was thinking, if I did rename it I could go with SortedVectorSet or
>> >> SortedSet.
>> >> >
>> >> > Anyways, I have added:
>> >> >
>> >> > - Better documentation changes that try to be more clear on how and
>> when
>> >> to use SortedVector.
>> >> > - Better comments.
>> >> > - Removed erase().
>> >> > - Replaced reset with clear.
>> >> > - Renamed insert to push_back
>> >> > - Added unit testing.
>> >> >
>> >> > Hope this is enough to put my patch back in; please let me know if I
>> >> missed anything.
>> >>
>> >> Comments below. Btw, I think this discussion should be on its own
>> >> thread instead of attached to a commit revert message. I suspect there
>> >> are others in the community who have opinions on this ADT, but are
>> >> unaware of the discussion due to context. I've taken the liberty of
>> >> starting a new email thread for this, and included a few usual
>> >> suspects who review ADTs.
>> >>
>> >> For anyone joining in the middle, this is a discussion of an ADT Puyan
>> >> proposed in:
>> >>
>> >>
>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150720/288930.html
>> >>
>> >> That was commit (and eventually reverted) in:
>> >>
>> >>
>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150727/289945.html
>> >>
>> >> > diff --git a/include/llvm/ADT/SortedVector.h
>> >> b/include/llvm/ADT/SortedVector.h
>> >> > new file mode 100644
>> >> > index 0000000..ddafd6b
>> >> > --- /dev/null
>> >> > +++ b/include/llvm/ADT/SortedVector.h
>> >> > @@ -0,0 +1,137 @@
>> >> > +//===-- llvm/ADT/SortedVector.h ---------------------------------*-
>> C++
>> >> -*-===//
>> >> > +//
>> >> > +//                     The LLVM Compiler Infrastructure
>> >> > +//
>> >> > +// This file is distributed under the University of Illinois Open
>> Source
>> >> > +// License. See LICENSE.TXT for details.
>> >> > +//
>> >> >
>> >>
>> +//===----------------------------------------------------------------------===//
>> >> > +
>> >> > +#ifndef LLVM_ADT_SORTEDVECTOR_H
>> >> > +#define LLVM_ADT_SORTEDVECTOR_H
>> >> > +
>> >> > +#include <vector>
>> >> > +#include <cassert>
>> >> > +#include <functional>
>> >> > +
>> >> > +namespace llvm {
>> >> > +
>> >> > +/// \brief A vector with minimal append overhead that enforces a
>> sorted
>> >> > +/// ordering of unique elements prior to access. The most common use
>> >> case
>> >> > +/// is to push_back multiple Entries and call sortUnique afterward.
>> >>
>> >> I am wondering why this should be an ADT instead of one-off
>> >> functionality. I see that you're going to use it for
>> >> MachineBasicBlock, but are there other use cases as well? The kind of
>> >> functionality that this class provides is covered by basic STL
>> >> mechanisms, and so I am hesitant to support it as an ADT for all of
>> >> our users unless this is something we expect will be heavily used (or
>> >> is heavily customized to support toolchain needs).
>> >>
>> >>
>> > Originally I had added a patch to improve compile time in
>> VirtRegRewriter,
>> > and me and Quentin had noticed that these LiveIns set in
>> MachineBasicBlock
>> > we not actually in sorted order in all cases. We thought this could be a
>> > good way to maintain the ordering for the LiveIns set with low overhead
>> > since they were mostly sorted as a result of how they were being
>> inserted
>> > into. Does a new ADT really need more than one client out the gate?
>> >
>> >
>> >> Also, the comments do not specify what ordering it is sorted by (we
>> >> can presume from the std::less, but it would be good to be explicit
>> >> since this is user-facing documentation).
>> >>
>> >
>> > Noted. I can add that comment.
>> >
>> >
>> >>
>> >> > +template<typename T, typename CMP = std::less<T>>
>> >> > +class SortedVector {
>> >> > +private:
>> >> > +  typedef T value_type;
>> >> > +  typedef typename std::vector<value_type> vector_type;
>> >> > +  typedef typename vector_type::const_iterator const_iterator;
>> >> > +  typedef typename vector_type::iterator iterator;
>> >> > +  typedef typename vector_type::size_type size_type;
>> >>
>> >> These typedefs should be public, aside from vector_type.
>> >>
>> >>
>> > I must have mistaken the previous feedback, and thought you wanted them
>> all
>> > private. Sorry.
>> >
>> >
>> >> Missing typedefs for difference_type, reference and const_reference.
>> >>
>> >>
>> > I didn't see any of this in say, UniqueVector. But I see it they are in
>> > SetVector. Yeah, I can certainly go add those.
>> >
>> >
>> >> > +
>> >> > +  vector_type Vector;
>> >> > +  bool IsSorted = true;
>> >> > +
>> >> > +  void AssertIsSorted() const {
>> >> > +    assert(IsSorted && "Unsorted SortedVector access; call
>> sortUnique
>> >> prior.");
>> >> > +  }
>> >> > +
>> >> > +public:
>> >> > +  /// \brief Appends Entry to the end of Vector; avoids appending
>> >> duplicate
>> >> > +  /// entries. If appending Entry causes Vector to be unsorted, then
>> >> sortUnique
>> >> > +  /// must be called prior to access.
>> >>
>> >> The public comments are still discussing private implementation
>> >> details like Vector.
>> >>
>> >>
>> > So I shouldn't mention at all that internally this ADT contains a
>> > std::vector?
>> >
>> >
>> >> How is the user of this function supposed to know when appending the
>> >> entry causes the vector to be unsorted?
>> >>
>> >>
>> > The only way they know they are keeping the vector sorted is if they can
>> > guarantee that they had inserted every element in whatever sorted order
>> > that they specified in the CMP template parameter. Otherwise they should
>> > know to call sortUnique.
>> >
>> >
>> >> > +  void push_back(const value_type &Entry) {
>> >>
>> >> Should be using the typedef for const_reference.
>> >>
>> >> > +    if (!Vector.size())
>> >>
>> >> Vector.empty() is more clear.
>> >>
>> >>
>> > Both noted.
>> >
>> >
>> >> > +      Vector.push_back(Entry);
>> >>
>> >> After pushing back in an empty vector, why are you relying on the
>> >> std::equal_to to perform the return?
>> >>
>> >>
>> > If the Vector is empty I add the Entry no matter what.
>> > Since it is the first Entry, it is sorted and I can return.
>> > Entry is equal to itself, so it just returns.
>> >
>> >
>> >> > +
>> >> > +    // If Entry is a duplicate of the previous Entry we skip.
>> >> > +    if (std::equal_to<value_type>()(Vector.back(), Entry))
>> >> > +      return;
>> >>
>> >> Why is this forcing use of std::equal_to, but we can specialize on
>> >> something other than std::less? If someone is implementing a custom
>> >> std::less implementation, it's very likely they would want a custom
>> >> std::equal_to used as well.
>> >>
>> >>
>> > Make sense, I think could be added as a template parameter.
>> >
>> >
>> >> > +
>> >> > +    IsSorted &= (CMP()(Vector.back(), Entry));
>> >> > +    Vector.push_back(Entry);
>> >> > +  }
>> >> > +
>> >> > +  // \brief Sorts and uniques Vector.
>> >> > +  void sortUnique() {
>> >> > +    if (IsSorted)
>> >> > +      return;
>> >>
>> >> This still isn't correct. Consider code like:
>> >>
>> >> SortedVector<int> V;
>> >> V.push_back(1);
>> >> V.push_back(2);
>> >> V.push_back(3);
>> >> *V.begin() = 3;
>> >>
>> >> Not only is the vector no longer sorted (but it will claim that it is
>> >> because IsSorted will be true), it's also not unique. We then fail to
>> >> sort or unique the backing store.
>> >>
>> >> In order for this ADT to be viable, the underlying vector elements
>> >> must be const so that you can never mutate their values through the
>> >> ADT interface.
>> >
>> >
>> >
>> >we should also try to remove any existing ADTs that aren't as compelling
>> to have anymore?> +
>> >> > +    std::sort(Vector.begin(), Vector.end());
>> >> > +    Vector.erase(std::unique(Vector.begin(), Vector.end()),
>> >> Vector.end());
>> >>
>> >> Neither of these STL functions are using the custom comparator.
>> >> Instead they use operator< on the object type, which is inconsistent.
>> >> Should pass in CMP() to std::sort(), and should pass in the template
>> >> parameter for std::equal_to for std::unique().
>> >>
>> >> > +    IsSorted = true;
>> >> > +  }
>> >> > +
>> >> > +  /// \brief Tells if Entry is in Vector without relying on
>> >> sorted-uniqueness.
>> >> > +  bool has(value_type Entry) const {
>> >>
>> >> This should be using const_reference instead of forcing a by-value
>> copy.
>> >>
>> >> > +    if (IsSorted)
>> >> > +      return std::binary_search(Vector.begin(), Vector.end(),
>> Entry);
>> >>
>> >> This function is using operator< to perform the comparisons instead of
>> >> using CMP().
>> >>
>> >> > +
>> >> > +    return std::find(Vector.begin(), Vector.end(), Entry) !=
>> >> Vector.end();
>> >>
>> >> This function is using operator== to perform the comparisons instead
>> >> of using a template parameter for std::equal_to.
>> >>
>> >>
>> > I can provide another revised patch with these feedback in mind.
>> >
>> >
>> >> > +  }
>> >>
>> >> I find it really strange that so many member functions call
>> >> AssertIsSorted(), but the only truly interesting part of the interface
>> >> does not. Why is it valid for you to call has() without using a sorted
>> >> vector? This means the complexity of calling this function is either
>> >> O(N) or O(log N), and you have no way of knowing which you'll get
>> >> unless you call sortUnique() first.
>> >>
>> >> > +
>> >>
>> >
>> > There are places that call MachineBasicBlock::isLiveIn. They just need
>> to
>> > know if something is in the set or not.
>> > I thought this would be a way to migrate existing code that checked if
>> the
>> > entry was in the set before appending to it without changing a lot of
>> the
>> > client code just yet.
>> >
>> > Me and Quentin actually discussed this in the original feedback for the
>> > patch.
>> > We were planning to add a heuristic to decide after a certain number of
>> > calls to has whether to sort.
>> >
>> >
>> >> > +  /// \brief Returns a reference to the Entry with the specified
>> index.
>> >> > +  const value_type &operator[](unsigned index) const {
>> >>
>> >> This should be returning a const_reference and taking a size_type.
>> >>
>> >> > +    assert(index < size() && "SortedVector index is out of range!");
>> >> > +    AssertIsSorted();
>> >> > +    return Vector[index];
>> >> > +  }
>> >> > +
>> >> > +  /// \brief returns an iterator to the start of vector. Asserts if
>> >> vector is
>> >> > +  /// not sorted and uniqued.
>> >> > +  iterator begin() {
>> >> > +    AssertIsSorted();
>> >> > +    return Vector.begin();
>> >> > +  }
>> >> > +
>> >> > +  /// \brief Returns a const_iterator to the start of Vector.
>> Asserts if
>> >> > +  /// Vector is not sorted and uniqued.
>> >> > +  const_iterator begin() const {
>> >> > +    AssertIsSorted();
>> >> > +    return Vector.begin();
>> >> > +  }
>> >> > +
>> >> > +  /// \brief returns an iterator to the end of vector. Asserts if
>> >> vector is
>> >> > +  /// not sorted and uniqued.
>> >> > +  iterator end() {
>> >> > +    AssertIsSorted();
>> >> > +    return Vector.end();
>> >> > +  }
>> >> > +
>> >> > +  /// \brief Returns a const_iterator to the end of Vector. Asserts
>> if
>> >> > +  /// Vector is not sorted and uniqued.
>> >> > +  const_iterator end() const {
>> >> > +    AssertIsSorted();
>> >> > +    return Vector.end();
>> >> > +  }
>> >> > +
>> >> > +  /// \brief Erases Vector at position. Asserts if Vector not
>> sorted and
>> >> > +  /// uniqued.
>> >> > +  iterator erase(iterator position) {
>> >> > +    AssertIsSorted();
>> >>
>> >> A vector iterator is a direct index into the vector; why is this
>> >> assertion required for erasing an arbitrary element?
>> >>
>> >
>> > Ah yeah, this should be taken out.
>> >
>> >
>> >>
>> >> > +    return Vector.erase(position);
>> >> > +  }
>> >> > +
>> >> > +  /// \brief Returns the total number of entries in Vector. Asserts
>> if
>> >> Vector
>> >> > +  /// not sorted and uniqued.
>> >> > +  size_type size() const {
>> >> > +    AssertIsSorted();
>> >>
>> >> Why is this assertion required for testing the size?
>> >>
>> >>
>> > My thinking was, that the size of the number of meaningful entries in
>> the
>> > set would change when it gets uniqued so calling size before calling
>> > sortUnique not correct.
>> >
>> >
>> >> > +    return Vector.size();
>> >> > +  }
>> >> > +
>> >> > +  /// \brief Returns true if Vector is empty.
>> >> > +  bool empty() const {
>> >> > +    return Vector.empty();
>> >> > +  }
>> >> > +
>> >> > +  /// \brief Clears all Entries in Vector.
>> >> > +  void clear() {
>> >> > +    IsSorted = true;
>> >> > +    Vector.clear();
>> >> > +  }
>> >> > +};
>> >>
>> >> There is a lot of functionality missing from this interface that are
>> >> concerns. For instance, there's no rvalue references, no constructors,
>> >> pop_back, rbegin()/rend(), etc. This seems like a very limited
>> >> interface in some ways, while way too permissive of an interface in
>> >> other ways.
>> >>
>> >
>> > That can be said for other ADTs as well, and I didn't know these were
>> > requirements.
>> > Do I need to have everyone of these standard functions out the gate
>> just to
>> > add a new ADT?
>> >
>> >
>> >>
>> >> Even if you expanded the interface to cover more of the useful cases
>> >> and fix the issues from the review, I am not certain this meets the
>> >> bar to be an ADT. This, ultimately, is the stock STL vector that needs
>> >> to remain ordered and unique. Why is an existing ordered, unique list
>> >> (like std::set) insufficient?
>> >>
>> >>
>> > Because I think the implementation would be too messy in the code that
>> used
>> > it, because I thought it would be better than having the same 3 lines of
>> > boilerplate sort-unique code copied and pasted, and also because I
>> really
>> > thought this was a common enough pattern and that with the compile time
>> > benefit for the use case it was trying to address would be sufficient
>> > reason to add this one ADT to llvm.
>>
>> Well, there's a trade off. To add something to ADT it needs to be useful
>> enough to justify its complexity. If instead we make this type private
>> to the implementation of MachineBasicBlock, a lot of the abstraction is
>> unnecessary: We can use operator< rather than allowing users to provide
>> std::less, we don't need the typedefs for concepts we don't plan to use,
>> etc.
>>
>> OTOH, if this simplifies a common pattern, then the extra complexity to
>> implement a more generic ADT is worthwhile.
>>
>> > You mentioned unique list (I assume You meant UniqueVector?), I don't
>> think
>> > UniqueVector would pass some of the requirements stated here because I
>> > don't think the interface has all the functionality you mentioned was
>> > missing in SortedVector.
>> >
>> > Also, I don't think the compile time impact of using std::set or
>> > UniqueVector makes them a good candidate for replacing std::vector in
>> > MachineBasicBlock::LiveIns (both use a red black tree if I am not
>> > mistaken).
>>
>> So this is the interesting part. The use case where a sorted/uniqued
>> std::vector is obviously better than a set type is where there's a phase
>> of filling the vector, some convenient place to sort, and then a phase
>> of look ups. If you need to sort and unique repeatedly it's less obvious
>> which data type will be better.
>>
>> I get the impression from the implementation that the point of sorted
>> vector is generally to make it easier to catch mistakes when doing the
>> fill-sort-use kind of use case. Is that right?
>>
>>
> Yes, this is exactly right.
>

It sounds like Justin described two different things: a write-sort-read and
a "maintain sorted". Does this usage need both of those? I'd hesitate to
build that into a data structure for common use as it seems a bit narrow.
If it's only one of them (maintain sorted /or/ write-sort-read) then a
generic data structure might be useful, but the write-sort-read doesn't
seem to have a lot of abstraction gains, it's just a call to sort at one
point and documentation that it's unsorted before that point and sorted
after. If there was an abstraction to add it'd simply be a debug aid to
ensure certain operations were only used in one state or the other.


> Originally I had done some compile time performance measurements of llvm
> and noticed that some of the code in VirtRegRewriter was slowed down by
> constant calls to isLiveIns (basically, a std::find on every insertion),
> and so I replaced it all with the standard sorting code. Later it was deem
> important to also keep the LiveIns from getting unsorted and since it's
> supposed to be a set  also un-uniqued.
>

An API that had a clearer/narrower contract might be useful, but I'd still
be unsure whether it has much applicability outside this particular
situation.

By clearer/narrower API I mean supporting that particular state sequence:
an unsorted state where nothing interesting happens. An explicit transition
to a sorted state (& possibly a way to get out of that state again). And a
means to maintain that state once you're in it.

But that does seem subtle to have different costs on the same operation
depending on state.

Do you have performance numbers comparing this to something like
std::set/vector+SortedVector, or whatnot?


> I looked at different ways of doing this as reflected in different
> versions of the patch and the most straight forward was to have an ADT that
> could be reused but also isolate it's use to MachineBasicBlock because
> these LiveIns are added to everywhere (CodeGen, Targets, etc).
>
>
>> > I can go and change MachineBasicBlock::LiveIns' iterators to just call
>> sort
>> > on the std::vector it was using but I think it would really make that
>> code
>> > less readable, and I think there are probably plenty of other places in
>> > LLVM that could use this (albeit very simple) ADT.
>>
>> You could also create a private SortedVector type as a helper for
>> MachineBasicBlock's LiveIns implementation - the discussion on whether
>> the type belongs in ADT is orthogonal to whether the type is useful for
>> your use case.
>>
>>
> Sure, I guess that works for now.
> I guess I had assumed that most should end up in ADT unless they are
> highly specialized for the code using it.
>
> Thanks Justin
>
> -PL
>
>
>> > -PL
>> >
>> >
>> >
>> >> ~Aaron
>> >>
>> >> > +
>> >> > +} // End of namespace llvm
>> >> > +
>> >> > +#endif // LLVM_ADT_SORTEDVECTOR_H
>> >> > +
>> >> >
>> >>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150731/4c16d5c7/attachment.html>


More information about the llvm-commits mailing list