[RFC] SortedVector ADT

Aaron Ballman aaron at aaronballman.com
Fri Jul 31 06:27:45 PDT 2015


On Wed, Jul 29, 2015 at 8:40 PM, Puyan Lotfi <plotfi at apple.com> wrote:
> Justin, Aaron:
>
> These are the revised patches.
> I have tried to address all of your feedback with the exception of the naming of the ADT.
> I was thinking, if I did rename it I could go with SortedVectorSet or SortedSet.
>
> Anyways, I have added:
>
> - Better documentation changes that try to be more clear on how and when to use SortedVector.
> - Better comments.
> - Removed erase().
> - Replaced reset with clear.
> - Renamed insert to push_back
> - Added unit testing.
>
> Hope this is enough to put my patch back in; please let me know if I missed anything.

Comments below. Btw, I think this discussion should be on its own
thread instead of attached to a commit revert message. I suspect there
are others in the community who have opinions on this ADT, but are
unaware of the discussion due to context. I've taken the liberty of
starting a new email thread for this, and included a few usual
suspects who review ADTs.

For anyone joining in the middle, this is a discussion of an ADT Puyan
proposed in:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150720/288930.html

That was commit (and eventually reverted) in:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150727/289945.html

> diff --git a/include/llvm/ADT/SortedVector.h b/include/llvm/ADT/SortedVector.h
> new file mode 100644
> index 0000000..ddafd6b
> --- /dev/null
> +++ b/include/llvm/ADT/SortedVector.h
> @@ -0,0 +1,137 @@
> +//===-- llvm/ADT/SortedVector.h ---------------------------------*- C++ -*-===//
> +//
> +//                     The LLVM Compiler Infrastructure
> +//
> +// This file is distributed under the University of Illinois Open Source
> +// License. See LICENSE.TXT for details.
> +//
> +//===----------------------------------------------------------------------===//
> +
> +#ifndef LLVM_ADT_SORTEDVECTOR_H
> +#define LLVM_ADT_SORTEDVECTOR_H
> +
> +#include <vector>
> +#include <cassert>
> +#include <functional>
> +
> +namespace llvm {
> +
> +/// \brief A vector with minimal append overhead that enforces a sorted
> +/// ordering of unique elements prior to access. The most common use case
> +/// is to push_back multiple Entries and call sortUnique afterward.

I am wondering why this should be an ADT instead of one-off
functionality. I see that you're going to use it for
MachineBasicBlock, but are there other use cases as well? The kind of
functionality that this class provides is covered by basic STL
mechanisms, and so I am hesitant to support it as an ADT for all of
our users unless this is something we expect will be heavily used (or
is heavily customized to support toolchain needs).

Also, the comments do not specify what ordering it is sorted by (we
can presume from the std::less, but it would be good to be explicit
since this is user-facing documentation).

> +template<typename T, typename CMP = std::less<T>>
> +class SortedVector {
> +private:
> +  typedef T value_type;
> +  typedef typename std::vector<value_type> vector_type;
> +  typedef typename vector_type::const_iterator const_iterator;
> +  typedef typename vector_type::iterator iterator;
> +  typedef typename vector_type::size_type size_type;

These typedefs should be public, aside from vector_type.

Missing typedefs for difference_type, reference and const_reference.

> +
> +  vector_type Vector;
> +  bool IsSorted = true;
> +
> +  void AssertIsSorted() const {
> +    assert(IsSorted && "Unsorted SortedVector access; call sortUnique prior.");
> +  }
> +
> +public:
> +  /// \brief Appends Entry to the end of Vector; avoids appending duplicate
> +  /// entries. If appending Entry causes Vector to be unsorted, then sortUnique
> +  /// must be called prior to access.

The public comments are still discussing private implementation
details like Vector.

How is the user of this function supposed to know when appending the
entry causes the vector to be unsorted?

> +  void push_back(const value_type &Entry) {

Should be using the typedef for const_reference.

> +    if (!Vector.size())

Vector.empty() is more clear.

> +      Vector.push_back(Entry);

After pushing back in an empty vector, why are you relying on the
std::equal_to to perform the return?

> +
> +    // If Entry is a duplicate of the previous Entry we skip.
> +    if (std::equal_to<value_type>()(Vector.back(), Entry))
> +      return;

Why is this forcing use of std::equal_to, but we can specialize on
something other than std::less? If someone is implementing a custom
std::less implementation, it's very likely they would want a custom
std::equal_to used as well.

> +
> +    IsSorted &= (CMP()(Vector.back(), Entry));
> +    Vector.push_back(Entry);
> +  }
> +
> +  // \brief Sorts and uniques Vector.
> +  void sortUnique() {
> +    if (IsSorted)
> +      return;

This still isn't correct. Consider code like:

SortedVector<int> V;
V.push_back(1);
V.push_back(2);
V.push_back(3);
*V.begin() = 3;

Not only is the vector no longer sorted (but it will claim that it is
because IsSorted will be true), it's also not unique. We then fail to
sort or unique the backing store.

In order for this ADT to be viable, the underlying vector elements
must be const so that you can never mutate their values through the
ADT interface.

> +
> +    std::sort(Vector.begin(), Vector.end());
> +    Vector.erase(std::unique(Vector.begin(), Vector.end()), Vector.end());

Neither of these STL functions are using the custom comparator.
Instead they use operator< on the object type, which is inconsistent.
Should pass in CMP() to std::sort(), and should pass in the template
parameter for std::equal_to for std::unique().

> +    IsSorted = true;
> +  }
> +
> +  /// \brief Tells if Entry is in Vector without relying on sorted-uniqueness.
> +  bool has(value_type Entry) const {

This should be using const_reference instead of forcing a by-value copy.

> +    if (IsSorted)
> +      return std::binary_search(Vector.begin(), Vector.end(), Entry);

This function is using operator< to perform the comparisons instead of
using CMP().

> +
> +    return std::find(Vector.begin(), Vector.end(), Entry) != Vector.end();

This function is using operator== to perform the comparisons instead
of using a template parameter for std::equal_to.

> +  }

I find it really strange that so many member functions call
AssertIsSorted(), but the only truly interesting part of the interface
does not. Why is it valid for you to call has() without using a sorted
vector? This means the complexity of calling this function is either
O(N) or O(log N), and you have no way of knowing which you'll get
unless you call sortUnique() first.

> +
> +  /// \brief Returns a reference to the Entry with the specified index.
> +  const value_type &operator[](unsigned index) const {

This should be returning a const_reference and taking a size_type.

> +    assert(index < size() && "SortedVector index is out of range!");
> +    AssertIsSorted();
> +    return Vector[index];
> +  }
> +
> +  /// \brief returns an iterator to the start of vector. Asserts if vector is
> +  /// not sorted and uniqued.
> +  iterator begin() {
> +    AssertIsSorted();
> +    return Vector.begin();
> +  }
> +
> +  /// \brief Returns a const_iterator to the start of Vector. Asserts if
> +  /// Vector is not sorted and uniqued.
> +  const_iterator begin() const {
> +    AssertIsSorted();
> +    return Vector.begin();
> +  }
> +
> +  /// \brief returns an iterator to the end of vector. Asserts if vector is
> +  /// not sorted and uniqued.
> +  iterator end() {
> +    AssertIsSorted();
> +    return Vector.end();
> +  }
> +
> +  /// \brief Returns a const_iterator to the end of Vector. Asserts if
> +  /// Vector is not sorted and uniqued.
> +  const_iterator end() const {
> +    AssertIsSorted();
> +    return Vector.end();
> +  }
> +
> +  /// \brief Erases Vector at position. Asserts if Vector not sorted and
> +  /// uniqued.
> +  iterator erase(iterator position) {
> +    AssertIsSorted();

A vector iterator is a direct index into the vector; why is this
assertion required for erasing an arbitrary element?

> +    return Vector.erase(position);
> +  }
> +
> +  /// \brief Returns the total number of entries in Vector. Asserts if Vector
> +  /// not sorted and uniqued.
> +  size_type size() const {
> +    AssertIsSorted();

Why is this assertion required for testing the size?

> +    return Vector.size();
> +  }
> +
> +  /// \brief Returns true if Vector is empty.
> +  bool empty() const {
> +    return Vector.empty();
> +  }
> +
> +  /// \brief Clears all Entries in Vector.
> +  void clear() {
> +    IsSorted = true;
> +    Vector.clear();
> +  }
> +};

There is a lot of functionality missing from this interface that are
concerns. For instance, there's no rvalue references, no constructors,
pop_back, rbegin()/rend(), etc. This seems like a very limited
interface in some ways, while way too permissive of an interface in
other ways.

Even if you expanded the interface to cover more of the useful cases
and fix the issues from the review, I am not certain this meets the
bar to be an ADT. This, ultimately, is the stock STL vector that needs
to remain ordered and unique. Why is an existing ordered, unique list
(like std::set) insufficient?

~Aaron

> +
> +} // End of namespace llvm
> +
> +#endif // LLVM_ADT_SORTEDVECTOR_H
> +
>



More information about the llvm-commits mailing list