[PATCH] D26870: Outliner: Add uniquely terminated strings for a suffix tree

Fri Nov 18 15:20:20 PST 2016

MatzeB added a comment.

Hi Jessica,

congratulations on your first llvm patch! Thanks for working on this!

Here's a bunch of mostly nitpicky stuff and a few higher level design questions. We should discuss them; we may not need to address all of the highlevel issues in the first versions:

================
Comment at: include/llvm/ADT/TerminatedString.h:10-14
+// A TerminatedString is a string which has a unique senteniel character at
+// the end of it. Each TerminatedString consists of Characters of some type.
+// A Character is either a Terminator or an instance of some type which can
+// be compared against other types. TerminatedStrings ensure that the strings
+// inserted into a SuffixTree are uniquely terminated.
----------------
Should be a doxygen comment (/// \file ...), see llvm coding standards.

================
Comment at: include/llvm/ADT/TerminatedString.h:18-19
+
+#ifndef TERM_STRING_H
+#define TERM_STRING_H
+
----------------
Should start with `LLVM_ADT_`

================
Comment at: include/llvm/ADT/TerminatedString.h:24
+
+using namespace llvm;
+
----------------
Don't do this in a header, getting extra llvm symbols in the root namespace just because you included a header is unexpected.

================
Comment at: include/llvm/ADT/TerminatedString.h:26
+
+///============================ Character Type ==============================///
+
----------------
This should not be a doxygen comment (or maybe completely removed?)

================
Comment at: include/llvm/ADT/TerminatedString.h:28
+
+/// Character
+/// Represents a character or a terminator.
----------------
Don't repeat the class (or function) name before documenting it. (Doxygen will take the first line until the dot as a short description of the entity).

================
Comment at: include/llvm/ADT/TerminatedString.h:38
+
+template <typename C> struct Character {
+  bool IsTerminator; /// True if the current character is a terminator.
----------------
Do you expect this to be instantiated with various types? Would avoiding the template and hardcoding something like uint32_t make sense? (I didn't look at the using code yet).

================
Comment at: include/llvm/ADT/TerminatedString.h:39
+template <typename C> struct Character {
+  bool IsTerminator; /// True if the current character is a terminator.
+  C Symbol;          /// The actual character, eg 'a', 1, etc.
----------------
doxygen comment after the field should be `///<`

================
Comment at: include/llvm/ADT/TerminatedString.h:41
+  C Symbol;          /// The actual character, eg 'a', 1, etc.
+  size_t Id;         /// If this is a terminator, then this is the Id.
+
----------------
Do expect that many terminators that you need size_t (and are they pointing somewhere) or would an `unsigned` be enough?

I think it should be possible long term to get to a design where this is just a basic datatype (say an int with < 0 for a terminator, >= 0 for a normal symbol).

================
Comment at: include/llvm/ADT/TerminatedString.h:43
+
+  ///----------------  Comparisons between character types ------------------///
+
----------------
should not be doxygen (or removed)

================
Comment at: include/llvm/ADT/TerminatedString.h:45
+
+  /// ==: Equality between character types can only occur when they are of the
+  /// same character class.
----------------
Don't repeat the function name before documenting it. Similar with most following functions.

================
Comment at: include/llvm/ADT/TerminatedString.h:47
+  /// same character class.
+  bool operator==(const Character &Rhs) const {
+    bool IsEqual = false;
----------------
I'd consider using `Character` instead of `const Character&` here given that the whole struct fits easily into two registers on a 64bit machine.
Similar for the other functions.

================
Comment at: include/llvm/ADT/TerminatedString.h:64
+
+    return (Symbol == Rhs);
+  }
----------------
braces seem superfluous.

================
Comment at: include/llvm/ADT/TerminatedString.h:71-72
+
+    /// Terminators can only match against other terminators for string
+    /// comparisons.
+    if (IsTerminator) {
----------------
`//` is enough inside a function. Similar at a few other places.

================
Comment at: include/llvm/ADT/TerminatedString.h:101-120
+    bool IsGreaterThan;
+
+    /// Terminators can only match against other terminators for string
+    /// comparisons.
+    if (IsTerminator) {
+      if (Rhs.IsTerminator)
+        IsGreaterThan = (Id > Rhs.Id);
----------------
Do we actually need a full implementation of >? Or could we just leave it at < and ==?
(May be slightly harder to use in some cases but less code to review and debug in case of errors)

================
Comment at: include/llvm/ADT/TerminatedString.h:144-146
+    OS << Ch.Symbol << " ";
+  else
+    OS << "#" << Ch.Id << "# ";
----------------
I wouldn't expect a suffix space when printing a single character.
`'C'` may be slightly faster than `"C"` for a single character.

================
Comment at: include/llvm/ADT/TerminatedString.h:161-163
+template <typename InputContainer, typename CharacterType>
+struct TerminatedString {
+private:
----------------
Generally: Does this class have enough interesting behaviour of just a SmallVector<CharacterType>/std::vector<Character>? Just providing a few convinience functions for the few bits that differ from a normal container may be the easier and more flexible solution.

================
Comment at: include/llvm/ADT/TerminatedString.h:162
+template <typename InputContainer, typename CharacterType>
+struct TerminatedString {
+private:
----------------
why not `class`?

================
Comment at: include/llvm/ADT/TerminatedString.h:176
+  size_t size() const { return StrContainer.size(); }
+  size_t size() { return StrContainer.size(); }
+  size_t length() { return StrContainer.size() - 1; }
----------------
unnecessary.

================
Comment at: include/llvm/ADT/TerminatedString.h:178
+  size_t length() { return StrContainer.size() - 1; }
+  size_t length() const { return StrContainer.size() - 1; }
+  CharLike getTerminator() const { return StrContainer.back(); }
----------------
unnecessary.

================
Comment at: include/llvm/ADT/TerminatedString.h:179
+  size_t length() const { return StrContainer.size() - 1; }
+  CharLike getTerminator() const { return StrContainer.back(); }
+
----------------
why not call it back() to keep in line with STL containers?

================
Comment at: include/llvm/ADT/TerminatedString.h:181
+
+  ///-------------------  Iterators for TerminatedStrings -------------------///
+  typedef typename CharList::iterator iterator;
----------------
should not be a doxygen comment as it is not about just the next typedef.

================
Comment at: include/llvm/ADT/TerminatedString.h:191
+
+  /// Iterators where the terminator is one past the end.
+  iterator chars_begin() { return StrContainer.begin(); }
----------------
dito.

================
Comment at: include/llvm/ADT/TerminatedString.h:327
+  TerminatedString(const InputContainer &C) {
+    for (auto it = C.begin(); it != C.end(); it++) {
+      CharLike Ch;
----------------
davide wrote:
> same here, can you use a range-loop?
Use range based for: `for (CharLike Ch : C)`

================
Comment at: include/llvm/ADT/TerminatedString.h:344-345
+  TerminatedString(const TerminatedString &other) {
+    for (size_t i = 0; i < other.size(); i++)
+      StrContainer.push_back(other[i]);
+  }
----------------
Just use the copy constructor of StrContainer.

================
Comment at: include/llvm/ADT/TerminatedString.h:353-355
+    Term.Id = StrId;
+    StrContainer.push_back(Term);
+    StrId++;
----------------
Shouldn't you rather maintain the next Id in the SuffixTree instead of a global variable? That would keep the Id space more compressed in case of multiple suffix trees being used. It is also not thread safe even if each thread just stays to its own suffix tree.

================
Comment at: include/llvm/ADT/TerminatedString.h:362
+                        const TerminatedString<InputContainer, CharLike> &Str) {
+  for (auto &Ch : Str)
+    OS << Ch;
----------------
I'd tend to not use a reference as Character is sufficiently small.
Avoid `auto` in cases where it is sufficiently easy to write the type and where it isn't mention anyway somewhere in the expression. i.e. `auto X = new MyType();` is fine as you see immediately what type it is, I'd tend to avoid other cases where the type is not immediately clear. Though I'd make an exception for overly long/verbose types (naming iterators in containers).

================
Comment at: include/llvm/ADT/TerminatedString.h:377
+/// compound string. TerminatedStringLists act exactly like
+/// normal TerminatedStrings, but allow us to masize_tain the indivIduality of
+/// specific strings.
----------------
too much search&replace of `int` ;-)

================
Comment at: include/llvm/ADT/TerminatedString.h:380-381
+
+template <typename InputContainer, typename CharLike>
+struct TerminatedStringList {
+private:
----------------
Why not `class`?

The type itself doesn't appear to need `InputContainer`, it could probably be move to the few functions using it.

Similar question as TerminatedString: Does it provide enough custom invariants/behaviour that it warrants an own class instead of just using SmallVector<TerminatedString> and providing a few convenience functions here.

Also is this class expected to be used outside the implementation of the SuffixTree? Maybe we can keep it internal to that?

================
Comment at: include/llvm/ADT/TerminatedString.h:383-384
+private:
+  size_t SizeChars;   // Sum of size() for each string in list
+  size_t SizeStrings; // Number of strings in the list
+
----------------
doxygen comment should be `///<`.

================
Comment at: include/llvm/ADT/TerminatedString.h:404
+
+  /// TODO: Iterators over the characters in the string
+
----------------
there are iterators, aren't there?
Also doesn't need to be doxygen.

================
Comment at: include/llvm/ADT/TerminatedString.h:408
+
+  /// This returns the character at index i, treating the entire
+  /// TerminatedStringList
----------------
it's index `Idx`, it's also good style in doxygen to use `\p Idx` to make it obvious that you are referencing a parameter.

================
Comment at: include/llvm/ADT/TerminatedString.h:482-495
+    assert(Idx < SizeChars && "Out of range! (Idx >= size_chars)");
+    /// Find the string index containing i.
+    size_t StrIdx;
+    for (StrIdx = 0; StrIdx != size() - 1; StrIdx++) {
+      /// Idx must be insIde the string, since it's in the range [0, size-1].
+      if (Idx < Str[StrIdx]->size())
+        break;
----------------
Is accessing strings at "global" indexes common? In that case this is an overly slow operation (O(n) - though with small constant - access) and we should try to move to a model where all strings are actually concatenated together in a single array better.

================
Comment at: include/llvm/ADT/TerminatedString.h:551
+
+  for (auto It = TS.begin(), Et = TS.end(); It != Et; It++)
+    OS << "'" << **It << "'"
----------------
Use range based for.

================
Comment at: include/llvm/ADT/TerminatedString.h:552-553
+  for (auto It = TS.begin(), Et = TS.end(); It != Et; It++)
+    OS << "'" << **It << "'"
+       << ", ";
+
----------------
put it on a single line? Prefer `'C'` over `"C"`.

================
Comment at: unittests/ADT/TerminatedStringListTest.cpp:1
+/// Unit tests for TerminatedStringLists
+
----------------
Should have an llvm header.

================
Comment at: unittests/ADT/TerminatedStringListTest.cpp:214
+}
\ No newline at end of file

----------------
"No newline at end of file"

================
Comment at: unittests/ADT/TerminatedStringTest.cpp:1
+/// Unit tests for TerminatedStrings
+
----------------
missing llvm header.

================
Comment at: unittests/ADT/TerminatedStringTest.cpp:28
+
+  /// Add a character to the string
+  void addOneCharacter(StringType &s, const CharType &c) {
----------------
Comments should form full sentences and end with a dot.
Similar in a few other cases.

Repository:
  rL LLVM

https://reviews.llvm.org/D26870