[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Fri Dec 8 05:43:45 PST 2017

sammccall added a comment.

More comments, but only two major things really:

- I'd like to clearly separate USR from SymbolID (even if you want to keep using USRs as their basis for now)
- the file organization (code within files, and names of files) needs some work I think

Everything else is details, this looks good

================
Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the
----------------
hokein wrote:
> malaperle wrote:
> > malaperle wrote:
> > > sammccall wrote:
> > > > ioeric wrote:
> > > > > Is this relative or absolute?
> > > > Having every symbol own a copy of the filepath seems wasteful.
> > > > It seems likely that the index will have per-file information too, so this representation should likely be a key to that. Hash of the filepath might work?
> > > How we model it is that a symbol doesn't have a "location", but its occurrence do. One could consider the location of a symbol to be either its declaration occurrence (SymbolRole::Declaration) or its definition (SymbolRole::Definition).
> > > What we do to get the location path is each occurrence has a pointer (a "database" pointer, but it doesn't matter) to a file entry and then we get the path from the entry.
> > > 
> > > So conceptually, it works a bit like this (although it fetches information on disk).
> > > ```
> > > class IndexOccurrence {
> > > IndexOccurrence *FilePtr;
> > > 
> > > std::string Occurrence::getPath() {
> > >   return FilePtr->getPath();
> > > }
> > > };
> > > ```
> > Oops, wrong type for the field, it should have been:
> > ```
> > class IndexOccurrence {
> > IndexFile *FilePtr;
> > 
> > std::string Occurrence::getPath() {
> >   return FilePtr->getPath();
> > }
> > };
> > ```
> > Is this relative or absolute?
> 
>  Whether the file path is relative or absolute depends on the build system, the file path could be relative (for header files) or absolute (for .cc files).
> 
> > How we model it is that a symbol doesn't have a "location", but its occurrence do.
> 
> We will also have a SymbolOccurence structure alongside with Symbol (but it is not  in this patch). The "Location" will be a part of SymbolOccurrence.
>  
> Whether the file path is relative or absolute depends on the build system, the file path could be relative (for header files) or absolute (for .cc files).

I'm not convinced this actually works. There's multiple codepaths to the index, how can we ensure we don't end up using inconsistent paths?
e.g. we open up a project that includes a system header using a relative path, and then open up that system header from file->open (editor knows only the absolute path) and do "find references".

I think we need to canonicalize the paths. Absolute is probably easiest.

================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
hokein wrote:
> malaperle wrote:
> > sammccall wrote:
> > > hokein wrote:
> > > > malaperle wrote:
> > > > > I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?
> > > > I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. 
> > > > 
> > > > We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on. 
> > > > We can't load all the symbol occurrences into the memory since they are too large
> > > 
> > > I've heard this often, but never backed up by data :-)
> > > 
> > > Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.
> > > 
> > > If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.
> > > 
> > > (Maybe there are edge cases with macros/templates, but we can keep them under control)
> > I'd have to break down how much memory it used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.
> > 
> I can see two points here:
> 
> 1. For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory.
> 2. For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).  
(This is still a sidebar - not asking for any changes)

The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
And "it's not a good idea" isn't an assertion that can hold without reasons, assumptions, and data.
If the size turns out to be, say, 120MB for LLVM, and we want to scale to 10x that, and we're spending 500MB/file for ASTs, then it might well be a good trade.

================
Comment at: clangd/Symbol.h:39
+  // The symbol identifier, using USR.
+  std::string Identifier;
+  // The qualified name of the symbol, e.g. Foo::bar.
----------------
hokein wrote:
> sammccall wrote:
> > ioeric wrote:
> > > It might make sense to just call this `USR` to be more explicit.
> > USRs are large. Can we use a fixed-size hash?
> USR is one of the implementations of the Identifier, I would keep the name here -- we might change the implementation (like hash(USR)) afterwards.
> 
> We can use  `hash` would make the query symbol from USR  a bit harder -- we have to maintain a hash-to-USR lookup table.  I think we can do it later when it shows a large memory consumption?
Personally I'm less worried about the name, and more about the type.

I think it's important:
 1. that we have a distinct SymbolID type (for conceptual clarity)
 2. that we convert to/from this type in as few places as possible (for flexibility)
 3. that this type be fixed-size, preferably <=64 bits (for performance)
 4. that long-lived references to symbols be expressed as SymbolIDs not pointers (to avoid lifetime confusion)

I'm happy to defer some performance considerations to later (though I'm almost certain this will matter). But i'm not sure it changes the conclusion:

> We can use  hash would make the query symbol from USR a bit harder -- we have to maintain a hash-to-USR lookup table. 

You only need that if you need a USR. But the SymbolID -> USR conversion should be in one place anyway (point 2) otherwise we'll end up coupled to using USRs forever.
And the SymbolID -> Symbol lookup is something the index is going to provide anyway, right?

================
Comment at: clangd/Symbol.h:51
+  //   * For classes, the canonical location is where they are defined.
+  SymbolLocation CanonicalLocation;
+  // Information for code completion.
----------------
hokein wrote:
> ioeric wrote:
> > For functions and classes, should we store both declaration and definition locations?
> I think the symbol occurrences would include both declaration and definition locations. 
> 
> The `CanonicalLocation` is providing a fast/convenient way to find the most interested location of the symbol (e.g. for code navigation, or include the missing path for a symbol), without symbol occurrences. 
I'd be +1 on including both a definition location (if known) and a preferred declaration location, because there's enough use cases that might want definition even if it's not the preferred declaration.

But i'm fine if we want to omit the separate definition for now. In that case, call this CanonicalDeclaration?

================
Comment at: clangd/Symbol.h:68
+// changed.
+class SymbolCollector : public index::IndexDataConsumer {
+public:
----------------
hokein wrote:
> sammccall wrote:
> > Please pull this into a separate file. Someone providing e.g. symbols from a YAML file shouldn't need to pull in AST stuff.
> > Mabye `IndexFromAST`, which would sort nicely next to `Index`?
> I can see various meaning of "Index" here:
> 
> 1. `Index` in `index::IndexDataConsumer`, which collects and contructs all symbols by traversing the AST.
> 2. `Index` term using in clangd, specially for build index on top of these collected symbols.
> 
> I think we should be consistent the index for 2), and `SymbolCollector` is more descriptive here.
`SymbolCollector` is a fine name for the type, but the file should have `Index` in the name, or we should create an `Index` subdirectory. It should be possible to understand which files are part of the index subsystem by scanning the clangd directory.

(Did you see the comment was about separating into a different file, not about renaming the class?)

================
Comment at: clangd/Symbol.h:74
+
+  bool
+  handleDeclOccurence(const Decl *D, index::SymbolRoleSet Roles,
----------------
hokein wrote:
> sammccall wrote:
> > what is the boolean for? you always return true
> return `true` means continue indexing, while false means abort the indexing. Have added the comment.
Oops, sorry, I missed that this was implementing an interface. Please remove the comment or move it to the implementation - it doesn't make sense in the interface. Sorry for the noise!

================
Comment at: clangd/Symbol.h:70
+ public:
+  using Iterator = llvm::DenseSet<Symbol>::const_iterator;
+
----------------
nit: lowercase `iterator` is the STL convention, following it tends to make template code work better

================
Comment at: clangd/Symbol.h:72
+
+  void addSymbol(Symbol S);
+  bool hasSymbol(StringRef Identifier) const;
----------------
This is dangerous if called after reads, as it invalidates iterators and pointers.
I don't think we actually indend to support such mutation, so I'd suggest adding an explicit freeze() function. addSymbol() is only valid before freeze(), and reading functions are only valid after. An assert can enforce this.
(This is a cheap version of a builder, which are more painful to write but may also be worth it).

If we choose not to enforce this at all, the requirement shold be heavily documented!

================
Comment at: clangd/Symbol.h:73
+  void addSymbol(Symbol S);
+  bool hasSymbol(StringRef Identifier) const;
+
----------------
Awkward as it is, don't you want iterator find() here?

================
Comment at: unittests/clangd/SymbolCollectorTests.cpp:104
+  EXPECT_THAT(Symbols,
+              UnorderedElementsAre(Field(&Symbol::QualifiedName, Eq("Foo")),
+                                   Field(&Symbol::QualifiedName, Eq("Foo::f")),
----------------
If using the same repeatedly, please pull out a MATCHER_P for readability:

UnorderedElementsAre(QName("Foo"), ...)

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897