[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Fri Dec 8 03:32:04 PST 2017

hokein added a comment.

Thanks for the useful comments!

================
Comment at: clangd/Symbol.h:1
+//===--- Symbol.h -----------------------------------------------*- C++-*-===//
+//
----------------
sammccall wrote:
> sammccall wrote:
> > I think that:
> >  - there's other places in clangd that deal with symbols too, this is specifically for indexing
> >  - the index interface belongs alongside Symbol
> > I'd suggest calling this Index.h
> I don't think having `Symbol`s be completely self-contained objects and passing them around in standard containers like `set`s will prove to be ideal.
> It means they can't share storage for e.g. location filename, that it's hard for us to arena-allocate them, etc.
> 
> I think we could use the concept of a set of symbols which share a lifetime. An initial version might just be
> 
>     class SymbolSlab {
>     public:
>       using iterator = DenseSet<Symbol>::iterator;
>       iterator begin();
>       iterator end();
>     private:
>       DenseSet<Symbol> Symbols;
>     }
> 
> But it's easy to add `StringPool` etc there.
> Then this is the natural unit of granularity of large sets of symbols:  a dynamic index that deals with one file at a time would operate on (Filename, SymbolSlab) pairs. SymbolCollector would return a SymbolSlab, etc.
> 
> Then indexes can be built on top of this using non-owning pointers.
+1 It makes sense. 

For initial version, the `Symbol` structure is still owning its fields naively, we could improve it (change to pointer or references) in the future.

================
Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the
----------------
malaperle wrote:
> malaperle wrote:
> > sammccall wrote:
> > > ioeric wrote:
> > > > Is this relative or absolute?
> > > Having every symbol own a copy of the filepath seems wasteful.
> > > It seems likely that the index will have per-file information too, so this representation should likely be a key to that. Hash of the filepath might work?
> > How we model it is that a symbol doesn't have a "location", but its occurrence do. One could consider the location of a symbol to be either its declaration occurrence (SymbolRole::Declaration) or its definition (SymbolRole::Definition).
> > What we do to get the location path is each occurrence has a pointer (a "database" pointer, but it doesn't matter) to a file entry and then we get the path from the entry.
> > 
> > So conceptually, it works a bit like this (although it fetches information on disk).
> > ```
> > class IndexOccurrence {
> > IndexOccurrence *FilePtr;
> > 
> > std::string Occurrence::getPath() {
> >   return FilePtr->getPath();
> > }
> > };
> > ```
> Oops, wrong type for the field, it should have been:
> ```
> class IndexOccurrence {
> IndexFile *FilePtr;
> 
> std::string Occurrence::getPath() {
>   return FilePtr->getPath();
> }
> };
> ```
> Is this relative or absolute?

 Whether the file path is relative or absolute depends on the build system, the file path could be relative (for header files) or absolute (for .cc files).

> How we model it is that a symbol doesn't have a "location", but its occurrence do.

We will also have a SymbolOccurence structure alongside with Symbol (but it is not  in this patch). The "Location" will be a part of SymbolOccurrence.

================
Comment at: clangd/Symbol.h:26
+  // source file.
+  unsigned StartOffset;
+  // The offset to the last character of the symbol from the beginning of the
----------------
ioeric wrote:
> 0-based or 1-based?
The offset is equivalent to FileOffset in `SourceLocation`. I can't find any document about the FileOffset in LLVM, but it is 0-based.

================
Comment at: clangd/Symbol.h:32
+
+struct CodeCompletionInfo {
+  // FIXME: add fields here.
----------------
sammccall wrote:
> Let's not add this until we know what's in it.
> There's gong to be an overlap between information needed for CC and other use cases, so this structure might not help the user navigate.
Removed it.

================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
malaperle wrote:
> sammccall wrote:
> > hokein wrote:
> > > malaperle wrote:
> > > > I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?
> > > I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. 
> > > 
> > > We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on. 
> > > We can't load all the symbol occurrences into the memory since they are too large
> > 
> > I've heard this often, but never backed up by data :-)
> > 
> > Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.
> > 
> > If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.
> > 
> > (Maybe there are edge cases with macros/templates, but we can keep them under control)
> I'd have to break down how much memory it used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.
> 
I can see two points here:

1. For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory.
2. For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).  

================
Comment at: clangd/Symbol.h:39
+  // The symbol identifier, using USR.
+  std::string Identifier;
+  // The qualified name of the symbol, e.g. Foo::bar.
----------------
sammccall wrote:
> ioeric wrote:
> > It might make sense to just call this `USR` to be more explicit.
> USRs are large. Can we use a fixed-size hash?
USR is one of the implementations of the Identifier, I would keep the name here -- we might change the implementation (like hash(USR)) afterwards.

We can use  `hash` would make the query symbol from USR  a bit harder -- we have to maintain a hash-to-USR lookup table.  I think we can do it later when it shows a large memory consumption?

================
Comment at: clangd/Symbol.h:51
+  //   * For classes, the canonical location is where they are defined.
+  SymbolLocation CanonicalLocation;
+  // Information for code completion.
----------------
ioeric wrote:
> For functions and classes, should we store both declaration and definition locations?
I think the symbol occurrences would include both declaration and definition locations. 

The `CanonicalLocation` is providing a fast/convenient way to find the most interested location of the symbol (e.g. for code navigation, or include the missing path for a symbol), without symbol occurrences. 

================
Comment at: clangd/Symbol.h:55
+
+  bool operator<(const Symbol& S) const {
+    return Identifier < S.Identifier;
----------------
sammccall wrote:
> I'd suggest == and a hash function instead, unless we think this ordering is particularly meaningful?
It was required when using `Symbol` structure in a set standard library. Removed it since we are using the `DenseSet` now.

================
Comment at: clangd/Symbol.h:68
+// changed.
+class SymbolCollector : public index::IndexDataConsumer {
+public:
----------------
sammccall wrote:
> Please pull this into a separate file. Someone providing e.g. symbols from a YAML file shouldn't need to pull in AST stuff.
> Mabye `IndexFromAST`, which would sort nicely next to `Index`?
I can see various meaning of "Index" here:

1. `Index` in `index::IndexDataConsumer`, which collects and contructs all symbols by traversing the AST.
2. `Index` term using in clangd, specially for build index on top of these collected symbols.

I think we should be consistent the index for 2), and `SymbolCollector` is more descriptive here.

================
Comment at: clangd/Symbol.h:74
+
+  bool
+  handleDeclOccurence(const Decl *D, index::SymbolRoleSet Roles,
----------------
sammccall wrote:
> what is the boolean for? you always return true
return `true` means continue indexing, while false means abort the indexing. Have added the comment.

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897