[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Dec 7 02:28:31 PST 2017


sammccall added a comment.

Thanks for putting this together! Have a bit of a braindump here, happy to discuss further either here or offline.



================
Comment at: clangd/Symbol.h:1
+//===--- Symbol.h -----------------------------------------------*- C++-*-===//
+//
----------------
I think that:
 - there's other places in clangd that deal with symbols too, this is specifically for indexing
 - the index interface belongs alongside Symbol
I'd suggest calling this Index.h


================
Comment at: clangd/Symbol.h:1
+//===--- Symbol.h -----------------------------------------------*- C++-*-===//
+//
----------------
sammccall wrote:
> I think that:
>  - there's other places in clangd that deal with symbols too, this is specifically for indexing
>  - the index interface belongs alongside Symbol
> I'd suggest calling this Index.h
I don't think having `Symbol`s be completely self-contained objects and passing them around in standard containers like `set`s will prove to be ideal.
It means they can't share storage for e.g. location filename, that it's hard for us to arena-allocate them, etc.

I think we could use the concept of a set of symbols which share a lifetime. An initial version might just be

    class SymbolSlab {
    public:
      using iterator = DenseSet<Symbol>::iterator;
      iterator begin();
      iterator end();
    private:
      DenseSet<Symbol> Symbols;
    }

But it's easy to add `StringPool` etc there.
Then this is the natural unit of granularity of large sets of symbols:  a dynamic index that deals with one file at a time would operate on (Filename, SymbolSlab) pairs. SymbolCollector would return a SymbolSlab, etc.

Then indexes can be built on top of this using non-owning pointers.


================
Comment at: clangd/Symbol.h:23
+  // The path of the source file where a symbol occurs.
+  std::string FilePath;
+  // The offset to the first character of the symbol from the beginning of the
----------------
ioeric wrote:
> Is this relative or absolute?
Having every symbol own a copy of the filepath seems wasteful.
It seems likely that the index will have per-file information too, so this representation should likely be a key to that. Hash of the filepath might work?


================
Comment at: clangd/Symbol.h:32
+
+struct CodeCompletionInfo {
+  // FIXME: add fields here.
----------------
Let's not add this until we know what's in it.
There's gong to be an overlap between information needed for CC and other use cases, so this structure might not help the user navigate.


================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
hokein wrote:
> malaperle wrote:
> > I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?
> I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. 
> 
> We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on. 
> We can't load all the symbol occurrences into the memory since they are too large

I've heard this often, but never backed up by data :-)

Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.

If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.

(Maybe there are edge cases with macros/templates, but we can keep them under control)


================
Comment at: clangd/Symbol.h:39
+  // The symbol identifier, using USR.
+  std::string Identifier;
+  // The qualified name of the symbol, e.g. Foo::bar.
----------------
ioeric wrote:
> It might make sense to just call this `USR` to be more explicit.
USRs are large. Can we use a fixed-size hash?


================
Comment at: clangd/Symbol.h:55
+
+  bool operator<(const Symbol& S) const {
+    return Identifier < S.Identifier;
----------------
I'd suggest == and a hash function instead, unless we think this ordering is particularly meaningful?


================
Comment at: clangd/Symbol.h:68
+// changed.
+class SymbolCollector : public index::IndexDataConsumer {
+public:
----------------
Please pull this into a separate file. Someone providing e.g. symbols from a YAML file shouldn't need to pull in AST stuff.
Mabye `IndexFromAST`, which would sort nicely next to `Index`?


================
Comment at: clangd/Symbol.h:72
+
+  const std::set<Symbol> &getSymbols() const { return Symbols; }
+
----------------
do you really mean to copy here?


================
Comment at: clangd/Symbol.h:74
+
+  bool
+  handleDeclOccurence(const Decl *D, index::SymbolRoleSet Roles,
----------------
what is the boolean for? you always return true


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897





More information about the cfe-commits mailing list