[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Mon Dec 11 03:55:27 PST 2017

sammccall added a comment.

Thanks for the restructuring? I want to take another pass today, but wanted to mention some SymbolID things.

================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
malaperle wrote:
> malaperle wrote:
> > sammccall wrote:
> > > hokein wrote:
> > > > malaperle wrote:
> > > > > sammccall wrote:
> > > > > > hokein wrote:
> > > > > > > malaperle wrote:
> > > > > > > > I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?
> > > > > > > I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. 
> > > > > > > 
> > > > > > > We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on. 
> > > > > > > We can't load all the symbol occurrences into the memory since they are too large
> > > > > > 
> > > > > > I've heard this often, but never backed up by data :-)
> > > > > > 
> > > > > > Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.
> > > > > > 
> > > > > > If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.
> > > > > > 
> > > > > > (Maybe there are edge cases with macros/templates, but we can keep them under control)
> > > > > I'd have to break down how much memory it used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.
> > > > > 
> > > > I can see two points here:
> > > > 
> > > > 1. For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory.
> > > > 2. For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).  
> > > (This is still a sidebar - not asking for any changes)
> > > 
> > > The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
> > > And "it's not a good idea" isn't an assertion that can hold without reasons, assumptions, and data.
> > > If the size turns out to be, say, 120MB for LLVM, and we want to scale to 10x that, and we're spending 500MB/file for ASTs, then it might well be a good trade.
> > > The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
> > 
> > Indeed. I'll try to come up with more realistic numbers. There are other things not accounted for in the 16 bytes mentioned above, like storing roles and relations.
> > 
> > > 500MB/file for ASTs
> > 
> > What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> > What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> 
> Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 files at the same time I it was already around 500MB. Hmm, that's a bit alarming.
Right, just that we have to consider RAM usage for the index in the context of clangd's overall requirements - if other parts of clangd use 1G of ram for typical work on a large project, then we shouldn't rule out spending a couple of 100MB on the index if it adds a lot of value.

================
Comment at: clangd/Symbol.h:72
+
+  void addSymbol(Symbol S);
+  bool hasSymbol(StringRef Identifier) const;
----------------
hokein wrote:
> sammccall wrote:
> > This is dangerous if called after reads, as it invalidates iterators and pointers.
> > I don't think we actually indend to support such mutation, so I'd suggest adding an explicit freeze() function. addSymbol() is only valid before freeze(), and reading functions are only valid after. An assert can enforce this.
> > (This is a cheap version of a builder, which are more painful to write but may also be worth it).
> > 
> > If we choose not to enforce this at all, the requirement shold be heavily documented!
> I think the only user to mutate this object is SymbolCollector.
> 
> Instead of adding `freeze` function, I made it as a private method and declare SymbolCollector as a friend class. Does it look better? 
Hmm, I don't think so.

SymbolCollector being a the only writer is temporary, this is "arena for symbols" and we want to create symbols in other ways (e.g. remote index).

And "friend" to expose one function is a bit iffy.

================
Comment at: clangd/index/Index.h:38
+struct Symbol {
+  // The symbol identifier.
+  unsigned ID;
----------------
This comment doesn't really say anything, and there is a lot to say!
What does it distinguish vs not? How do we guarantee uniqueness?

================
Comment at: clangd/index/Index.h:39
+  // The symbol identifier.
+  unsigned ID;
+  // The qualified name of the symbol, e.g. Foo::bar.
----------------
please make symbol ID a real type, constructible from USR

unsigned might be 32 bits, which really isn't OK. Even 64 is pushing it for huge external indexes.

We need these IDs to be stable across executions and versions, which llvm::HashString is not.

I'd suggest using the SHA1 hash (160 bits)

================
Comment at: clangd/index/Index.h:51
+  //     (not a definition), which is usually declared in ".h" file.
+  SymbolLocation CanonicalDeclarationLoc;
+
----------------
nit: remove Loc? It's ambiguous and not really needed

================
Comment at: clangd/index/Index.h:87
+namespace llvm {
+template <> struct DenseMapInfo<clang::clangd::Symbol> {
+  static inline clang::clangd::Symbol getEmptyKey() {
----------------
You might just want to specialize for SymbolID instead - we should probably mostly use explicit DenseMap<SymbolID, Symbol> as we can avoid constructing the Symbol in a bunch of cases.

================
Comment at: clangd/index/SymbolCollector.cpp:30
+// the SourceManager.
+std::string makeAbsolutePath(const SourceManager &SM, StringRef Path) {
+  llvm::SmallString<128> AbsolutePath(Path);
----------------
maybe resolvePath? since it includes symlink resolution

I'm not sure we actually want to resolve symlinks. Let's chat offline.

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897