[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Fri Dec 8 15:29:00 PST 2017

malaperle added inline comments.

================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
malaperle wrote:
> sammccall wrote:
> > hokein wrote:
> > > malaperle wrote:
> > > > sammccall wrote:
> > > > > hokein wrote:
> > > > > > malaperle wrote:
> > > > > > > I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?
> > > > > > I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. 
> > > > > > 
> > > > > > We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on. 
> > > > > > We can't load all the symbol occurrences into the memory since they are too large
> > > > > 
> > > > > I've heard this often, but never backed up by data :-)
> > > > > 
> > > > > Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.
> > > > > 
> > > > > If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.
> > > > > 
> > > > > (Maybe there are edge cases with macros/templates, but we can keep them under control)
> > > > I'd have to break down how much memory it used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.
> > > > 
> > > I can see two points here:
> > > 
> > > 1. For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory.
> > > 2. For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).  
> > (This is still a sidebar - not asking for any changes)
> > 
> > The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
> > And "it's not a good idea" isn't an assertion that can hold without reasons, assumptions, and data.
> > If the size turns out to be, say, 120MB for LLVM, and we want to scale to 10x that, and we're spending 500MB/file for ASTs, then it might well be a good trade.
> > The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
> 
> Indeed. I'll try to come up with more realistic numbers. There are other things not accounted for in the 16 bytes mentioned above, like storing roles and relations.
> 
> > 500MB/file for ASTs
> 
> What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?

Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 files at the same time I it was already around 500MB. Hmm, that's a bit alarming.

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897