[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Tue Dec 12 08:58:30 PST 2017

malaperle added inline comments.

================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
sammccall wrote:
> malaperle wrote:
> > malaperle wrote:
> > > sammccall wrote:
> > > > hokein wrote:
> > > > > malaperle wrote:
> > > > > > sammccall wrote:
> > > > > > > hokein wrote:
> > > > > > > > malaperle wrote:
> > > > > > > > > I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?
> > > > > > > > I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. 
> > > > > > > > 
> > > > > > > > We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on. 
> > > > > > > > We can't load all the symbol occurrences into the memory since they are too large
> > > > > > > 
> > > > > > > I've heard this often, but never backed up by data :-)
> > > > > > > 
> > > > > > > Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.
> > > > > > > 
> > > > > > > If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.
> > > > > > > 
> > > > > > > (Maybe there are edge cases with macros/templates, but we can keep them under control)
> > > > > > I'd have to break down how much memory it used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.
> > > > > > 
> > > > > I can see two points here:
> > > > > 
> > > > > 1. For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory.
> > > > > 2. For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).  
> > > > (This is still a sidebar - not asking for any changes)
> > > > 
> > > > The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
> > > > And "it's not a good idea" isn't an assertion that can hold without reasons, assumptions, and data.
> > > > If the size turns out to be, say, 120MB for LLVM, and we want to scale to 10x that, and we're spending 500MB/file for ASTs, then it might well be a good trade.
> > > > The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
> > > 
> > > Indeed. I'll try to come up with more realistic numbers. There are other things not accounted for in the 16 bytes mentioned above, like storing roles and relations.
> > > 
> > > > 500MB/file for ASTs
> > > 
> > > What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> > > What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?
> > 
> > Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 files at the same time I it was already around 500MB. Hmm, that's a bit alarming.
> Right, just that we have to consider RAM usage for the index in the context of clangd's overall requirements - if other parts of clangd use 1G of ram for typical work on a large project, then we shouldn't rule out spending a couple of 100MB on the index if it adds a lot of value.
Agreed we have to consider the overall requirements. I think over 1GB of RAM is not good for our use case, whether is comes from the AST or index. I think it's perfectly normal if we have different requirements but we can see discuss how to design things so there are options to use either more RAM or disk space. It seems the AST would be the most important factor for now so perhaps it's something we should start investigating/discussing.

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897