[PATCH] D40897: [clangd] Introduce a "Symbol" class.

Fri Dec 8 08:03:20 PST 2017

malaperle added a comment.

(From https://reviews.llvm.org/D40548) Here's the interface for querying the index that I am using right now. It's meant to be able to retrieve from any kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I was able to implement "Open Workspace Symbol" (which is close to code completion in concept), Find References and Find Definitions.

  using USR = llvm::SmallString<256>;

  class ClangdIndexDataOccurrence;

  class ClangdIndexDataSymbol {
  public:
    virtual index::SymbolKind getKind() = 0;
    /// For example, for mynamespace::myclass::mymethod, this will be
    /// mymethod.
    virtual std::string getName() = 0;
    /// For example, for mynamespace::myclass::mymethod, this will be
    /// mynamespace::myclass::
    virtual std::string getQualifier() = 0;
    virtual std::string getUsr() = 0;

    virtual void foreachOccurrence(index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

    virtual ~ClangdIndexDataSymbol() = default;
  };

  class ClangdIndexDataOccurrence {
  public:
    enum class OccurrenceType : uint16_t {
       OCCURRENCE,
       DEFINITION_OCCURRENCE
     };

    virtual OccurrenceType getKind() const = 0;
    virtual std::string getPath() = 0;
    /// Get the start offset of the symbol occurrence. The SourceManager can be
    /// used for implementations that need to convert from a line/column
    /// representation to an offset.
    virtual uint32_t getStartOffset(SourceManager &SM) = 0;
    /// Get the end offset of the symbol occurrence. The SourceManager can be
    /// used for implementations that need to convert from a line/column
    /// representation to an offset.
    virtual uint32_t getEndOffset(SourceManager &SM) = 0;
    virtual ~ClangdIndexDataOccurrence() = default;
    //TODO: Add relations

    static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::OCCURRENCE; }
  };

  /// An occurrence that also has definition with a body that requires additional
  /// locations to keep track of the beginning and end of the body.
  class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
  public:
    virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
    virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;

    static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
  };

  class ClangdIndexDataProvider {
  public:

    virtual void foreachSymbols(StringRef Pattern, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;
    virtual void foreachSymbols(const USR &Usr, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;

    virtual ~ClangdIndexDataProvider() = default;
  };

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

================
Comment at: clangd/Symbol.h:37
+// The class presents a C++ symbol, e.g. class, function.
+struct Symbol {
+  // The symbol identifier, using USR.
----------------
sammccall wrote:
> hokein wrote:
> > malaperle wrote:
> > > sammccall wrote:
> > > > hokein wrote:
> > > > > malaperle wrote:
> > > > > > I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?
> > > > > I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. 
> > > > > 
> > > > > We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on. 
> > > > > We can't load all the symbol occurrences into the memory since they are too large
> > > > 
> > > > I've heard this often, but never backed up by data :-)
> > > > 
> > > > Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.
> > > > 
> > > > If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.
> > > > 
> > > > (Maybe there are edge cases with macros/templates, but we can keep them under control)
> > > I'd have to break down how much memory it used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.
> > > 
> > I can see two points here:
> > 
> > 1. For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory.
> > 2. For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).  
> (This is still a sidebar - not asking for any changes)
> 
> The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
> And "it's not a good idea" isn't an assertion that can hold without reasons, assumptions, and data.
> If the size turns out to be, say, 120MB for LLVM, and we want to scale to 10x that, and we're spending 500MB/file for ASTs, then it might well be a good trade.
> The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).

Indeed. I'll try to come up with more realistic numbers. There are other things not accounted for in the 16 bytes mentioned above, like storing roles and relations.

> 500MB/file for ASTs

What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D40897