[cfe-dev] [RFC] A C++ pseudo parser for tooling

Tue Nov 9 06:34:17 PST 2021

On Tue, Nov 9, 2021 at 3:40 AM Andrew Tomazos <andrewtomazos at gmail.com>
wrote:

> On Mon, Nov 8, 2021 at 11:18 AM Haojian Wu <hokein at google.com> wrote:
>
>> IDE use cases (for clangd)
>> -  provide code-folding, outline, syntax highlighting, selection features
>> without a long "warmup" time;
>> -  a fast index to provide approximate results;
>>
>> Other use cases we aim to support:
>> - smart diff and merge tool for C++ code;
>> - a fast linter, a cpplint replacement, with clang-tidy-like
>> extensibility;
>> - syntactic grep/sed tools;
>>
>
> * I don't know what "fast index to provide approximate results" means.
> Results of what?  Do you mean generating an index?  What will the index be
> used for?
>
clangd has a symbol index to enable codebase-wide operations (see
SymbolIndex
<https://github.com/llvm/llvm-project/blob/main/clang-tools-extra/clangd/index/Index.h>
and some documentation <https://clangd.llvm.org/design/indexing>).
These include:

   - go-to-definition: finding a definition associated with a declaration
   visible in the AST
   - code completion: for contexts where the AST cannot provide all results
   efficiently, such as namespace scopes (including results from non-included
   headers)
   - cross-references: finding references from files that are not part of
   the current AST

Today this index is built from ASTs in various ways (see docs), which takes
many hours for large codebases (on machines too slow to build them).
Most results are missing for a long time. Many users turn off indexing
(e.g. to avoid battery drain) and the results stay missing. If compile flag
metadata is missing for the project, these features don't work at all.

The idea for clangd is to augment (not replace) this index with a
pseudo-parser based index that processes each file once. It would be
halfway between the AST index and grep. This index would provide the same
operations with lower fidelity, and would be replaced by the AST-based
index as it completes.
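
To make the layering concrete, here is a minimal sketch of how lookups could
prefer AST results and fall back to pseudo-parser results per file. It is an
illustration only: LayeredIndex, Location, addApproximate, addExact and
markFileIndexed are hypothetical names invented for this example, not
clangd's actual classes or API, and a real index would store much more than
a file and a line.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical types for illustration; these are not clangd's real classes.
struct Location {
  std::string file;
  int line;          // 1-based line number
  bool approximate;  // true if the entry came from the pseudo-parser pass
};

class LayeredIndex {
public:
  // Record a definition found by the fast, single-pass pseudo-parser.
  void addApproximate(const std::string &symbol, const std::string &file,
                      int line) {
    assert(line > 0 && "line numbers are 1-based");
    approx_[symbol].push_back(Location{file, line, true});
  }

  // Record a definition found by the slower AST-based indexer. The file is
  // marked as indexed, so its approximate entries stop being returned.
  void addExact(const std::string &symbol, const std::string &file,
                int line) {
    assert(line > 0 && "line numbers are 1-based");
    exact_[symbol].push_back(Location{file, line, false});
    indexed_.insert(file);
  }

  // Mark a file as fully AST-indexed even if it contributed no symbols.
  void markFileIndexed(const std::string &file) { indexed_.insert(file); }

  // Return AST-based results first, then approximate results from files
  // the AST indexer has not reached yet.
  std::vector<Location> lookup(const std::string &symbol) const {
    std::vector<Location> out;
    auto exactIt = exact_.find(symbol);
    if (exactIt != exact_.end())
      out = exactIt->second;
    auto approxIt = approx_.find(symbol);
    if (approxIt != approx_.end())
      for (const Location &loc : approxIt->second)
        if (indexed_.count(loc.file) == 0)
          out.push_back(loc);
    return out;
  }

private:
  std::map<std::string, std::vector<Location>> exact_;
  std::map<std::string, std::vector<Location>> approx_;
  std::set<std::string> indexed_;
};
```

In this sketch the replacement happens per file: once the AST indexer has
covered a file, that file's lower-fidelity entries are hidden, while files
it has not reached yet keep serving approximate results.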

> * Syntax highlighting is the only use case of those listed that can
> tolerate inaccuracy.  For the rest, a correct parse will be more
> productive.  The trouble is that if people start depending on these
> features in their workflow, when they fail (and they often will) it will be
> very disruptive.  The cost of the disruption outweighs the time saved
> waiting for a correct parse.
>
Our experience with clangd is that people very often value latency over
correctness when editing C++ code, and this is a situational, quantitative
question.
As examples, we've failed to replace cpplint and our heuristic outline with
clang-tidy and our AST-based outline. Despite being inaccurate and
incomplete, users find them useful and are not willing to wait.

> * I think you are better off spending your time on optimizing the correct
> parser infrastructure.  I'm sure more can be done - particularly in terms
> of caching, persisting and reusing state (think like PCH and modules etc).
>
We have worked on projects over several years to improve these things (and
other aspects such as error-resilience). We agree there's more that can be
done, and will continue to work on this. We don't believe this approach
will get anywhere near a 100x latency improvement, which is what we're
looking for.