[cfe-dev] [RFC] A C++ pseudo parser for tooling

Sat Nov 6 08:22:34 PDT 2021

On Sat, 6 Nov 2021, 02:00 Andrew Tomazos via cfe-dev, <
cfe-dev at lists.llvm.org> wrote:

> Unfortunately it's not possible to parse C++ even close to accurately
> without preprocessing (and so build-system integration).
>
We're not convinced this is true, if we're talking about "average code".
Our measurements show tree-sitter achieving 95%+ average accuracy on a
large codebase.
(We hope to achieve higher accuracy, better handling of broken code, and
finer-grained info by specializing for C++).

Certainly there are cases where it's not possible to parse without both
preprocessing and semantic analysis, but these aren't most code. The
strategy here is to make informed guesses and rely on error-tolerance to
avoid too much fallout from a bad guess. (This is the third category of
error listed under error-resilience in the doc).
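
To make "informed guess" concrete, here is a minimal sketch of one such
guess for a statement of the form `lhs * rhs;`. The naming-convention rule
and every name in it (Guess, looksLikeTypeName, guessStarStatement) are
assumptions made up for illustration. They are not the heuristics from the
design doc or the prototype.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum class Guess { Declaration, Expression };

// Hypothetical rule (an assumption, not the proposal's heuristic): treat an
// identifier as a type name if it starts with an uppercase letter or ends
// in "_t".
inline bool looksLikeTypeName(const std::string &id) {
  if (id.empty())
    return false;
  if (std::isupper(static_cast<unsigned char>(id[0])))
    return true;
  return id.size() > 2 && id.compare(id.size() - 2, 2, "_t") == 0;
}

// Guess how to read `lhs * rhs;` without any name lookup.
inline Guess guessStarStatement(const std::string &lhs) {
  return looksLikeTypeName(lhs) ? Guess::Declaration : Guess::Expression;
}

// Usage: the guess is cheap and sometimes wrong, which is why the parser
// around it has to tolerate errors.
inline void demoGuess() {
  assert(guessStarStatement("Widget") == Guess::Declaration);
  assert(guessStarStatement("count") == Guess::Expression);
}
```

A guess like this will misread code that breaks the convention, so it only
makes sense together with error-tolerance in the rest of the parse.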

> Macros can expand to an arbitrary token sequence (or even create new
> tokens through stringization or concatenation).  It means that any
> identifier can become any token sequence.
>
> That's even before we mention how name lookup is needed for
> disambiguation.  To parse C++ you in fact need to do full preprocessing and
> a large chunk of semantic analysis.
>
These are covered in some detail in the design document. I'd be interested
in your thoughts there, especially real-world examples that are important
and not solvable in this way.
(Though yes, we expect to get some cases wrong, and to fail catastrophically
on code where the preprocessor is used in unidiomatic ways, just as
clang-format does).
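
For reference, a minimal illustration of the name-lookup ambiguity being
discussed (the namespaces and identifiers are made up for the example): the
same tokens `a * b` form a declaration when `a` names a type, and a
multiplication when `a` names a variable.

```cpp
#include <cassert>
#include <type_traits>

namespace a_is_type {
using a = int;
inline bool probe() {
  a * b = nullptr; // declaration: b is a pointer to a, i.e. int*
  static_assert(std::is_same<decltype(b), int *>::value, "b is int*");
  return b == nullptr;
}
} // namespace a_is_type

namespace a_is_value {
inline int probe() {
  int a = 6, b = 7;
  return a * b; // expression: multiplication of two ints
}
} // namespace a_is_value

// Usage: both functions compile, but only name lookup tells the two
// readings of `a * b` apart.
inline void demoAmbiguity() {
  assert(a_is_type::probe());
  assert(a_is_value::probe() == 42);
}
```

A parser that sees only the tokens has to guess which reading applies.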

> Given how inaccurate the parse from the best possible "single source file"
> parser is - it's not clear what the use case is for it.
>
Some use cases are listed in the doc. Granted, if the parse is too
inaccurate it won't be useful for them.
FWIW several of these use cases are places where we're using regexes today.

> clang-format (largely) only makes whitespace changes, so there is limited
> opportunity for inaccuracies in its parse to lead to errors.
>
Sure. It can lead to style errors though. We enforce both clang-format and
a style guide on a large part of our codebase, and it works.
Of course this is only weak evidence as clang-format must infer much less
structure.

> To generate file outlines and do refactoring I suspect you're better off
> waiting for a proper parse than using a completely inaccurate one.
>
Funny you should mention it :-) clangd does provide an AST-based outline, and
it's great. For our internal deployment, the editor team decided to go with
a (closed-source, relatively simple) pseudo-parser outline instead. It was
worse, but OK, and having it immediately available was judged more
important.
This made me pretty sad but I find it hard to disagree.

> In the dev environment I use, past versions of the indexer had tried to
> do such an approximate parse, and current versions do a full correct C++
> parse, so I've experienced the difference first-hand.  It's night and day.
>
Agree. This is why we have an AST-based indexer (and many flavors of it,
just-in-time, background, networked). This won't go away.
However the time to build that index can be a night and a day, too. People
edit large codebases on small laptops...
We think this can be two orders of magnitude faster. If there's a way to do
that with clang, I'd love to hear it!

> Just my 2c.  -Andrew
>
> On Fri, Nov 5, 2021 at 1:37 PM Haojian Wu via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Hello everyone,
>>
>> We’d like to propose a pseudo-parser which can approximately parse C++
>> (including broken code). It parses a file in isolation, without needing
>> headers, compile flags etc. Ambiguities are resolved heuristically, like
>> clang-format. Its output is a clang::syntax tree, which maps the token
>> sequence onto the C++ grammar.
>> Our motivation comes from wanting to add some low latency features (file
>> outline, refactorings etc) in clangd, but we think this is a useful
>> building block for other tools too.
>>
>> Design is discussed in detail here:
>> https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing
>>
>> The proposal is based on the experience with a working prototype.
>> Initially, we will focus on building the foundation. We consider the first
>> version as experimental, and plan to use and validate it with applications
>> in clangd (the detailed plan is described here
>> <https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit#heading=h.mawgmexy688j>
>> ).
>>
>> As soon as we have consensus on the proposal, we plan to start this work
>> in the clang repository (code would be under clang/Tooling/Syntax). We hope
>> we can start sending out patches for review at the end of November.
>>
>> Eager to hear your thoughts. Comments and suggestions are much
>> appreciated.
>>
>> Thanks,
>> Haojian
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>