[cfe-dev] [RFC] A C++ pseudo parser for tooling

Mon Nov 8 03:18:30 PST 2021

On Sun, Nov 7, 2021 at 2:06 AM Andrew Tomazos <andrewtomazos at gmail.com>
wrote:

> * People tend to think "idiomatic average code" is whatever biased tiny
> sample of code they have seen is (search for Bjarne's adaptation of
> elephant and blind man parable).  I estimate there are 10 billion lines of
> production C++ in the world, and many accuse that as an underestimate by as
> much as 10x (see http://www.tomazos.com/howmuchcpp.pdf).  So Google has
> between 0.2-2.0% of it (and I've seen that too BTW - I'm a former SWE).
> You'll find due to various influences (like
> styleguide/monorepo/culture) Googles code is somewhat more homogeneous than
> the larger population (this can be seen via open-sourced Google code - not
> relying on proprietary knowledge).  If you really want to test it properly
> I would use the ACTCD19 dataset (see
> https://codesearch.isocpp.org/faq.html).  But I understand it would be
> difficult to set up as getting the 10,000s of packages building with clang
> (for baseline comparison) rather than gcc is non-trivial (as far as I know).
>

Yes, this is a good and interesting point. Our measurement was based on two
datasets (one for google-style internal code, one for LLVM open-source
code), I'd admit that they might not reflect the whole C++ world.
I agree that we should cover "general" C++ code. We could run more
experiments/measurements on 3rd-party code (happy to do that if we need
more data), in practice I think we might follow the devflow like
clang-format -- we start with google-style/llvm-style code, extend and
polish the parser to support more general code based on users' feedback and
bug reports.

* You say that the use cases are listed in the design document.  What
> section/page are you referring to?  I couldn't find them.
>

I think they are (implicitly) mentioned in the "Scope" and "Work plan"
sections (will make them clearer).

IDE use cases (for clangd)
-  provide code-folding, outline, syntax highlighting, selection features
without a long "warmup" time;
-  a fast index to provides approximate results;

Other use cases we aim to support:
- smart diff and merge tool for C++ code;
- a fast linter, a cpplint replacement, with clang-tidy-like extensibility;
- syntactic grep/sed tools;

Even if we accepted that the accuracy was 95% (which kind of sounds, due to
> its roundness, like a made-up stat to be honest - plus it's not clear what
> the denominator/unit you're using is)
>

The accuracy measurement is comparing annotation ranges generated from
different parsers (tree-sitter vs the clang-AST), the accuracy is
calculated based on the number of perfectly-matched ranges and mismatched
ranges.
The annotations include some critical pieces of C++ source code:
- identifiers that introduce a new source code entity: variable, function,
class, enum, namespace
- curly brace structure: class-body, compound-stmt-body, enum-body,
initializer-list.

> for most use cases I can think of screwing up 5% of the time would be
> grossly unacceptable and counter-productive.  Programming is hard, and the
> last thing you need is error-prone tooling making it harder.
>

I'm not sure this is true. I think this mainly depends on the use cases.
Our internal editor team has their own C++ pseudo-parser (which is written
for instant outline and indexing symbols). It has been used for years, and
they're happy with that.
IMO letting users wait a few minutes in a newly-opened editor until all IDE
features are available is a poor experience.

On Sat, Nov 6, 2021 at 3:22 PM Sam McCall <sammccall at google.com> wrote:
>
>> On Sat, 6 Nov 2021, 02:00 Andrew Tomazos via cfe-dev, <
>> cfe-dev at lists.llvm.org> wrote:
>>
>>> Unfortunately it's not possible to parse C++ even close to accurately
>>> without preprocessing (and so build-system integration).
>>>
>> We're not convinced this is true, if we're talking about "average code".
>> Our measurements show tree-sitter achieving 95%+ average accuracy on a
>> large codebase.
>> (We hope to achieve higher accuracy, better handling of broken code, and
>> finer-grained info by specializing for C++).
>>
>> Certainly there are cases where it's not possible to parse without both
>> preprocessing and semantic analysis, but these aren't most code. The
>> strategy here is to make informed guesses and rely on error-tolerance to
>> avoid too much fallout from a bad guess. (This is the third category of
>> error listed under error-resilience in the doc).
>>
>>   Macros can expand to an arbitrary token sequence (or even create new
>>> tokens through stringization or concatenation).  It means that any
>>> identifier can become any token sequence.
>>>
>>   That's even before we mention how name lookup is needed for
>>> disambiguation.  To parse C++ you in fact need to do full preprocessing and
>>> a large chunk of semantic analysis.
>>>
>> These are covered in some detail in the design document, I'd be
>> interested in your thoughts there, especially real-world examples that are
>> important and not solvable in this way.
>> (Though yes, we expect to get some cases wrong and to fail
>> catastrophically on code where PP is used in unidiomatic ways, just as
>> clang-format does).
>>
>> Given how inaccurate the parse from the best possible "single source
>>> file" parser is - it's not clear what the use case is for it.
>>>
>> Some use cases are listed in the doc, granted if the parse is too
>> inaccurate it won't be useful for them.
>> FWIW several of these use-cases are places where we're using regexes
>> today.
>>
>>   clang-format (largely) only makes whitespace changes, so there is
>>> limited opportunity for inaccuracies in its parse to lead to errors.
>>>
>> Sure. It can lead to style errors though. We enforce both clang-format
>> and a style guide on a large part of our codebase, and it works.
>> Of course this is only weak evidence as clang-format must infer much less
>> structure.
>>
>> To generate file outlines and do refactoring I suspect you're better off
>>> waiting for a proper parse than using a completely inaccurate one.
>>>
>> Funny you should mention :-) clangd does provide an AST based outline,
>> and it's great. For our internal deployment, the editor team decided to go
>> with a (closed-source, relatively simple) pseudo-parser outline instead. It
>> was worse, but OK, and having it immediately available was judged more
>> important.
>> This made me pretty sad but I find it hard to disagree.
>>
>>   In the dev environment I use, past versions of the indexer had tried to
>>> do such an approximate parse, and current versions do a full correct C++
>>> parse, so I've experienced the difference first-hand.  It's night and day.
>>>
>> Agree. This is why we have an AST-based indexer (and many flavors of it,
>> just-in-time, background, networked). This won't go away.
>> However the time to build that index can be a night and a day, too.
>> People edit large codebases on small laptops...
>> We think this can be two orders of magnitude faster. If there's a way to
>> do that with clang, I'd love to hear it!
>>
>>
>>> Just my 2c.  -Andrew
>>>
>>> On Fri, Nov 5, 2021 at 1:37 PM Haojian Wu via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> We’d like to propose a pseudo-parser which can approximately parse C++
>>>> (including broken code). It parses a file in isolation, without needing
>>>> headers, compile flags etc. Ambiguities are resolved heuristically, like
>>>> clang-format. Its output is a clang::syntax tree, which maps the token
>>>> sequence onto the C++ grammar.
>>>> Our motivation comes from wanting to add some low latency features
>>>> (file outline, refactorings etc) in clangd, but we think this is a useful
>>>> building block for other tools too.
>>>>
>>>> Design is discussed in detail here:
>>>> https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing
>>>>
>>>> The proposal is based on the experience with a working prototype.
>>>> Initially, we will focus on building the foundation. We consider the first
>>>> version as experimental, and plan to use and validate it with applications
>>>> in clangd (the detailed plan is described here
>>>> <https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit#heading=h.mawgmexy688j>
>>>> ).
>>>>
>>>> As soon as we have consensus on the proposal, we plan to start this
>>>> work in the clang repository (code would be under clang/Tooling/Syntax). We
>>>> hope we can start sending out patches for review at the end of November.
>>>>
>>>> Eager to hear your thoughts. Comments and suggestions are much
>>>> appreciated.
>>>>
>>>> Thanks,
>>>> Haojian
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211108/8ebfcb81/attachment-0001.html>