[cfe-dev] [RFC] A C++ pseudo parser for tooling

Sat Nov 6 18:06:15 PDT 2021

* People tend to think "idiomatic average code" is whatever tiny, biased
sample of code they have seen (search for Bjarne's adaptation of the
blind men and the elephant parable).  I estimate there are 10 billion lines of
production C++ in the world, and many say that is an underestimate by as
much as 10x (see http://www.tomazos.com/howmuchcpp.pdf).  So Google has
between 0.2% and 2.0% of it (and I've seen that too BTW - I'm a former SWE).
You'll find that, due to various influences (like
styleguide/monorepo/culture), Google's code is somewhat more homogeneous than
the larger population (this can be seen via open-sourced Google code - not
relying on proprietary knowledge).  If you really want to test it properly
I would use the ACTCD19 dataset (see https://codesearch.isocpp.org/faq.html).
But I understand it would be difficult to set up, as getting the 10,000s of
packages building with clang (for baseline comparison) rather than gcc is
non-trivial (as far as I know).

* You say that the use cases are listed in the design document.  What
section/page are you referring to?  I couldn't find them.  Even if we
accepted that the accuracy was 95% (which, to be honest, sounds like a
made-up stat because of its roundness - plus it's not clear what
denominator/unit you're using), for most use cases I can think of,
screwing up 5% of the time would be grossly unacceptable and
counter-productive.  Programming is hard, and the last thing you need is
error-prone tooling making it harder.
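
To illustrate why the denominator matters, here is a minimal sketch with
made-up numbers (the FileResult struct, both function names, and the data in
the usage note are hypothetical, not taken from the design document or from any
real measurement). It scores the same parse results once per syntax node and
once per file:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-file result: how many syntax nodes the file has and
// how many of them the approximate parser got wrong.
struct FileResult {
    int nodes;
    int wrong;
};

// Accuracy with "syntax nodes" as the denominator.
double node_accuracy(const std::vector<FileResult>& files) {
    long total = 0;
    long wrong = 0;
    for (const FileResult& f : files) {
        total += f.nodes;
        wrong += f.wrong;
    }
    if (total == 0) return 1.0;
    return 1.0 - static_cast<double>(wrong) / static_cast<double>(total);
}

// Accuracy with "files parsed without any error" as the numerator and
// "files" as the denominator.
double file_accuracy(const std::vector<FileResult>& files) {
    if (files.empty()) return 1.0;
    std::size_t clean = 0;
    for (const FileResult& f : files) {
        if (f.wrong == 0) ++clean;
    }
    return static_cast<double>(clean) / static_cast<double>(files.size());
}
```

With four invented files of 1000 nodes each and 0, 50, 50 and 100 wrong nodes,
node_accuracy is about 0.95 while file_accuracy is 0.25. Both could be
reported as "the accuracy", which is why the unit needs to be stated.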

On Sat, Nov 6, 2021 at 3:22 PM Sam McCall <sammccall at google.com> wrote:

> On Sat, 6 Nov 2021, 02:00 Andrew Tomazos via cfe-dev, <
> cfe-dev at lists.llvm.org> wrote:
>
>> Unfortunately it's not possible to parse C++ even close to accurately
>> without preprocessing (and so build-system integration).
>>
> We're not convinced this is true, if we're talking about "average code".
> Our measurements show tree-sitter achieving 95%+ average accuracy on a
> large codebase.
> (We hope to achieve higher accuracy, better handling of broken code, and
> finer-grained info by specializing for C++).
>
> Certainly there are cases where it's not possible to parse without both
> preprocessing and semantic analysis, but these aren't most code. The
> strategy here is to make informed guesses and rely on error-tolerance to
> avoid too much fallout from a bad guess. (This is the third category of
> error listed under error-resilience in the doc).
>
>> Macros can expand to an arbitrary token sequence (or even create new
>> tokens through stringization or concatenation).  It means that any
>> identifier can become any token sequence.
>>
>> That's even before we mention how name lookup is needed for
>> disambiguation.  To parse C++ you in fact need to do full preprocessing and
>> a large chunk of semantic analysis.
>>
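
A small sketch of the two quoted points above, using only standard C++ (all
names here, such as CONCAT, STRINGIZE, type_case and func_case, are invented
for illustration). The statement `T(x);` is spelled identically in both
namespaces, yet it is a declaration in one and a function call in the other,
and the macros produce tokens that never appear literally in the source:

```cpp
#include <string>

// Token pasting and stringization create tokens that are not written
// anywhere in the file as such.
#define CONCAT(a, b) a##b
#define STRINGIZE(x) #x

int pasted() {
    int CONCAT(va, lue) = 42;  // declares a variable named `value`
    return value;
}

std::string stringized() {
    return STRINGIZE(a + b);   // becomes the string literal "a + b"
}

namespace type_case {
using T = int;
int demo() {
    int x = 7;
    {
        T(x);   // T names a type: declares a new inner `int x`
        x = 1;  // assigns the inner x only
    }
    return x;   // the outer x is untouched
}
}  // namespace type_case

namespace func_case {
void T(int& v) { v = 1; }
int demo() {
    int x = 7;
    {
        T(x);   // T names a function: this call modifies the outer x
    }
    return x;
}
}  // namespace func_case
```

A parser that sees only this file can resolve T by looking at the enclosing
namespace. If T instead came from a header or from a macro, a single-file
parser would have to guess which of the two readings applies.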
> These are covered in some detail in the design document, I'd be interested
> in your thoughts there, especially real-world examples that are important
> and not solvable in this way.
> (Though yes, we expect to get some cases wrong and to fail
> catastrophically on code where PP is used in unidiomatic ways, just as
> clang-format does).
>
>> Given how inaccurate the parse from the best possible "single source file"
>> parser is - it's not clear what the use case is for it.
>>
> Some use cases are listed in the doc, granted if the parse is too
> inaccurate it won't be useful for them.
> FWIW several of these use-cases are places where we're using regexes today.
>
>> clang-format (largely) only makes whitespace changes, so there is
>> limited opportunity for inaccuracies in its parse to lead to errors.
>>
> Sure. It can lead to style errors though. We enforce both clang-format and
> a style guide on a large part of our codebase, and it works.
> Of course this is only weak evidence as clang-format must infer much less
> structure.
>
>> To generate file outlines and do refactoring I suspect you're better off
>> waiting for a proper parse than using a completely inaccurate one.
>>
> Funny you should mention :-) clangd does provide an AST based outline, and
> it's great. For our internal deployment, the editor team decided to go with
> a (closed-source, relatively simple) pseudo-parser outline instead. It was
> worse, but OK, and having it immediately available was judged more
> important.
> This made me pretty sad but I find it hard to disagree.
>
>> In the dev environment I use, past versions of the indexer had tried to
>> do such an approximate parse, and current versions do a full correct C++
>> parse, so I've experienced the difference first-hand.  It's night and day.
>>
> Agree. This is why we have an AST-based indexer (and many flavors of it,
> just-in-time, background, networked). This won't go away.
> However the time to build that index can be a night and a day, too. People
> edit large codebases on small laptops...
> We think this can be two orders of magnitude faster. If there's a way to
> do that with clang, I'd love to hear it!
>
>
>> Just my 2c.  -Andrew
>>
>> On Fri, Nov 5, 2021 at 1:37 PM Haojian Wu via cfe-dev <
>> cfe-dev at lists.llvm.org> wrote:
>>
>>> Hello everyone,
>>>
>>> We’d like to propose a pseudo-parser which can approximately parse C++
>>> (including broken code). It parses a file in isolation, without needing
>>> headers, compile flags etc. Ambiguities are resolved heuristically, like
>>> clang-format. Its output is a clang::syntax tree, which maps the token
>>> sequence onto the C++ grammar.
>>> Our motivation comes from wanting to add some low latency features (file
>>> outline, refactorings etc) in clangd, but we think this is a useful
>>> building block for other tools too.
>>>
>>> Design is discussed in detail here:
>>> https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing
>>>
>>> The proposal is based on the experience with a working prototype.
>>> Initially, we will focus on building the foundation. We consider the first
>>> version as experimental, and plan to use and validate it with applications
>>> in clangd (the detailed plan is described here
>>> <https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit#heading=h.mawgmexy688j>
>>> ).
>>>
>>> As soon as we have consensus on the proposal, we plan to start this work
>>> in the clang repository (code would be under clang/Tooling/Syntax). We hope
>>> we can start sending out patches for review at the end of November.
>>>
>>> Eager to hear your thoughts. Comments and suggestions are much
>>> appreciated.
>>>
>>> Thanks,
>>> Haojian
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>