[cfe-dev] [RFC] A C++ pseudo parser for tooling

Haojian Wu via cfe-dev cfe-dev at lists.llvm.org
Mon Nov 8 01:55:19 PST 2021


On Sun, Nov 7, 2021 at 6:14 AM David Blaikie via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> On Sat, Nov 6, 2021 at 8:50 AM Sam McCall <sammccall at google.com> wrote:
>
>> On Sat, 6 Nov 2021, 03:36 David Blaikie via cfe-dev, <
>> cfe-dev at lists.llvm.org> wrote:
>>
>>> Yeah, FWIW I'd +1 Andrew's comments here - it was sort of one major
>>> premise of clang being designed as a reusable library, that C++ is just too
>>> complicated to reimplement separately/repeatedly in various tools.
>>>
>> Yes. This is a good argument for reusable implementations, but I'm not
>> sure one is enough.
>> c.f. clang-format not using clang beyond the lexer, and the success
>> attributable to that.
>> Ideally we'd share an impl there, in practice its maturity as a product
>> and concrete design choices in its parser combine to make that hard.
>>
>
> Given the long time scale of these things - any chance of a plan to
> converge clang-format and this new thing eventually? (so we have 2 rather
> than 3 versions of C++ understanding in the LLVM project)
>

In an ideal world, we'd have a single pseudo parser, but in practice that is
hard. The reasons are described in the doc
<https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit#heading=h.fnjjpb3419wk>:
clang-format's parser is designed specifically
<https://lists.llvm.org/pipermail/cfe-dev/2014-July/038332.html> for
formatting purposes and is tightly coupled with clang-format internals
(extending or refactoring it is possible but hard, and introduces risk to a
mature product), and clang-format supports non-C-family languages, which is
not a goal for us. Given that, we don't plan to share code with clang-format
(at least in the initial stages).


>>> For something that's going to change significant code - how slow is a
>>> clang-based solution? What's the tradeoff being made?
>>>
>> Basically it's the difference between interactive latency and not.
>> For our internal clangd deployment (because those are the numbers I have)
>> 90%ile is most of a minute to parse headers, and several minutes in the
>> build system to get ready (generated headers, flags...).
>>
>
> How much of this work is equivalent/shared/cached by the build system?
> (eg: if I just did a build, then I wanted to refactor a function - how long
> are we talking there?)
>

If the preamble is already built, the 90th percentile of BuildAST is ~1.7s.
clangd also caches ASTs; on a cache hit, AST-based operations (hover,
go-to-def, etc.) are generally fast, with a 95th percentile of ~400ms.


>> Secondarily, it's the difference between just using the tool and having to
>> "set it up". We do a lot of user support for clangd and I can tell you this
>> is a nontrivial concern. (For people who build with something that's not
>> recent mainline clang/gcc, target weird platforms, don't build on the
>> machine they edit on, use non-cmake build systems, ...)
>>
>
> The second one I have less concern for, I'll admit.
>
>
>>
>> You can see this tradeoff play out in the recent discussions about
>> whether an "east const" conversion belongs in clang-format vs clang-tidy:
>> one of the arguments for putting it in clang-format is it's the only way to
>> make it fast and easy enough that people want to use it.
>>
>>
>>> On Fri, Nov 5, 2021 at 6:00 PM Andrew Tomazos via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>> Unfortunately it's not possible to parse C++ even close to accurately
>>>> without preprocessing (and so build-system integration).  There are
>>>> predefined macros that determine what code is conditionally included,
>>>> conditionally included code can change basically anything, redefine
>>>> anything.  Macros can expand to an arbitrary token sequence (or even create
>>>> new tokens through stringization or concatenation).  It means that any
>>>> identifier can become any token sequence.  That's even before we mention
>>>> how name lookup is needed for disambiguation.  To parse C++ you in fact
>>>> need to do full preprocessing and a large chunk of semantic analysis.
>>>>
>>>> Given how inaccurate the parse from the best possible "single source
>>>> file" parser is - it's not clear what the use case is for it.  clang-format
>>>> (largely) only makes whitespace changes, so there is limited opportunity
>>>> for inaccuracies in its parse to lead to errors.
>>>>
>>>> To generate file outlines and do refactoring I suspect you're better
>>>> off waiting for a proper parse than using a completely inaccurate one.  In
>>>> the dev environment I use, past versions of the indexer had tried to do
>>>> such an approximate parse, and current versions do a full correct C++
>>>> parse, so I've experienced the difference first-hand.  It's night and day.
>>>>
>>>> Just my 2c.  -Andrew
>>>>
>>>> On Fri, Nov 5, 2021 at 1:37 PM Haojian Wu via cfe-dev <
>>>> cfe-dev at lists.llvm.org> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> We’d like to propose a pseudo-parser which can approximately parse C++
>>>>> (including broken code). It parses a file in isolation, without needing
>>>>> headers, compile flags etc. Ambiguities are resolved heuristically, like
>>>>> clang-format. Its output is a clang::syntax tree, which maps the token
>>>>> sequence onto the C++ grammar.
>>>>> Our motivation comes from wanting to add some low latency features
>>>>> (file outline, refactorings etc) in clangd, but we think this is a useful
>>>>> building block for other tools too.
>>>>>
>>>>> Design is discussed in detail here:
>>>>> https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing
>>>>>
>>>>> The proposal is based on the experience with a working prototype.
>>>>> Initially, we will focus on building the foundation. We consider the first
>>>>> version as experimental, and plan to use and validate it with applications
>>>>> in clangd (the detailed plan is described here
>>>>> <https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit#heading=h.mawgmexy688j>
>>>>> ).
>>>>>
>>>>> As soon as we have consensus on the proposal, we plan to start this
>>>>> work in the clang repository (code would be under clang/Tooling/Syntax). We
>>>>> hope we can start sending out patches for review at the end of November.
>>>>>
>>>>> Eager to hear your thoughts. Comments and suggestions are much
>>>>> appreciated.
>>>>>
>>>>> Thanks,
>>>>> Haojian
>>>>> _______________________________________________
>>>>> cfe-dev mailing list
>>>>> cfe-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>