<div dir="auto"><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, 6 Nov 2021, 02:00 Andrew Tomazos via cfe-dev, <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank" rel="noreferrer">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Unfortunately it's not possible to parse C++ even close to accurately without preprocessing (and so build-system integration).</div></div></blockquote></div></div><div dir="auto">We're not convinced this is true, if we're talking about "average code".</div><div dir="auto">Our measurements show tree-sitter achieving 95%+ average accuracy on a large codebase.</div><div dir="auto">(We hope to achieve higher accuracy, better handling of broken code, and finer-grained info by specializing for C++).</div><div dir="auto"><br></div><div dir="auto">Certainly there are cases where it's not possible to parse without both preprocessing and semantic analysis, but these aren't most code. The strategy here is to make informed guesses and rely on error-tolerance to avoid too much fallout from a bad guess. (This is the third category of error listed under error-resilience in the doc).</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div> Macros can expand to an arbitrary token sequence (or even create new tokens through stringization or concatenation). It means that any identifier can become any token sequence.</div></div></blockquote></div></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div> That's even before we mention how name lookup is needed for disambiguation. 
To parse C++ you in fact need to do full preprocessing and a large chunk of semantic analysis.</div></div></blockquote></div></div><div dir="auto">These are covered in some detail in the design document; I'd be interested in your thoughts there, especially real-world examples that are important and not solvable in this way.</div><div dir="auto">(Though yes, we expect to get some cases wrong and to fail catastrophically on code where the preprocessor is used in unidiomatic ways, just as clang-format does).</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Given how inaccurate the parse from the best possible "single source file" parser is - it's not clear what the use case is for it.</div></div></blockquote></div></div><div dir="auto">Some use cases are listed in the doc; granted, if the parse is too inaccurate, it won't be useful for them.</div><div dir="auto">FWIW, several of these use cases are places where we're using regexes today.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div> clang-format (largely) only makes whitespace changes, so there is limited opportunity for inaccuracies in its parse to lead to errors.</div></div></blockquote></div></div><div dir="auto">Sure. It can lead to style errors though. 
We enforce both clang-format and a style guide on a large part of our codebase, and it works.</div><div dir="auto">Of course, this is only weak evidence, as clang-format must infer much less structure.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>To generate file outlines and do refactoring I suspect you're better off waiting for a proper parse than using a completely inaccurate one.</div></div></blockquote></div></div><div dir="auto">Funny you should mention it :-) clangd does provide an AST-based outline, and it's great. For our internal deployment, the editor team decided to go with a (closed-source, relatively simple) pseudo-parser outline instead. It was worse, but OK, and having it immediately available was judged more important.</div><div dir="auto">This made me pretty sad, but I find it hard to disagree.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div> In the dev environment I use, past versions of the indexer had tried to do such an approximate parse, and current versions do a full correct C++ parse, so I've experienced the difference first-hand. It's night and day.</div></div></blockquote></div></div><div dir="auto">Agreed. This is why we have an AST-based indexer (and many flavors of it: just-in-time, background, networked). This won't go away.</div><div dir="auto">However, the time to build that index can be a night and a day, too. People edit large codebases on small laptops...</div><div dir="auto">We think the pseudo-parser can be two orders of magnitude faster. 
If there's a way to do that with clang, I'd love to hear it!</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><br></div><div>Just my 2c. -Andrew</div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Nov 5, 2021 at 1:37 PM Haojian Wu via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello everyone,<div><br></div><div><span id="m_914876384038888412m_6513957415922095481gmail-m_-5983911854010216040gmail-docs-internal-guid-dbfa41e6-7fff-8f94-dd81-5d40a71a0ea5">We’d like to propose a pseudo-parser which can approximately parse C++ (including broken code). It parses a file in isolation, without needing headers, compile flags etc. Ambiguities are resolved heuristically, like clang-format. 
Its output is a clang::syntax tree, which maps the token sequence onto the C++ grammar.<br>Our motivation comes from wanting to add some low latency features (file outline, refactorings etc) in clangd, but we think this is a useful building block for other tools too.</span></div><div><br></div><div>Design is discussed in detail here: <a href="https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing" style="text-decoration-line:none" rel="noreferrer noreferrer" target="_blank"><span style="font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap">https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing</span></a><br><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt">The proposal is based on the experience with a working prototype. Initially, we will focus on building the foundation. We consider the first version as experimental, and plan to use and validate it with applications in clangd (the detailed plan is described <a href="https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit#heading=h.mawgmexy688j" rel="noreferrer noreferrer" target="_blank">here</a>).</p><br>As soon as we have consensus on the proposal, we plan to start this work in the clang repository (code would be under clang/Tooling/Syntax). We hope we can start sending out patches for review at the end of November.<br><br>Eager to hear your thoughts. Comments and suggestions are much appreciated.</div><div><br></div><div><span>Thanks,</span></div><div><span>Haojian</span></div></div>
_______________________________________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">cfe-dev@lists.llvm.org</a><br>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer noreferrer noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
</blockquote></div></div>
</blockquote></div></div></div>