<div dir="ltr"><div dir="ltr">On Sun, Nov 7, 2021 at 2:06 AM Andrew Tomazos <<a href="mailto:andrewtomazos@gmail.com">andrewtomazos@gmail.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>* People tend to think "idiomatic average code" is whatever biased tiny sample of code they have seen is (search for Bjarne's adaptation of elephant and blind man parable).  I estimate there are 10 billion lines of production C++ in the world, and many accuse that as an underestimate by as much as 10x (see <a href="http://www.tomazos.com/howmuchcpp.pdf" target="_blank">http://www.tomazos.com/howmuchcpp.pdf</a>).  So Google has between 0.2-2.0% of it (and I've seen that too BTW - I'm a former SWE).  You'll find due to various influences (like styleguide/monorepo/culture) Googles code is somewhat more homogeneous than the larger population (this can be seen via open-sourced Google code - not relying on proprietary knowledge).  If you really want to test it properly I would use the ACTCD19 dataset (see <a href="https://codesearch.isocpp.org/faq.html" target="_blank">https://codesearch.isocpp.org/faq.html</a>).  But I understand it would be difficult to set up as getting the 10,000s of packages building with clang (for baseline comparison) rather than gcc is non-trivial (as far as I know).</div></div></blockquote><div><br></div><div>Yes, this is a good and interesting point. Our measurements were based on two datasets (one of Google-style internal code, one of LLVM open-source code), and I admit that they might not reflect the whole C++ world.</div><div>I agree that we should cover "general" C++ code. 
We could run more experiments/measurements on third-party code (happy to do that if we need more data). In practice, I think we might follow a development flow like clang-format's: start with Google-style/LLVM-style code, then extend and polish the parser to support more general code based on users' feedback and bug reports. </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>* You say that the use cases are listed in the design document.  What section/page are you referring to?  I couldn't find them.  </div></div></blockquote><div><br></div><div>I think they are (implicitly) mentioned in the "Scope" and "Work plan" sections (we will make them clearer).</div><div><br></div><div>IDE use cases (for clangd):</div><div>-  provide code folding, outline, syntax highlighting, and selection features without a long "warmup" time;</div><div>-  a fast index that provides approximate results.</div><div><br></div><div>Other use cases we aim to support:</div><div>- a smart diff and merge tool for C++ code;</div><div>- a fast linter, a cpplint replacement, with clang-tidy-like extensibility;</div><div>- syntactic grep/sed tools.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Even if we accepted that the accuracy was 95% (which kind of sounds, due to its roundness, like a made-up stat to be honest - plus it's not clear what the denominator/unit you're using is)</div></div></blockquote><div><br></div><div>The accuracy measurement compares annotation ranges generated by different parsers (tree-sitter vs. the clang AST); accuracy is calculated from the number of perfectly matched ranges and the number of mismatched ranges.</div><div>The annotations cover some critical pieces of C++ source code:</div><div>- identifiers that introduce a new source code entity: variable, function, class, 
enum, namespace</div>- curly brace structure: class-body, compound-stmt-body, enum-body, initializer-list.<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>for most use cases I can think of screwing up 5% of the time would be grossly unacceptable and counter-productive.  Programming is hard, and the last thing you need is error-prone tooling making it harder.</div></div></blockquote><div><br></div><div>I'm not sure this is true. I think this mainly depends on the use cases. Our internal editor team has their own C++ pseudo-parser (which is written for instant outline and indexing symbols). It has been used for years, and they're happy with that.</div><div>IMO letting users wait a few minutes in a newly-opened editor until all IDE features are available is a poor experience.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Nov 6, 2021 at 3:22 PM Sam McCall <<a href="mailto:sammccall@google.com" target="_blank">sammccall@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, 6 Nov 2021, 02:00 Andrew Tomazos via cfe-dev, <<a href="mailto:cfe-dev@lists.llvm.org" rel="noreferrer" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Unfortunately it's not possible to parse C++ even close to accurately without preprocessing (and so build-system integration).</div></div></blockquote></div></div><div dir="auto">We're not convinced this is true, if we're talking about 
"average code".</div><div dir="auto">Our measurements show tree-sitter achieving 95%+ average accuracy on a large codebase.</div><div dir="auto">(We hope to achieve higher accuracy, better handling of broken code, and finer-grained info by specializing for C++).</div><div dir="auto"><br></div><div dir="auto">Certainly there are cases where it's not possible to parse without both preprocessing and semantic analysis, but these aren't most code. The strategy here is to make informed guesses and rely on error-tolerance to avoid too much fallout from a bad guess. (This is the third category of error listed under error-resilience in the doc).</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>  Macros can expand to an arbitrary token sequence (or even create new tokens through stringization or concatenation).  It means that any identifier can become any token sequence.</div></div></blockquote></div></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>  That's even before we mention how name lookup is needed for disambiguation.  
To parse C++ you in fact need to do full preprocessing and a large chunk of semantic analysis.</div></div></blockquote></div></div><div dir="auto">These are covered in some detail in the design document, I'd be interested in your thoughts there, especially real-world examples that are important and not solvable in this way.</div><div dir="auto">(Though yes, we expect to get some cases wrong and to fail catastrophically on code where PP is used in unidiomatic ways, just as clang-format does).</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Given how inaccurate the parse from the best possible "single source file" parser is - it's not clear what the use case is for it.</div></div></blockquote></div></div><div dir="auto">Some use cases are listed in the doc, granted if the parse is too inaccurate it won't be useful for them.</div><div dir="auto">FWIW several of these use-cases are places where we're using regexes today.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>  clang-format (largely) only makes whitespace changes, so there is limited opportunity for inaccuracies in its parse to lead to errors.</div></div></blockquote></div></div><div dir="auto">Sure. It can lead to style errors though. 
We enforce both clang-format and a style guide on a large part of our codebase, and it works.</div><div dir="auto">Of course this is only weak evidence as clang-format must infer much less structure.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>To generate file outlines and do refactoring I suspect you're better off waiting for a proper parse than using a completely inaccurate one.</div></div></blockquote></div></div><div dir="auto">Funny you should mention :-) clangd does provide an AST based outline, and it's great. For our internal deployment, the editor team decided to go with a (closed-source, relatively simple) pseudo-parser outline instead. It was worse, but OK, and having it immediately available was judged more important.</div><div dir="auto">This made me pretty sad but I find it hard to disagree.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>  In the dev environment I use, past versions of the indexer had tried to do such an approximate parse, and current versions do a full correct C++ parse, so I've experienced the difference first-hand.  It's night and day.</div></div></blockquote></div></div><div dir="auto">Agree. This is why we have an AST-based indexer (and many flavors of it, just-in-time, background, networked). This won't go away.</div><div dir="auto">However the time to build that index can be a night and a day, too. People edit large codebases on small laptops...</div><div dir="auto">We think this can be two orders of magnitude faster. 
If there's a way to do that with clang, I'd love to hear it!</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>Just my 2c.  -Andrew</div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Nov 5, 2021 at 1:37 PM Haojian Wu via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello everyone,<div><br></div><div><span id="gmail-m_-3622774700579810439gmail-m_6177848758270841311m_914876384038888412m_6513957415922095481gmail-m_-5983911854010216040gmail-docs-internal-guid-dbfa41e6-7fff-8f94-dd81-5d40a71a0ea5">We’d like to propose a pseudo-parser which can approximately parse C++ (including broken code). It parses a file in isolation, without needing headers, compile flags etc. Ambiguities are resolved heuristically, like clang-format. 
Its output is a clang::syntax tree, which maps the token sequence onto the C++ grammar.<br>Our motivation comes from wanting to add some low latency features (file outline, refactorings etc) in clangd, but we think this is a useful building block for other tools too.</span></div><div><br></div><div>Design is discussed in detail here: <a href="https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing" style="text-decoration-line:none" rel="noreferrer noreferrer" target="_blank"><span style="font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration-line:underline;vertical-align:baseline;white-space:pre-wrap">https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit?usp=sharing</span></a><br><br><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt">The proposal is based on the experience with a working prototype. Initially, we will focus on building the foundation. We consider the first version as experimental, and plan to use and validate it with applications in clangd (the detailed plan is described <a href="https://docs.google.com/document/d/1eGkTOsFja63wsv8v0vd5JdoTonj-NlN3ujGF0T7xDbM/edit#heading=h.mawgmexy688j" rel="noreferrer noreferrer" target="_blank">here</a>).</p><br>As soon as we have consensus on the proposal, we plan to start this work in the clang repository (code would be under clang/Tooling/Syntax). We hope we can start sending out patches for review at the end of November.<br><br>Eager to hear your thoughts. Comments and suggestions are much appreciated.</div><div><br></div><div><span>Thanks,</span></div><div><span>Haojian</span></div></div>

_______________________________________________<br>

cfe-dev mailing list<br>

<a href="mailto:cfe-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">cfe-dev@lists.llvm.org</a><br>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer noreferrer noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>

</blockquote></div></div>

</blockquote></div></div></div>

</blockquote></div></div>

</blockquote></div></div>