[cfe-dev] How does clang-format parse snippets?

Stuart Thomson via cfe-dev cfe-dev at lists.llvm.org
Wed Aug 1 06:19:42 PDT 2018


Thanks for this; it confirms what I expected from clang-format's behaviour. I think that for our use case we would need more than raw tokens (although I may look into doing it all with tokens only). I guess the next question I would have is:


  *   Is there currently some way to sensibly produce partial ASTs?

I understand that clang uses a recursive descent parser and I was wondering if there is any sensible way to halt the descent – I’m not really sure if the grammar of C++ would even allow for such a thing to make sense. I’ve been looking online for some kind of document explaining how the clang parser actually works and the stages it goes through from source > token > AST; I’ve found it difficult to get an understanding by looking at the clang source code.

Thanks,
Stuart


From: Nico Weber <thakis at chromium.org>
Sent: 01 August 2018 14:01
To: Stuart Thomson <Stuart.Thomson at eu.medical.canon>
Cc: cfe-dev <cfe-dev at lists.llvm.org>
Subject: Re: [cfe-dev] How does clang-format parse snippets?

clang-format is token-based: it sees the raw token stream (without even running the preprocessor, which is why it doesn't need -I and -D flags). clang's cc1 flag -dump-raw-tokens shows you what clang-format sees as input. Note how the #if 0 gets printed instead of evaluated:

$ cat foo.cc
#if 0
asdf
#endif

int f();

$ bin/clang -c -Xclang -dump-raw-tokens foo.cc
hash '#'                  [StartOfLine]  Loc=<foo.cc:1:1>
raw_identifier 'if'                      Loc=<foo.cc:1:2>
unknown ' '                              Loc=<foo.cc:1:4>
numeric_constant '0'                     Loc=<foo.cc:1:5>
unknown '
'                                        Loc=<foo.cc:1:6>
raw_identifier 'asdf'     [StartOfLine]  Loc=<foo.cc:2:1>
unknown '
'                                        Loc=<foo.cc:2:5>
hash '#'                  [StartOfLine]  Loc=<foo.cc:3:1>
raw_identifier 'endif'                   Loc=<foo.cc:3:2>
unknown '

'                                        Loc=<foo.cc:3:7>
raw_identifier 'int'      [StartOfLine]  Loc=<foo.cc:5:1>
unknown ' '                              Loc=<foo.cc:5:4>
raw_identifier 'f'                       Loc=<foo.cc:5:5>
l_paren '('                              Loc=<foo.cc:5:6>
r_paren ')'                              Loc=<foo.cc:5:7>
semi ';'                                 Loc=<foo.cc:5:8>
unknown '
'                                        Loc=<foo.cc:5:9>
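
To make "raw token stream" concrete, here is a minimal sketch of a raw lexer that splits text into tokens without ever interpreting preprocessor directives. It is a toy written for illustration, not clang's Lexer: the names RawToken and lexRaw are hypothetical, it uses only the standard library, and it handles just enough token kinds to mirror the dump above (whitespace and unrecognised punctuation are both labelled "unknown", as in that dump).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One raw token: a kind label plus the exact source text it covers.
// (Hypothetical type for this sketch; clang's real Token class is different.)
struct RawToken {
    std::string kind;
    std::string text;
};

// Toy raw lexer: walks the text once and never evaluates directives,
// so "#if 0" comes out as ordinary tokens and the guarded code is still lexed.
std::vector<RawToken> lexRaw(const std::string& src) {
    auto isSpace = [](char ch) {
        return std::isspace(static_cast<unsigned char>(ch)) != 0;
    };
    auto isDigit = [](char ch) {
        return std::isdigit(static_cast<unsigned char>(ch)) != 0;
    };
    auto isIdent = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) != 0 || ch == '_';
    };

    std::vector<RawToken> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const std::size_t start = i;
        const char c = src[i];
        std::string kind;
        if (isSpace(c)) {
            // A run of whitespace becomes a single token.
            while (i < src.size() && isSpace(src[i])) ++i;
            kind = "unknown";
        } else if (isDigit(c)) {
            while (i < src.size() && isIdent(src[i])) ++i;
            kind = "numeric_constant";
        } else if (isIdent(c)) {
            // Keywords are not distinguished from identifiers at this stage.
            while (i < src.size() && isIdent(src[i])) ++i;
            kind = "raw_identifier";
        } else {
            ++i;
            switch (c) {
                case '#': kind = "hash"; break;
                case '(': kind = "l_paren"; break;
                case ')': kind = "r_paren"; break;
                case ';': kind = "semi"; break;
                default:  kind = "unknown"; break;
            }
        }
        out.push_back({kind, src.substr(start, i - start)});
    }
    return out;
}
```

Calling lexRaw("#if 0\nasdf\n#endif\n\nint f();\n") returns 17 tokens, and 'asdf' is among them even though a preprocessor would have discarded it. That is the property that lets a token-based tool run without -I and -D flags.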


clang-format then has a bunch of heuristics to decide if `a * b` is a multiplication or a declaration, but since, as you say, it doesn't build an AST, it doesn't know whether "a" in two different places refers to the same variable. So in general it can't be used for most automated refactorings, since you usually need ASTs for that.
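
As a rough illustration of what such a heuristic looks like, and why it is only a guess, here is a small sketch that classifies the `*` in `a * b` from the spelling of the left-hand token alone. The rules (a short list of builtin type names, plus "starts with an uppercase letter") and the names StarMeaning and guessStar are assumptions made up for this example; they are not clang-format's actual rules.

```cpp
#include <cctype>
#include <set>
#include <string>

enum class StarMeaning { PointerDeclaration, Multiplication };

// Hypothetical spelling-based heuristic: with no AST and no symbol table,
// the only evidence available is what the token to the left of '*' looks like.
StarMeaning guessStar(const std::string& lhs) {
    static const std::set<std::string> builtinTypes = {
        "bool", "char", "double", "float", "int", "long", "short", "void"};

    // A builtin type name on the left means this is a declaration.
    if (builtinTypes.count(lhs) != 0) {
        return StarMeaning::PointerDeclaration;
    }
    // Assumed naming convention: user-defined types start with an uppercase letter.
    if (!lhs.empty() && std::isupper(static_cast<unsigned char>(lhs[0])) != 0) {
        return StarMeaning::PointerDeclaration;
    }
    // Otherwise assume an ordinary variable, so '*' is a multiplication.
    return StarMeaning::Multiplication;
}
```

With these rules, guessStar("int") and guessStar("Widget") are treated as declarations, while guessStar("a") is treated as a multiplication. If `a` is in fact a typedef declared in some header, the guess is wrong, and nothing in the token stream can reveal that. This is the gap an AST fills.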

(clang-format works great for formatting the output of an automated refactoring though.)

On Wed, Aug 1, 2018 at 5:25 AM Stuart Thomson via cfe-dev <cfe-dev at lists.llvm.org> wrote:
Hi,

I’m interested in using clang to refactor snippets of C++ for which I can’t produce an AST. AFAIK this precludes the use of clang tools like clang-check and I wondered if clang-format could be used instead as it doesn’t seem to require the production of an AST. I don’t quite understand how clang-format works and have a couple of questions:


  1.  Is it possible to somehow use clang-format for refactoring C++ according to custom rules? These refactors would be larger scale things than it seems to usually be used for.
  2.  How does clang-format parse C++ without e.g. parsing the includes?

Thanks,
Stuart
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev