[cfe-dev] [clang-format] Token Annotation limitations

Thu Dec 2 11:37:54 PST 2021

Am 02.12.2021 um 12:01 schrieb MyDeveloper Day via cfe-dev:
> I've recently been looking into a bug in clang format, that incorrectly treats * 
> as a TT_PointerOrReference
> 
> void f() { operator+(a, b *b); }
> 
> Actually the misinterpretation of * and & as BinaryOperators or 
> PointerOrReference is a major source of bugs for clang-format, this is because 
> we don't have semantic information and often we are looking very myopically at 
> adjacent tokens and there just isn't enough context to disambiguate.
> 
> So whilst we see b * b this a easily (b x b)
> 
> This is actually just
> 
> tok::identifier->tok::star->tok::identifier
> 
> Which is indistinguishable from
> 
> MyClass *b
> 
> which I think we'd all see as (class ptr b), but in all cases we have no 
> additional information, other than perhaps scanning up and down the line to find 
> something relevant.
> 
> This literally drops out the bottom of determineStarAmpUsage() with a
> return TT_PointerOrReference;
> 
> For the bug I'm looking at, given the following example the first f() puts the * 
> as a pointer when actually its a multiply;
> 
> void f() { operator+(Bar, Foo *Foo); }
> 
> class A {
>    void operator+(Bar, Foo *Foo);
> }
> 
> Effectively these two instances of operator appears as the same with almost no 
> difference in terms of tokens,
> 
> Looking at the token annotations, there is almost nothing that distinguishes 
> between the two.
> 
> [snip]
> 
> We've always known this is an issue, and it's to do with the fact that we don't 
> have information near the "*" that tells us in what context it is.
> 
> I feel that actually part of the problem is that we are not setting the 
> "setType" on the tokens i.e. T=Unknown in all cases, I think we have more 
> contextual information
> than we think, I just feel like we throw it away a lot of the time.
> 
> Think about how you look at the code visually.. how do you determine that its a 
> definition rather than a declaration? Both bits of code are almost
> identical so what is it that's forcing us to see the difference?
> 
> Somehow I feel we should be classifying as many of the tokens as we process 
> them. Every (){} and , every type or return type. I feel with more information 
> we could
> solve these types of issues.
> 
> By introducing a few more types and then systematically trying to annotate all 
> the tokens we could have greater inference.
> 
> To me I feel, that given that we see operator as FunctionDeclarationName we 
> should then change the ( and ) to be OverloadedOperatorDeclarationLParen and 
> OverloadedOperatorDeclarationRParen
> 
> We should classify the "," in the declaration as FunctionDeclarationComma
> 
>  From that we could determine that Bar and Foo (1st) were Type declarations, and 
> that would make a rule for determining the * to be almost trivial.
> 
> I think in clang-format we need to take a more holistic approach to 
> TokenAnnotations, classify as many tokens as we can as accurately as we can. 
> T=Unknown is almost useless to us, and that
> lack of context easily means we misinterpret.
> 
> Any thoughts?
> 
> MyDeveloperDay
> 

Seems reasonable to me. I also think we should add and keep more information in 
the unwrapped line parser.

Kind regards,
Björn.