[cfe-dev] [clang-format] Token Annotation limitations
Björn Schäpers via cfe-dev
cfe-dev at lists.llvm.org
Thu Dec 2 11:37:54 PST 2021
Am 02.12.2021 um 12:01 schrieb MyDeveloper Day via cfe-dev:
> I've recently been looking into a bug in clang format, that incorrectly treats *
> as a TT_PointerOrReference
>
> void f() { operator+(a, b *b); }
>
> Actually the misinterpretation of * and & as BinaryOperators or
> PointerOrReference is a major source of bugs for clang-format, this is because
> we don't have semantic information and often we are looking very myopically at
> adjacent tokens and there just isn't enough context to disambiguate.
>
> So whilst we see b * b this a easily (b x b)
>
> This is actually just
>
> tok::identifier->tok::star->tok::identifier
>
> Which is indistinguishable from
>
> MyClass *b
>
> which I think we'd all see as (class ptr b), but in all cases we have no
> additional information, other than perhaps scanning up and down the line to find
> something relevant.
>
> This literally drops out the bottom of determineStarAmpUsage() with a
> return TT_PointerOrReference;
>
> For the bug I'm looking at, given the following example the first f() puts the *
> as a pointer when actually its a multiply;
>
> void f() { operator+(Bar, Foo *Foo); }
>
> class A {
> void operator+(Bar, Foo *Foo);
> }
>
> Effectively these two instances of operator appears as the same with almost no
> difference in terms of tokens,
>
> Looking at the token annotations, there is almost nothing that distinguishes
> between the two.
>
> [snip]
>
> We've always known this is an issue, and it's to do with the fact that we don't
> have information near the "*" that tells us in what context it is.
>
> I feel that actually part of the problem is that we are not setting the
> "setType" on the tokens i.e. T=Unknown in all cases, I think we have more
> contextual information
> than we think, I just feel like we throw it away a lot of the time.
>
> Think about how you look at the code visually.. how do you determine that its a
> definition rather than a declaration? Both bits of code are almost
> identical so what is it that's forcing us to see the difference?
>
> Somehow I feel we should be classifying as many of the tokens as we process
> them. Every (){} and , every type or return type. I feel with more information
> we could
> solve these types of issues.
>
> By introducing a few more types and then systematically trying to annotate all
> the tokens we could have greater inference.
>
> To me I feel, that given that we see operator as FunctionDeclarationName we
> should then change the ( and ) to be OverloadedOperatorDeclarationLParen and
> OverloadedOperatorDeclarationRParen
>
> We should classify the "," in the declaration as FunctionDeclarationComma
>
> From that we could determine that Bar and Foo (1st) were Type declarations, and
> that would make a rule for determining the * to be almost trivial.
>
> I think in clang-format we need to take a more holistic approach to
> TokenAnnotations, classify as many tokens as we can as accurately as we can.
> T=Unknown is almost useless to us, and that
> lack of context easily means we misinterpret.
>
> Any thoughts?
>
> MyDeveloperDay
>
Seems reasonable to me. I also think we should add and keep more information in
the unwrapped line parser.
Kind regards,
Björn.
More information about the cfe-dev
mailing list