[cfe-dev] [clang-format] Token Annotation limitations

MyDeveloper Day via cfe-dev cfe-dev at lists.llvm.org
Thu Dec 2 03:01:44 PST 2021


I've recently been looking into a bug in clang format, that incorrectly
treats * as a TT_PointerOrReference

void f() { operator+(a, b *b); }

Actually the misinterpretation of * and & as BinaryOperators or
PointerOrReference is a major source of bugs for clang-format, this is
because we don't have semantic information and often we are looking very
myopically at adjacent tokens and there just isn't enough context to
disambiguate.

So whilst we see b * b this a easily (b x b)

This is actually just

tok::identifier->tok::star->tok::identifier

Which is indistinguishable from

MyClass *b

which I think we'd all see as (class ptr b), but in all cases we have no
additional information, other than perhaps scanning up and down the line to
find something relevant.

This literally drops out the bottom of determineStarAmpUsage() with a
return TT_PointerOrReference;

For the bug I'm looking at, given the following example the first f() puts
the * as a pointer when actually its a multiply;

void f() { operator+(Bar, Foo *Foo); }

class A {
  void operator+(Bar, Foo *Foo);
}

Effectively these two instances of operator appears as the same with almost
no difference in terms of tokens,

Looking at the token annotations, there is almost nothing that
distinguishes between the two.

AnnotatedTokens(L=0):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=void L=4 PPK=2 FakeLParens=
FakeRParens=0 II=0x1baa688a118 Text='void'
 M=0 C=1 T=FunctionDeclarationName S=1 F=0 B=0 BK=0 P=80 Name=identifier
L=6 PPK=2 FakeLParens= FakeRParens=0 II=0x1baa68bd9c8 Te
xt='f'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=23 Name=l_paren L=7 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text='('
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=140 Name=r_paren L=8 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text=')'
 M=0 C=0 T=FunctionLBrace S=1 F=0 B=0 BK=1 P=23 Name=l_brace L=10 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text='{'
----
AnnotatedTokens(L=1):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=operator L=8 PPK=2
FakeLParens= FakeRParens=0 II=0x1baa688a710 Text='operator'
 M=0 C=0 T=OverloadedOperator S=0 F=0 B=0 BK=0 P=33 Name=plus L=9 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text='+'
 M=0 C=0 T=OverloadedOperatorLParen S=0 F=0 B=0 BK=0 P=33 Name=l_paren L=10
PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text='('
 M=0 C=1 T=Unknown S=0 F=0 B=0 BK=0 P=59 Name=identifier L=13 PPK=2
FakeLParens=1/ FakeRParens=0 II=0x1baa68bd9f8 Text='Bar'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=41 Name=comma L=14 PPK=2 FakeLParens=
FakeRParens=0 II=0x0 Text=','
 M=0 C=1 T=Unknown S=1 F=0 B=0 BK=0 P=41 Name=identifier L=18 PPK=2
FakeLParens= FakeRParens=0 II=0x1baa68bda28 Text='Foo'
 M=0 C=1 T=PointerOrReference S=1 F=0 B=0 BK=0 P=230 Name=star L=20 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text='*'
 M=0 C=1 T=StartOfName S=0 F=0 B=0 BK=0 P=240 Name=identifier L=23 PPK=2
FakeLParens= FakeRParens=1 II=0x1baa68bda28 Text='Foo'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=43 Name=r_paren L=24 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text=')'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=23 Name=semi L=25 PPK=2 FakeLParens=
FakeRParens=0 II=0x0 Text=';'
----
AnnotatedTokens(L=0):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=r_brace L=1 PPK=2 FakeLParens=
FakeRParens=0 II=0x0 Text='}'
----
AnnotatedTokens(L=0):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=class L=5 PPK=2 FakeLParens=
FakeRParens=0 II=0x1baa688a4d8 Text='class'
 M=0 C=1 T=Unknown S=1 F=0 B=0 BK=0 P=5020 Name=identifier L=7 PPK=2
FakeLParens= FakeRParens=0 II=0x1baa68bda58 Text='A'
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=1 P=23 Name=l_brace L=9 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text='{'
----
AnnotatedTokens(L=1):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=void L=4 PPK=2 FakeLParens=
FakeRParens=0 II=0x1baa688a118 Text='void'
 M=0 C=1 T=FunctionDeclarationName S=1 F=0 B=0 BK=0 P=80 Name=operator L=13
PPK=2 FakeLParens= FakeRParens=0 II=0x1baa688a710 Tex
t='operator'
 M=0 C=0 T=OverloadedOperator S=0 F=0 B=0 BK=0 P=33 Name=plus L=14 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text='+'
 M=0 C=0 T=OverloadedOperatorLParen S=0 F=0 B=0 BK=0 P=33 Name=l_paren L=15
PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text='('
 M=0 C=1 T=Unknown S=0 F=0 B=0 BK=0 P=140 Name=identifier L=18 PPK=2
FakeLParens=1/ FakeRParens=0 II=0x1baa68bd9f8 Text='Bar'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=41 Name=comma L=19 PPK=2 FakeLParens=
FakeRParens=0 II=0x0 Text=','
 M=0 C=1 T=Unknown S=1 F=0 B=0 BK=0 P=41 Name=identifier L=23 PPK=2
FakeLParens= FakeRParens=0 II=0x1baa68bda28 Text='Foo'
 M=0 C=1 T=PointerOrReference S=1 F=0 B=0 BK=0 P=230 Name=star L=25 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text='*'
 M=0 C=1 T=StartOfName S=0 F=0 B=0 BK=0 P=240 Name=identifier L=28 PPK=2
FakeLParens= FakeRParens=1 II=0x1baa68bda28 Text='Foo'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=43 Name=r_paren L=29 PPK=2
FakeLParens= FakeRParens=0 II=0x0 Text=')'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=23 Name=semi L=30 PPK=2 FakeLParens=
FakeRParens=0 II=0x0 Text=';'
----
AnnotatedTokens(L=0):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=r_brace L=1 PPK=2 FakeLParens=
FakeRParens=0 II=0x0 Text='}'
...

We've always known this is an issue, and it's to do with the fact that we
don't have information near the "*" that tells us in what context it is.

I feel that actually part of the problem is that we are not setting the
"setType" on the tokens i.e. T=Unknown in all cases, I think we have more
contextual information
than we think, I just feel like we throw it away a lot of the time.

Think about how you look at the code visually.. how do you determine that
its a definition rather than a declaration? Both bits of code are almost
identical so what is it that's forcing us to see the difference?

Somehow I feel we should be classifying as many of the tokens as we process
them. Every (){} and , every type or return type. I feel with more
information we could
solve these types of issues.

By introducing a few more types and then systematically trying to annotate
all the tokens we could have greater inference.

To me I feel, that given that we see operator as FunctionDeclarationName we
should then change the ( and ) to be OverloadedOperatorDeclarationLParen
and OverloadedOperatorDeclarationRParen

We should classify the "," in the declaration as FunctionDeclarationComma

>From that we could determine that Bar and Foo (1st) were Type declarations,
and that would make a rule for determining the * to be almost trivial.

I think in clang-format we need to take a more holistic approach to
TokenAnnotations, classify as many tokens as we can as accurately as we
can. T=Unknown is almost useless to us, and that
lack of context easily means we misinterpret.

Any thoughts?

MyDeveloperDay
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211202/ce1eb145/attachment-0001.html>


More information about the cfe-dev mailing list