[llvm-bugs] [Bug 38870] New: Add warnings for unusual unicode identifiers

via llvm-bugs llvm-bugs at lists.llvm.org
Fri Sep 7 11:46:12 PDT 2018


https://bugs.llvm.org/show_bug.cgi?id=38870

            Bug ID: 38870
           Summary: Add warnings for unusual unicode identifiers
           Product: clang
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Frontend
          Assignee: unassignedclangbugs at nondot.org
          Reporter: jyknight at google.com
                CC: llvm-bugs at lists.llvm.org

The use of Unicode in source code can cause great confusion to readers. This is
true in general, but it is exacerbated by the fact that the C++11/C11 standards
define an identifier syntax which does not follow the Unicode Consortium's
recommendation for identifiers (http://www.unicode.org/reports/tr31/).


Notably, the C/C++ syntax allows a number of code points which are invisible
format characters (Unicode general category "Cf"). One instance which I
just ran into in actual source code (causing great consternation!) is U+200B
ZERO WIDTH SPACE. In multiple cases in our codebase, developers have managed to
type (or copy/paste) that character into an identifier declaration, and then
used editor completion to copy that mistake into the uses of the name, too.
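
To make the failure mode concrete, here is a minimal sketch (not part of the
proposal, and not how clang would implement it) of a standalone scan that
reports every character of general category "Cf" in a piece of source text.
The function name find_format_chars is hypothetical; the category data comes
from Python's standard unicodedata module, so results depend on the Unicode
version bundled with the interpreter.

```python
import unicodedata


def find_format_chars(source):
    """Return (line, column, description) for each Cf character in source.

    Line and column are 1-based; column counts code points, not bytes.
    """
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            # "Cf" is the Unicode general category for format characters,
            # which includes U+200B ZERO WIDTH SPACE.
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col_no, f"U+{ord(ch):04X} {name}"))
    return hits
```

Calling find_format_chars("int foo\u200bbar = 0;") reports one hit at line 1,
column 8, even though the declaration renders as an ordinary "foobar" in most
editors.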


I believe clang should have warnings for this -- both a default-on warning for
the characters that really shouldn't be present at all, and some optional
warnings for further restricting the character set used.

I propose three new warnings. None of these should warn on the use of \u
escapes in an identifier, *only* on Unicode characters appearing literally in
the source code.

-Wunicode-identifier-unusual:
  Warn on the use of a non-escaped character in an identifier which is not a
usual identifier character (per Unicode Consortium UAX#31's ID_Continue). This
will warn on the usage of invisible format characters, amongst others. Enabled
by default.

-Wunicode-identifier-unusual=2:
  Warn on the use of a non-escaped character in an identifier which is not
valid per UAX#31's ID_Continue table, or which is listed in UAX#31's
additional candidates for exclusion table.

-Wunicode-identifier:
  Warn on the use of any unescaped non-ASCII character in an identifier.
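
As a rough illustration of how the three levels nest, the sketch below maps an
identifier (already containing its literal characters, so \u escapes are out
of scope) to the set of proposed flags that would fire. Several things here
are assumptions: the function names are hypothetical; str.isidentifier()
checks XID_Continue, which is close to but not the same as ID_Continue; and
the candidates-for-exclusion table is not available in the Python standard
library, so it is modeled as a caller-supplied set that defaults to empty.

```python
def is_id_continue_approx(ch):
    """Approximate ID_Continue using Python's XID_Continue-based check."""
    # Prefixing an ASCII letter means ch is tested as a continue character,
    # not as a start character.
    return ("a" + ch).isidentifier()


def warnings_for(identifier, excluded=frozenset()):
    """Return the set of proposed warning flags triggered by identifier.

    excluded stands in for the UAX#31 candidates-for-exclusion table.
    """
    fired = set()
    for ch in identifier:
        if ord(ch) < 0x80:
            continue  # plain ASCII never triggers any of the three warnings
        fired.add("-Wunicode-identifier")
        if not is_id_continue_approx(ch):
            # Level 1 characters are also covered by the stricter level 2.
            fired.add("-Wunicode-identifier-unusual")
            fired.add("-Wunicode-identifier-unusual=2")
        elif ch in excluded:
            fired.add("-Wunicode-identifier-unusual=2")
    return fired
```

Under these assumptions an identifier containing U+200B triggers all three
flags, while one containing only an accented Latin letter triggers just
-Wunicode-identifier.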


The ID_Continue table has 707 ranges, and the table of additional candidates
for exclusion needed by the second-level warning has somewhere around 230
ranges.

Thus, this would probably add about 7K of extra static data to clang, which
doesn't seem an unreasonable amount.
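
The size estimate can be reproduced with a short calculation. The sketch below
assumes each range is stored as a pair of 32-bit code points (first, last);
that layout is an assumption, not something the proposal specifies. It also
shows the kind of binary-search lookup such a sorted range table allows. The
names RANGE_FORMAT, table_bytes and in_ranges are hypothetical, and the toy
table covers only the ASCII identifier characters, not the real ID_Continue
data.

```python
import bisect
import struct

# Assumed on-disk layout of one range: two unsigned 32-bit code points.
RANGE_FORMAT = "<II"


def table_bytes(num_ranges):
    """Size in bytes of a table holding num_ranges (first, last) pairs."""
    return num_ranges * struct.calcsize(RANGE_FORMAT)


def in_ranges(starts, ends, code_point):
    """Binary-search sorted, non-overlapping inclusive ranges."""
    # Find the last range whose start is <= code_point, then check its end.
    i = bisect.bisect_right(starts, code_point) - 1
    return i >= 0 and code_point <= ends[i]


# Toy table: ASCII digits, uppercase letters, underscore, lowercase letters.
TOY_STARTS = [0x30, 0x41, 0x5F, 0x61]
TOY_ENDS = [0x39, 0x5A, 0x5F, 0x7A]
```

With 707 + 230 = 937 ranges at 8 bytes each, table_bytes(707 + 230) gives
7496 bytes, which matches the estimate of about 7K.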


While checking the codebase for this sort of issue, some instances in string
constants were also found -- and although one such instance was a bug, I don't
believe strange characters in string constants are as obviously wrong, so I
don't propose to do anything about that.
