[llvm-bugs] [Bug 38870] New: Add warnings for unusual unicode identifiers
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri Sep 7 11:46:12 PDT 2018
https://bugs.llvm.org/show_bug.cgi?id=38870
Bug ID: 38870
Summary: Add warnings for unusual unicode identifiers
Product: clang
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: Frontend
Assignee: unassignedclangbugs at nondot.org
Reporter: jyknight at google.com
CC: llvm-bugs at lists.llvm.org
The use of Unicode in source code can cause great confusion to readers. This is
true in general, but it is exacerbated by the fact that the C++11/C11 standards
define a syntax for identifiers which does not follow the Unicode Consortium's
recommendation for identifiers (http://www.unicode.org/reports/tr31/).
Notably, the C/C++ syntax allows a number of code points which are invisible
format characters (Unicode general category "Cf"). One instance which I just
ran into in actual source code (causing great consternation!) is U+200B ZERO
WIDTH SPACE. In multiple cases in our codebase, developers have managed to
type (or copy/paste) that character into an identifier declaration, and then
used editor completion to copy that mistake into the uses of the name, too.
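To illustrate why this is so hard to spot, here is a minimal sketch (not part of
clang) of how a tool could look for that character in UTF-8 source text. The
helper name findZeroWidthSpace is made up for this example. It relies only on
the fact that U+200B is encoded in UTF-8 as the bytes E2 80 8B.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, for illustration only.
// Returns the byte offset of the first U+200B ZERO WIDTH SPACE in a UTF-8
// encoded string, or std::string_view::npos if there is none.
// A plain byte search is sufficient because 0xE2 is always a lead byte in
// well-formed UTF-8, so the sequence cannot match in the middle of another
// character.
inline std::size_t findZeroWidthSpace(std::string_view utf8) {
  return utf8.find("\xE2\x80\x8B");
}

// Two spellings that render identically in most editors but are different
// identifiers as far as the compiler is concerned.
inline constexpr std::string_view kCleanName = "count";
inline constexpr std::string_view kTaintedName = "count\xE2\x80\x8B";
```

Calling findZeroWidthSpace(kTaintedName) reports the offset just past the
visible letters, while kCleanName and kTaintedName compare unequal even though
they look the same on screen.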
I believe clang should have warnings for this -- both a default-on warning for
the characters that really shouldn't be present at all, and some optional
warnings for further restricting the character set used.
I propose 3 new warnings. None of these should warn on the use of \u escapes in
an identifier, *only* on Unicode characters appearing literally in the source
code.
-Wunicode-identifier-unusual:
Warn on the use of a non-escaped character in an identifier which is not a
usual identifier character (per Unicode Consortium UAX#31's ID_Continue). This
will warn on the usage of invisible format characters, amongst others. Enabled
by default.
-Wunicode-identifier-unusual=2:
Warn on the use of a non-escaped character in an identifier which is not
valid per UAX#31's ID_Continue table, or which is listed in UAX#31's
additional candidates for exclusion table.
-Wunicode-identifier:
Warn on the use of any unescaped non-ASCII character in an identifier.
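The three levels above could be driven by a single classification of each
literal (non-escaped) code point. The following is only a sketch under stated
assumptions: the type and function names are invented for this example, and the
two tables are tiny hand-picked samples standing in for the real ID_Continue
and exclusion-candidate data, which would be generated from the Unicode data
files.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>

// All names below are hypothetical; none of this is existing clang code.
struct CodePointRange {
  std::uint32_t lo;  // first code point in the range (inclusive)
  std::uint32_t hi;  // last code point in the range (inclusive)
};

enum class IdentifierCharClass { Ascii, Usual, ExclusionCandidate, Unusual };

// Sample stand-in for the ID_Continue table: sorted, non-overlapping.
constexpr CodePointRange kIdContinueSample[] = {
    {0x00C0, 0x00D6},  // Latin-1 letters
    {0x00D8, 0x00F6},  // Latin-1 letters
    {0x0391, 0x03A1},  // Greek capital letters
    {0x16A0, 0x16EA},  // Runic letters
};

// Sample stand-in for the "candidates for exclusion" table.
constexpr CodePointRange kExclusionSample[] = {
    {0x16A0, 0x16EA},  // Runic letters
};

// Binary search over a sorted range table.
template <std::size_t N>
bool contains(const CodePointRange (&table)[N], std::uint32_t cp) {
  // Find the first range whose upper bound is not below cp.
  const CodePointRange* it = std::lower_bound(
      std::begin(table), std::end(table), cp,
      [](const CodePointRange& r, std::uint32_t c) { return r.hi < c; });
  return it != std::end(table) && it->lo <= cp;
}

// Classify one code point that appeared literally inside an identifier.
// The caller is assumed to skip code points spelled with \u escapes.
IdentifierCharClass classify(std::uint32_t cp) {
  if (cp < 0x80) return IdentifierCharClass::Ascii;
  if (!contains(kIdContinueSample, cp)) return IdentifierCharClass::Unusual;
  if (contains(kExclusionSample, cp))
    return IdentifierCharClass::ExclusionCandidate;
  return IdentifierCharClass::Usual;
}
```

Under this scheme -Wunicode-identifier-unusual would fire on Unusual,
-Wunicode-identifier-unusual=2 on Unusual or ExclusionCandidate, and
-Wunicode-identifier on anything other than Ascii.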
The ID_Continue table has 707 ranges, and a table of excluded ranges for the
additional candidates to exclude for the 2nd level warning has somewhere around
230 ranges.
Thus, this would probably add about 7K of extra static data to clang, which
doesn't seem an unreasonable amount.
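As a rough check of that estimate, the sketch below assumes each range is
stored as a pair of 32-bit code points (8 bytes per range). That layout and the
names used are assumptions made for this example, not a description of how
clang stores its tables.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed storage layout: one inclusive [first, last] pair per range.
struct StoredRange {
  std::uint32_t first;
  std::uint32_t last;
};
static_assert(sizeof(StoredRange) == 8, "two 32-bit code points per range");

constexpr std::size_t kIdContinueRanges = 707;  // figure quoted above
constexpr std::size_t kExclusionRanges = 230;   // approximate figure quoted above

constexpr std::size_t kTotalRanges = kIdContinueRanges + kExclusionRanges;
constexpr std::size_t kTotalBytes = kTotalRanges * sizeof(StoredRange);

// 937 ranges at 8 bytes each is 7496 bytes, a little over 7 KiB.
static_assert(kTotalRanges == 937, "707 + 230");
static_assert(kTotalBytes == 7496, "937 * 8");
```

A more compact encoding would reduce this further, but even the naive layout
lands at roughly 7K.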
While checking the codebase for this sort of issue, some instances in string
constants were also found -- and although one such instance was a bug, I don't
believe strange characters in string constants are as obviously wrong, so I
don't propose to do anything about that.