[cfe-dev] Musings about UCNs
Sean Hunt
rideau3 at gmail.com
Mon Jan 25 23:42:30 PST 2010
I've been eyeing UCNs (universal-character-names) for a while, and so
I've got a few musings to share; perhaps they will help whoever gets
around to implementing them.
Disclaimer: I'm basing this on the C++ spec. If there are differences
or incompatibilities in the C spec, I haven't noticed them.
Thoughts:
- We should probably use UTF-8 internally because it has a bunch of
nice features, like not breaking any existing code within clang.
- We could also accept UTF-8 as the default character encoding and
process extended characters directly. The driver should handle other
encodings by converting them to UTF-8.
- Pursuant to that, does clang currently assume it's being compiled on
an ASCII system?
- To reduce performance hits, we should scan a given identifier only
once to see if it contains any illegal characters. I'm thinking the
Token should store whether it contains a universal-character-name, just
as it stores whether or not it needs cleaning, and
IdentifierTable::get() would gain a defaulted parameter. If that
parameter is set and the identifier is not already in the table, a check
is performed, ideally against a precompiled trie.
- For literals, UCN processing will occur in the token lexer invoked
by Sema later on, including conversion to the execution character set if
necessary.
- How extended characters should be stored in names is unclear.
Ancient cxx-abi-dev discussions are undecided on whether simply using
UTF-8 is correct. GCC code seems to suggest this is the intent in the
long run.
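To make the identifier check and the UTF-8 idea above a bit more concrete, here is a rough C++ sketch, not actual clang code. The names parseUCN, encodeUTF8, isAllowedInIdentifier and AllowedRanges are all made up for illustration. The range table is only a tiny illustrative subset of the standard's list of characters allowed in identifiers, and a sorted table with binary search stands in for the precompiled trie suggested above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper type: an inclusive range of code points.
struct CodePointRange {
  uint32_t Lo, Hi;
};

// ASSUMPTION: illustrative subset only (a few Latin letter ranges). The real
// list lives in the standard's annex on UCNs allowed in identifiers.
// Must stay sorted by Lo and non-overlapping for the binary search below.
constexpr CodePointRange AllowedRanges[] = {
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x00FF}};

// Binary search over the sorted table (a stand-in for a precompiled trie).
bool isAllowedInIdentifier(uint32_t CP) {
  auto It = std::upper_bound(
      std::begin(AllowedRanges), std::end(AllowedRanges), CP,
      [](uint32_t Value, const CodePointRange &R) { return Value < R.Lo; });
  if (It == std::begin(AllowedRanges))
    return false; // CP is below the first range.
  return CP <= std::prev(It)->Hi;
}

// Parses a complete UCN spelling, "\uXXXX" or "\UXXXXXXXX". On success,
// stores the code point and returns true. Only rejects malformed spellings,
// surrogates, and values above U+10FFFF; other restrictions are not modeled.
bool parseUCN(const std::string &Spelling, uint32_t &CodePoint) {
  if (Spelling.size() < 2 || Spelling[0] != '\\')
    return false;
  std::size_t NumDigits;
  if (Spelling[1] == 'u')
    NumDigits = 4;
  else if (Spelling[1] == 'U')
    NumDigits = 8;
  else
    return false;
  if (Spelling.size() != 2 + NumDigits)
    return false;

  uint32_t CP = 0;
  for (std::size_t I = 2; I < Spelling.size(); ++I) {
    char C = Spelling[I];
    uint32_t Digit;
    if (C >= '0' && C <= '9')
      Digit = C - '0';
    else if (C >= 'a' && C <= 'f')
      Digit = C - 'a' + 10;
    else if (C >= 'A' && C <= 'F')
      Digit = C - 'A' + 10;
    else
      return false;
    CP = (CP << 4) | Digit;
  }
  if (CP > 0x10FFFF || (CP >= 0xD800 && CP <= 0xDFFF))
    return false;
  CodePoint = CP;
  return true;
}

// Encodes one code point as UTF-8, e.g. for storing an identifier's name.
std::string encodeUTF8(uint32_t CP) {
  assert(CP <= 0x10FFFF && "not a Unicode code point");
  std::string Out;
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
  return Out;
}
```

For example, parseUCN on the spelling \u00E9 yields U+00E9, which isAllowedInIdentifier accepts and encodeUTF8 turns into the two bytes 0xC3 0xA9. An identifier spelled with the UCN and one spelled with the raw UTF-8 character would then end up with the same stored name.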
Sean