[cfe-dev] Musings about UCNs

Mon Jan 25 23:42:30 PST 2010

I've been eyeing UCNs for a while, and so I've got a few musings to 
share; perhaps they will help whoever gets around to implementing them.

Disclaimer: I'm basing this off the C++ spec. If there are 
differences/incompatibilities with the C spec, I haven't noticed.

Thoughts:
  - We should probably use UTF-8 internally because it has a bunch of 
nice features, like not breaking any existing code within clang.
  - We could also accept UTF-8 as the default character encoding and 
process extended characters directly. The driver should handle other 
encodings by converting them to UTF-8.
  - Pursuant to that, does clang currently assume it's being compiled on 
an ASCII system?
  - To reduce performance hits, we should only scan a given identifier 
once to see if it contains any illegal characters. I'm thinking the 
Token should store whether it contains a universal-character-name, just 
as it stores whether or not it needs cleaning, and 
IdentifierTable::get() gets a default parameter added; if it's set and 
the identifier is not already in the table, then a check is performed, 
ideally on a precompiled trie.
  - For literals, UCN processing will occur in the token lexer invoked 
by Sema later on, including conversion to the execution character set if 
necessary.
  - How extended characters should be stored in names is unclear. 
Ancient cxx-abi-dev discussions are undecided on whether simply using 
UTF-8 is correct. GCC code seems to suggest this is the intent in the 
long run.
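
To make the "UTF-8 internally" idea concrete, here is a rough sketch of 
turning a spelling that contains UCNs into UTF-8. None of these 
functions exist in clang; parseUCN, appendUTF8 and spellingToUTF8 are 
names I made up for illustration, and the code assumes C++17. It only 
rejects malformed escapes, surrogates and values above 0x10FFFF; it does 
not check which code points the standard actually permits in 
identifiers.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: parse \uXXXX or \UXXXXXXXX at the start of text.
// On success returns the code point and sets consumed to the escape length.
std::optional<std::uint32_t> parseUCN(std::string_view text,
                                      std::size_t &consumed) {
  if (text.size() < 2 || text[0] != '\\')
    return std::nullopt;
  std::size_t digits;
  if (text[1] == 'u')
    digits = 4;
  else if (text[1] == 'U')
    digits = 8;
  else
    return std::nullopt;
  if (text.size() < 2 + digits)
    return std::nullopt;
  std::uint32_t cp = 0;
  for (std::size_t i = 0; i < digits; ++i) {
    char c = text[2 + i];
    std::uint32_t v;
    if (c >= '0' && c <= '9')
      v = c - '0';
    else if (c >= 'a' && c <= 'f')
      v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F')
      v = c - 'A' + 10;
    else
      return std::nullopt; // not a hex digit
    cp = (cp << 4) | v;
  }
  consumed = 2 + digits;
  return cp;
}

// Hypothetical helper: append the UTF-8 encoding of cp to out.
// Returns false for surrogates and for values above 0x10FFFF.
bool appendUTF8(std::uint32_t cp, std::string &out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return true;
}

// Hypothetical helper: copy a spelling, replacing each UCN with UTF-8.
// Any backslash that does not start a well-formed UCN is an error here.
std::optional<std::string> spellingToUTF8(std::string_view spelling) {
  std::string out;
  for (std::size_t i = 0; i < spelling.size();) {
    if (spelling[i] != '\\') {
      out += spelling[i++];
      continue;
    }
    std::size_t consumed = 0;
    auto cp = parseUCN(spelling.substr(i), consumed);
    if (!cp || !appendUTF8(*cp, out))
      return std::nullopt;
    i += consumed;
  }
  return out;
}
```

With something like this, an identifier spelled with a UCN and the same 
identifier typed directly in UTF-8 would end up as the same byte string, 
so they could share one IdentifierTable entry. The one-time legality 
check mentioned above would then run on the decoded code points.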

Sean