[cfe-dev] Musings about UCNs

Tue Jan 26 12:23:02 PST 2010

On Jan 25, 2010, at 11:42 PM, Sean Hunt wrote:

> I've been eyeing UCNs for a while, and so I've got a few musings to
> share; perhaps they will help whoever gets around to implementing  
> them.
>
> Disclaimer: I'm basing this off the C++ spec. If there are
> differences/incompatibilities for the C spec, I haven't noticed.
>
> Thoughts:
>  - We should probably use UTF-8 internally because it has a bunch of
> nice features, like not breaking any existing code within clang.

yes.

>  - We could also accept UTF-8 as the default character encoding and
> process extended characters directly. The driver should handle other
> encodings by converting them to UTF-8.

We should have SourceMgr do this; the driver doesn't know about all
the headers, etc.
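
As a rough illustration of the kind of translation SourceMgr could apply
when a buffer is loaded, here is a minimal sketch that transcodes
ISO-8859-1 to UTF-8. The function name latin1ToUTF8 is hypothetical and is
not a clang API; Latin-1 is only chosen because each byte maps directly to
the code point of the same value.

```cpp
#include <string>

// Hypothetical helper (not part of clang): transcode an ISO-8859-1 buffer
// to UTF-8. Every Latin-1 byte is the Unicode code point of the same value.
std::string latin1ToUTF8(const std::string &In) {
  std::string Out;
  Out.reserve(In.size());
  for (unsigned char C : In) {
    if (C < 0x80) {
      // ASCII passes through unchanged.
      Out.push_back(static_cast<char>(C));
    } else {
      // U+0080..U+00FF take two bytes: 110xxxxx 10xxxxxx.
      Out.push_back(static_cast<char>(0xC0 | (C >> 6)));
      Out.push_back(static_cast<char>(0x80 | (C & 0x3F)));
    }
  }
  return Out;
}
```

Doing this once per buffer means every header goes through the same
translation, and the lexer only ever sees UTF-8.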

>  - Pursuant to that, does clang currently assume it's being compiled on
> an ASCII system?

Yes; we don't care about non-ASCII systems. If we ever do, SourceMgr can
translate for those as well.

>  - To reduce performance hits, we should only scan a given identifier
> once to see if it contains any illegal characters.

Yes, the lexer should just handle this in the identifier lexing code.
The common case is "no UCN", so any UCN character should cause a
branch out of the fast path into the existing slow case of identifier
lexing.
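
A minimal sketch of that fast-path/slow-path split, under some stated
assumptions: the names IdentResult, ucnLength and lexIdentifier are
hypothetical and do not match clang's Lexer, the caller is assumed to have
already checked that the first character may start an identifier, and
neither UTF-8 validity nor the set of code points allowed in identifiers
is checked here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct IdentResult {
  std::size_t Length;  // bytes consumed from the buffer
  bool TookSlowPath;   // true if a backslash or non-ASCII byte was seen
};

static bool isASCIIIdentBody(unsigned char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_';
}

// Returns the byte length of a \uXXXX or \UXXXXXXXX sequence starting at
// Pos, or 0 if there is no well-formed UCN there.
static std::size_t ucnLength(const std::string &Buf, std::size_t Pos) {
  if (Pos + 1 >= Buf.size() || Buf[Pos] != '\\')
    return 0;
  std::size_t Digits = Buf[Pos + 1] == 'u' ? 4 : Buf[Pos + 1] == 'U' ? 8 : 0;
  if (Digits == 0 || Pos + 2 + Digits > Buf.size())
    return 0;
  for (std::size_t I = 0; I < Digits; ++I)
    if (!std::isxdigit(static_cast<unsigned char>(Buf[Pos + 2 + I])))
      return 0;
  return 2 + Digits;
}

IdentResult lexIdentifier(const std::string &Buf, std::size_t Start) {
  std::size_t Pos = Start;

  // Fast path: a tight loop over plain ASCII identifier characters.
  while (Pos < Buf.size() && isASCIIIdentBody(Buf[Pos]))
    ++Pos;
  if (Pos == Buf.size() ||
      (Buf[Pos] != '\\' && static_cast<unsigned char>(Buf[Pos]) < 0x80))
    return {Pos - Start, false};

  // Slow path: only reached when a backslash or a non-ASCII byte shows up.
  while (Pos < Buf.size()) {
    if (isASCIIIdentBody(Buf[Pos]) ||
        static_cast<unsigned char>(Buf[Pos]) >= 0x80) {
      ++Pos;  // ASCII body character or raw UTF-8 byte (not validated)
    } else if (std::size_t N = ucnLength(Buf, Pos)) {
      Pos += N;  // well-formed UCN
    } else {
      break;  // anything else ends the identifier
    }
  }
  return {Pos - Start, true};
}
```

The point of the split is that identifiers without UCNs never pay for the
extra checks; the slow case is also where things like escaped newlines
would be handled.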

> I'm thinking the
> Token should store whether it contains a universal-character as it
> stores whether or not it needs cleaning, and IdentifierTable::get() gets
> a default parameter added; if it's set and the identifier is not already
> in the table, then a check is performed, ideally on a precompiled trie.

I don't think this is necessary.  The IdentifierInfo* should contain
the canonicalized UTF-8 encoding, and the spelling is whatever is in
the code (after SourceMgr switches the character set).
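
To make the spelling vs. canonical-name distinction concrete, here is a
small sketch that rewrites each UCN in a spelling as the UTF-8 bytes of its
code point, so that an identifier written with a UCN and the same
identifier written directly in UTF-8 end up with one key. The names
appendUTF8 and canonicalizeIdentifier are hypothetical, not clang APIs, and
"canonicalized" here only means "UCNs replaced by UTF-8"; no Unicode
normalization is attempted.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of CP, assumed to be a valid scalar value
// (at most 0x10FFFF and not a surrogate).
void appendUTF8(std::uint32_t CP, std::string &Out) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

static int hexValue(char C) {
  if (C >= '0' && C <= '9') return C - '0';
  if (C >= 'a' && C <= 'f') return C - 'a' + 10;
  if (C >= 'A' && C <= 'F') return C - 'A' + 10;
  return -1;
}

// Replace every well-formed \uXXXX / \UXXXXXXXX in Spelling with UTF-8.
// Anything else, including raw UTF-8 bytes, is copied through unchanged.
std::string canonicalizeIdentifier(const std::string &Spelling) {
  std::string Out;
  for (std::size_t I = 0; I < Spelling.size();) {
    std::size_t Digits = 0;
    if (Spelling[I] == '\\' && I + 1 < Spelling.size())
      Digits = Spelling[I + 1] == 'u' ? 4 : Spelling[I + 1] == 'U' ? 8 : 0;
    if (Digits != 0 && I + 2 + Digits <= Spelling.size()) {
      std::uint32_t CP = 0;
      bool OK = true;
      for (std::size_t J = 0; J < Digits && OK; ++J) {
        int V = hexValue(Spelling[I + 2 + J]);
        if (V < 0)
          OK = false;
        else
          CP = (CP << 4) | static_cast<std::uint32_t>(V);
      }
      if (OK && CP <= 0x10FFFF && !(CP >= 0xD800 && CP <= 0xDFFF)) {
        appendUTF8(CP, Out);
        I += 2 + Digits;
        continue;
      }
    }
    Out += Spelling[I++];
  }
  return Out;
}
```

With something like this, the identifier table is keyed on the canonical
string only, and the token's spelling stays whatever appeared in the
source buffer.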

>  - For literals, UCN processing will occur in the token lexer invoked
> by Sema later on, including conversion to the execution character set if
> necessary.

Sure.

>  - How extended characters should be stored in names is unclear.
> Ancient cxx-abi-dev discussions are undecided on whether simply using
> UTF-8 is correct. GCC code seems to suggest this is the intent in the
> long run.

Storing canonicalized UTF-8 in the identifiers is the only reasonable
thing to do.

-Chris