[cfe-dev] Wide strings and clang::StringLiteral.

Neil Booth neil at daikokuya.co.uk
Fri Dec 5 16:02:23 PST 2008


Chris Lattner wrote:-

>> The standard also requires input to be in the current locale; is
>> there any need to be more relaxed?
>
> No.

Right, and if you want a different locale there's always a
command-line switch and setlocale().  Locales are at least supported
out of the box on any system with a C compiler.
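
For instance, a single setlocale() call early in driver startup is
all it takes (just a sketch; user_locale, set from a hypothetical
command-line switch, is a made-up name):

  #include <locale.h>

  /* Adopt the user's locale for character classification and
     conversion; mbtowc()/mbrtowc() then interpret source bytes
     according to it.  user_locale is hypothetical, set from a
     command-line switch.  */
  setlocale (LC_CTYPE, user_locale ? user_locale : "");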

>> Realistically all the source
>> has to be in the same charset, and that charset must include the
>> ability to read the system headers.  You then just get to use
>> mbtowc in a few places.
>
> Can you give some pseudocode of what you mean?

Below, where I write ASCII, I really mean the basic source charset;
the same logic works if you're on an EBCDIC host or if your
shift-to-extended-charset character is ASCII 26 (?).

I'm just talking about, e.g.:

a) the main switch statement of the lexer: the default case
   (assuming you want to accept native-charset identifiers,
   which is a nice touch and not hard to do) becomes a
   call to lex_identifier(), and produces an "other" token
   only if that call doesn't succeed.
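
   Something like the following, say (lex_identifier() and
   make_other_token() are placeholder names, not clang's):

     switch (c)
       {
       /* ... cases for ASCII letters, digits, punctuators, ... */

       default:
         /* Try to lex a (possibly multibyte) native-charset
            identifier; only if that fails is this an "other"
            token.  */
         if (lex_identifier (lexer, c, token))
           return;
         make_other_token (token, c);
         return;
       }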

b) comments: there's not much to do here but lex in a
   multibyte-aware fashion.  The main loop of my mb-aware
   block-comment lexer is, for example:

  while (!is_at_eof (lexer))
    {
      prevc = c;
      c = lexer_get_clean_mbchar (lexer, token);

      /* People like decorating comments with '*', so check for '/'
         instead for efficiency.  */
      if (c == '/' && prevc == '*')
	return;
    }

   which is one way of doing it.  If you're worried about
   performance, you could either support multibyte charsets
   only as a compile-time option, so people who don't want it
   don't pay for it, or fall through to the generic, slower
   mbchar-aware code only once you've read a non-ASCII
   character.  Most comments are pure ASCII, so they won't
   hit the slow path.
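
   The second option might look roughly like this (a sketch:
   EOF checks are elided, and skip_block_comment_mb() stands
   in for the generic loop above):

     const unsigned char *cur = lexer->cur;
     int prevc = 0, c;

     for (;;)
       {
         c = *cur++;
         if (c == '/' && prevc == '*')
           break;
         if (c >= 0x80)
           {
             /* First non-ASCII byte: back up and finish the
                comment with the generic mb-aware loop.  */
             cur--;
             skip_block_comment_mb (lexer, token);
             return;
           }
         prevc = c;
       }
     lexer->cur = cur;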

c) identifiers: again the fast loop, and a slower loop if you
   want to support native identifiers and a non-ASCII character
   is encountered.  When hashing, convert to UTF-8 or similar;
   you need to do that if there are UCNs too.  You can flag
   these non-clean identifiers just as you flag trigraph or
   escaped-newline non-clean identifiers.
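
   Roughly, with made-up names:

     /* Fast loop over plain ASCII identifier characters.  */
     while (is_ascii_idchar (*cur))
       cur++;
     if (*cur >= 0x80)
       {
         /* Extended character seen: take the slow path, which
            converts to UTF-8 for hashing and flags the token
            as not clean, exactly as for trigraphs or escaped
            newlines.  */
         token->flags |= TOKEN_NOT_CLEAN;
         lex_identifier_mb (lexer, token);
       }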

d) numbers: if you look at the lexer grammar, these are a
   superset of identifiers with '+', '-' and '.' characters
   added.  They could be pasted with an identifier to create
   another identifier, for example.  Reuse the identifier
   logic, or cut-and-paste it.
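
   The extra acceptance test is small; per the C99 pp-number
   grammar, a sign is accepted only after e, E, p or P (a
   sketch, reusing the made-up is_ascii_idchar() from above):

     /* Does C continue a pp-number, given the previous
        character?  */
     static int
     continues_pp_number (int c, int prevc)
     {
       return is_ascii_idchar (c) || c == '.'
              || ((c == '+' || c == '-')
                  && (prevc == 'e' || prevc == 'E'
                      || prevc == 'p' || prevc == 'P'));
     }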

e) literals (strings, character constants and header names):
   in my case they use the lexer_get_clean_mbchar() function
   shown above, but you could do a fast-track/slow-track split
   here too.
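
   e.g. the body of a literal might read as follows (escape and
   EOF handling trimmed right down; terminator is '"', '\'' or
   '>' as appropriate):

     for (;;)
       {
         c = lexer_get_clean_mbchar (lexer, token);
         if (c == '\\')
           {
             /* The escaped character can't terminate the
                literal; consume it and carry on.  */
             lexer_get_clean_mbchar (lexer, token);
             continue;
           }
         if (c == terminator || c == '\n')
           return;
       }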

In my case mbchar support is a compile-time option; if it's
turned off, lexer_get_clean_mbchar is simply a macro that expands
to lexer_get_clean_char.
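
Concretely, something like this (ENABLE_MBCHAR is my illustration,
not an actual configure flag):

  #ifdef ENABLE_MBCHAR
  extern int lexer_get_clean_mbchar (lexer *, token *);
  #else
  /* With multibyte support compiled out, the mb-aware routine
     collapses to the single-byte one at no runtime cost.  */
  #define lexer_get_clean_mbchar(l, tok) lexer_get_clean_char (l, tok)
  #endif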

I've found that a fully mbchar-aware lexer is about 25-30% slower
than one with the support compiled out.  But I've not tried to
optimize comment and identifier lexing to have fast and slow
paths, so 25-30% is probably a worst-case slowdown.

Neil.


