[cfe-dev] Wide strings and clang::StringLiteral.
Neil Booth
neil at daikokuya.co.uk
Fri Dec 5 16:02:23 PST 2008
Chris Lattner wrote:-
>> The standard also requires input to be in the current locale; is
>> there any need to be more relaxed?
>
> No.
Right, and if you want a different locale there's always a command-line
switch and setlocale(). Locales are at least supported out of the box
on any system with a C compiler.
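By that I mean nothing more exotic than the standard C dance;
a minimal sketch (the function names here are illustrative only):

#include <locale.h>
#include <stdlib.h>

/* Call once at startup: adopt the user's locale for character
   handling, so mbtowc interprets multibyte sequences in the
   current locale's charset.  */
void
init_charset (void)
{
  setlocale (LC_CTYPE, "");
}

/* Decode one multibyte character starting at P, with LIMIT - P
   bytes available.  Stores the wide character in *WC and returns
   its length in bytes, or -1 for a sequence that is invalid in
   the current locale (diagnose it).  */
int
decode_mbchar (const char *p, const char *limit, wchar_t *wc)
{
  return mbtowc (wc, p, (size_t) (limit - p));
}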
>> Realistically all the source
>> has to be in the same charset, and that charset must include the
>> ability to read the system headers. You then just get to use
>> mbtowc in a few places.
>
> Can you give some pseudocode of what you mean?
Below, where I write ASCII, I really mean the basic source charset;
the same logic works on EBCDIC hosts, or if your
shift-to-extended-charset character is ASCII 26 (?).
I'm just talking about e.g.
a) the main switch statement of the lexer: the default case
(assuming you want to accept native-charset identifiers,
which is a nice touch and not hard to do) becomes a call
to lex_identifier(), and produces an "other" token only
if that call fails.
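Concretely, something like this; just a sketch, with
lex_identifier() and CPP_OTHER standing in for whatever your
lexer actually calls these things:

switch (c)
  {
  /* ... the usual ASCII cases ... */

  default:
    /* Possibly the first byte of an extended-charset
       identifier.  */
    if (lex_identifier (lexer, token))
      return;
    token->type = CPP_OTHER;
    return;
  }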
b) comments: there's not much to do here beyond lexing in a
multibyte-aware fashion. The main loop of my mb-aware block
comment lexer is, for example:
while (!is_at_eof (lexer))
  {
    prevc = c;
    c = lexer_get_clean_mbchar (lexer, token);
    /* People like decorating comments with '*', so check for '/'
       instead for efficiency.  */
    if (c == '/' && prevc == '*')
      return;
  }
which is one way of doing it. If you're worried about
performance, you could (i) support multibyte charsets only as
a compile-time option, so people who don't want it don't pay
for it, or (ii) fall through to the generic, slower mbchar-aware
code only once you've read a non-ASCII character. Most comments
are pure ASCII, so they won't hit the slow path.
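A sketch of option (ii) for block comments; lexer->cur,
is_at_eof() and lex_mb_block_comment() are illustrative names,
not anyone's real API:

static void
lex_block_comment (cpp_lexer *lexer, cpp_token *token)
{
  unsigned char c, prevc = 0;

  /* Fast path: scan a byte at a time while the text stays
     pure ASCII.  */
  while (!is_at_eof (lexer))
    {
      c = *lexer->cur++;
      if (c == '/' && prevc == '*')
        return;
      if (c >= 0x80)
        {
          /* First non-ASCII byte: back up and continue in the
             slower mb-aware loop shown above.  */
          lexer->cur--;
          lex_mb_block_comment (lexer, token);
          return;
        }
      prevc = c;
    }
}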
c) identifiers: again a fast loop, plus a slower loop entered if
you want to support native-charset identifiers and a non-ASCII
character is encountered. When hashing, convert the spelling to
UTF-8 or some other canonical form; you need to do that anyway
if UCNs are present. You can flag these non-clean identifiers
just as you flag identifiers made non-clean by trigraphs or
escaped newlines.
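In outline, with made-up flag and function names:

/* Once the identifier's spelling is gathered: if UCNs, escaped
   newlines or extended chars made it non-clean, convert the
   spelling to UTF-8 first, so every spelling of the same
   identifier hashes identically.  */
if (token->flags & (ID_HAS_UCN | ID_HAS_MBCHAR | ID_NOT_CLEAN))
  len = convert_to_utf8 (&spelling, len);
node = hash_identifier (lexer, spelling, len);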
d) numbers: if you look at the lexer grammar, these are a
superset of identifiers that also admits '+', '-' and '.'
characters. A number could be pasted with an identifier to
create another identifier, for example. Reuse the identifier
logic, or cut and paste it.
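A sketch of the shared loop; is_idchar() is whatever predicate
the identifier loop already uses:

/* pp-number: after the first character, accept any identifier
   character or '.', plus '+' or '-' when the previous character
   was e, E, p or P.  */
for (;;)
  {
    c = *cur;
    if (is_idchar (c) || c == '.')
      ;
    else if ((c == '+' || c == '-')
             && (prevc == 'e' || prevc == 'E'
                 || prevc == 'p' || prevc == 'P'))
      ;
    else
      break;
    prevc = c;
    cur++;
  }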
e) literals (strings, character constants and header names):
in my case these use the lexer_get_clean_mbchar() function
shown above, but you could do a fast-track/slow-track thing
here too. I make mbchar support a compile-time option; if
it's turned off, lexer_get_clean_mbchar is a macro that
expands to lexer_get_clean_char, for example.
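I.e. roughly this, where MULTIBYTE_CHARS is just an
illustrative configure-time macro:

#ifdef MULTIBYTE_CHARS
cppchar_t lexer_get_clean_mbchar (cpp_lexer *, cpp_token *);
#else
/* Multibyte support compiled out: the mb-aware getter collapses
   to the plain single-byte one, costing nothing at runtime.  */
#define lexer_get_clean_mbchar(lexer, token) \
  lexer_get_clean_char (lexer, token)
#endif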
I've found that a fully mbchar-aware lexer is about 25-30%
slower than one with the support compiled out. But I've not
tried to optimize comment and identifier lexing to have fast
and slow paths, so 25-30% is probably a worst-case slowdown.
Neil.