[cfe-dev] unicode identifiers

Scott Conger scott.conger at gmail.com
Tue Jun 21 13:21:45 PDT 2011


I'm afraid parsing the text is the smaller part of the problem.
I've only looked into this briefly, but it was enough to realize
you're going to run into platform-specific linking and tool issues.

Windows, for example, normally accepts UTF-16 strings in API functions
that take strings as input, but the documentation of GetProcAddress
says this about the name parameter:

lpProcName [in]

    The function or variable name, or the function's ordinal value. If
    this parameter is an ordinal value, it must be in the low-order
    word; the high-order word must be zero.

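The parameter is declared as an LPCSTR, a narrow (8-bit) string, and
GetProcAddress has no wide-character variant. Essentially, they
constrained the character set of identifier names to the basic ASCII
characters. You probably won't be able to get a library with Unicode
characters in its symbol names to link. Even if you could, LLVM would
need the brains to convert the UTF-8 string to UTF-16, which Windows
normally expects.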
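
To make the conversion step concrete, here is a minimal, portable
sketch in C++. The names is_ascii_symbol and utf8_to_utf16 are
hypothetical helpers of mine, not LLVM or Windows APIs; on Windows
itself MultiByteToWideChar with CP_UTF8 does the same job. It assumes
symbol names arrive as UTF-8 byte strings.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: true if every byte of the name is 7-bit ASCII,
// i.e. the name is safe for tools that only handle narrow ASCII names.
bool is_ascii_symbol(const std::string& name) {
    for (unsigned char c : name) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Hypothetical helper: decode UTF-8 and re-encode as UTF-16 code units.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        i += len;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A tool could call is_ascii_symbol first and only fall back to
utf8_to_utf16 for names that actually contain non-ASCII bytes. Whether
the linker or loader then accepts the result is a separate question.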

As far as I could tell, you would need to go platform by platform and
see if they demand any special rules for identifiers in executables
and libraries. Unfortunately, most of them seem to say nothing
explicit about it in the documentation.

The last time I checked, gcc required -fextended-identifiers, which
was marked as experimental.

-Scott

2011/6/21 Jochen Wilhelmy <j.wilhelmy at arcor.de>:
>
>> Just attach the patch to your email so it can be reviewed here.
>
> here you are. of course this influences the AsmWriter in llvm if
> it sees characters with the high bit set, for example I would not use
> the locale dependent function isalnum()
>
> -Jochen
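
On the isalnum() point: its result depends on the current locale, and
passing a plain char holding a byte with the high bit set is undefined
behavior unless it is first converted to unsigned char. A minimal
sketch of a locale-independent check follows. The names
is_ascii_alnum and has_high_bit_byte are hypothetical, and how a
writer should treat names with high-bit bytes (quote, escape, or pass
through) is left open here.

```cpp
#include <string>

// Hypothetical helper: ASCII-only alphanumeric test that ignores the
// C locale and is well defined for every byte value.
bool is_ascii_alnum(unsigned char c) {
    return (c >= '0' && c <= '9') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

// Hypothetical helper: true if the name contains any byte with the
// high bit set, e.g. part of a UTF-8 multi-byte sequence.
bool has_high_bit_byte(const std::string& name) {
    for (unsigned char c : name) {
        if (c & 0x80) return true;
    }
    return false;
}
```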