[cfe-dev] A development plan for enhanced character handling

Sun Jun 7 15:34:39 PDT 2009

On Sun, Jun 7, 2009 at 2:39 PM, AlisdairM(public)<public at alisdairm.net> wrote:
> Follow-up to my stream of earlier emails!
>
> My initial pilot-project to get familiar with clang and its development
> process has grown somewhat, so I thought I had better outline a project plan
> before I get too far.
>
> So first question:
>  Is there a recognised process for initiating/tracking such projects?
>  Or is the community small and informal enough that this mailing list is
> sufficient?

This mailing list is sufficient for coordination; there aren't so many
people that we've ever needed a formal process.

> The plan for my project is to support:
>        New C++0x Unicode character types and literals
>        C99 Unicode TR character types
>        C++0x raw string literals (C++ only, or would ObjectiveC/C++ be
> interested?)
>        Support for UCNs in identifiers
>        Support source files encoded as UTF-8/UTF-16/UTF-32

All of those look good; they're mostly independent, though.

Note that Objective-C is generally considered as an extension to some
base language.  Therefore, ObjectiveC on C99 or C++98 wouldn't get raw
string literals, but ObjectiveC on C++0x would get them.
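
For concreteness, here is a rough sketch of what the first and third items
look like in source. It assumes the syntax as it was later standardized in
C++11 (the drafts current at the time of writing may differ, notably in the
raw string delimiters). The variable names and the unitCount helper are
made up for illustration.

```cpp
#include <cstddef>
#include <string>

// UTF-16 and UTF-32 string literals using the new character types.
const char16_t *Wind16 = u"\u98A8";     // one UTF-16 code unit
const char32_t *Face32 = U"\U0001F600"; // one UTF-32 code unit
const char16_t *Face16 = u"\U0001F600"; // surrogate pair: two code units

// Raw string literal: backslashes are ordinary characters, not escapes.
const char *Raw = R"(C:\new\table)";

// Hypothetical helper: number of code units before the terminating null.
template <typename CharT>
std::size_t unitCount(const CharT *S) {
  return std::char_traits<CharT>::length(S);
}
```

The point of the helper is only to show that a character outside the BMP
takes two code units in a u"" literal but one in a U"" literal.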

> The last of those points is probably most controversial, and I need clear
> guidance on how to do this efficiently.
>
> For source files, the ultimate plan would be to read the file into memory,
> then check for a BOM.
>
> If no BOM is present, assume the file is UTF-8 and proceed as today.
>        Permit but do not require a UTF-8 BOM
> If a UTF-16/32 BOM is present, transcode the file into UTF-8
>        Pass this UTF-8 encoded buffer into the pre-processor, as today
>
> What is not clear is whether I should keep the original source file in
> memory to help when reporting diagnostics, or whether we simply keep the
> UTF-8 buffer knowing we can transcode text back to the original encoding on
> demand if necessary.

We have to transcode on demand for generality: suppose the file is in
UTF-16 and the terminal is Shift JIS.  I can't think of any situation
where we would need access to the file in its original encoding.
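
The BOM check and transcode step proposed above could be sketched roughly
as follows. This is only an illustration under stated assumptions, not
clang code: SourceEncoding, BOMInfo, detectBOM, appendUTF8 and utf16ToUTF8
are all hypothetical names, and replacing malformed input with U+FFFD is
one possible policy (a real implementation would probably diagnose it).
UTF-32 transcoding is omitted; it would reuse appendUTF8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names throughout; none of this is clang's API.
enum class SourceEncoding { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

struct BOMInfo {
  SourceEncoding Encoding;
  std::size_t Length; // number of BOM bytes to skip
};

// Inspect the first bytes of a buffer. No BOM means "assume UTF-8".
BOMInfo detectBOM(const std::string &Buf) {
  auto B = [&](std::size_t I) { return static_cast<unsigned char>(Buf[I]); };
  std::size_t N = Buf.size();
  // UTF-32 is tested before UTF-16 because FF FE 00 00 starts with FF FE.
  // (A UTF-16LE file whose first character is U+0000 is misdetected here.)
  if (N >= 4 && B(0) == 0xFF && B(1) == 0xFE && B(2) == 0x00 && B(3) == 0x00)
    return {SourceEncoding::UTF32LE, 4};
  if (N >= 4 && B(0) == 0x00 && B(1) == 0x00 && B(2) == 0xFE && B(3) == 0xFF)
    return {SourceEncoding::UTF32BE, 4};
  if (N >= 3 && B(0) == 0xEF && B(1) == 0xBB && B(2) == 0xBF)
    return {SourceEncoding::UTF8, 3};
  if (N >= 2 && B(0) == 0xFF && B(1) == 0xFE)
    return {SourceEncoding::UTF16LE, 2};
  if (N >= 2 && B(0) == 0xFE && B(1) == 0xFF)
    return {SourceEncoding::UTF16BE, 2};
  return {SourceEncoding::UTF8, 0};
}

// Append one code point (assumed <= 0x10FFFF) to Out as UTF-8.
void appendUTF8(std::string &Out, std::uint32_t CP) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Transcode the UTF-16 bytes of Buf starting at Start (just past the BOM).
// Unpaired surrogates and an odd trailing byte become U+FFFD in this sketch.
std::string utf16ToUTF8(const std::string &Buf, std::size_t Start,
                        bool BigEndian) {
  std::string Out;
  auto Unit = [&](std::size_t I) -> std::uint32_t {
    unsigned char A = static_cast<unsigned char>(Buf[I]);
    unsigned char B = static_cast<unsigned char>(Buf[I + 1]);
    return BigEndian ? ((A << 8) | B) : ((B << 8) | A);
  };
  std::size_t I = Start;
  while (I + 1 < Buf.size()) {
    std::uint32_t U = Unit(I);
    I += 2;
    // A high surrogate followed by a low surrogate encodes one code point.
    if (U >= 0xD800 && U <= 0xDBFF && I + 1 < Buf.size()) {
      std::uint32_t Lo = Unit(I);
      if (Lo >= 0xDC00 && Lo <= 0xDFFF) {
        I += 2;
        appendUTF8(Out, 0x10000 + ((U - 0xD800) << 10) + (Lo - 0xDC00));
        continue;
      }
    }
    if (U >= 0xD800 && U <= 0xDFFF)
      U = 0xFFFD; // unpaired surrogate
    appendUTF8(Out, U);
  }
  if (I < Buf.size())
    appendUTF8(Out, 0xFFFD); // odd trailing byte
  return Out;
}
```

A caller would run detectBOM once on the freshly read buffer, skip Length
bytes, and hand either the original bytes (UTF-8) or the transcoded string
to the preprocessor, so everything downstream only ever sees UTF-8.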

> However, in order to ensure success the first task must be to audit the
> existing code to be sure that it correctly handles source files with UTF-8
> characters outside the basic ASCII set.  Any issues here should probably be
> my first order of business.

We don't calculate column numbers correctly in various cases involving
non-ASCII characters; I can't think of any other issues.

Somewhat nasty testcase for the column numbers:
void 風; void 風; void 風; void 風; void 風; void 風;
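
To illustrate why that line is awkward: each 風 is three bytes in UTF-8, so
a column computed from the byte offset drifts by two for every one that
precedes it. A minimal sketch of counting columns in code points instead,
assuming well-formed UTF-8 and an offset that points at the start of a
character (columnInCodePoints is a made-up helper, not how clang computes
columns):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: 1-based column of byte offset Offset within a single
// UTF-8 encoded line, counted in code points rather than bytes.
std::size_t columnInCodePoints(const std::string &Line, std::size_t Offset) {
  std::size_t Col = 1;
  for (std::size_t I = 0; I < Offset && I < Line.size(); ++I) {
    unsigned char C = static_cast<unsigned char>(Line[I]);
    // Continuation bytes look like 10xxxxxx and do not start a character.
    if ((C & 0xC0) != 0x80)
      ++Col;
  }
  return Col;
}
```

For the second "void" in the testcase the byte offset is 10, which gives
column 11 when counting bytes but column 9 when counting code points. Note
that even code points are not the whole story for caret placement, since 風
typically occupies two cells on a terminal.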

-Eli