[cfe-dev] A development plan for enhanced character handling
AlisdairM(public)
public at alisdairm.net
Sun Jun 7 14:39:32 PDT 2009
Follow-up to my stream of earlier emails!
My initial pilot project to get familiar with clang and its development
process has grown somewhat, so I thought I had better outline a project plan
before I get too far.
So first question:
Is there a recognised process for initiating/tracking such projects?
Or is the community small and informal enough that this mailing list is
sufficient?
The plan for my project is to support:
New C++0x Unicode character types and literals
C99 Unicode TR character types
C++0x raw string literals (C++ only, or would Objective-C/Objective-C++ be
interested?)
Support for UCNs in identifiers
Support source files encoded as UTF-8/UTF-16/UTF-32
The last of those points is probably most controversial, and I need clear
guidance on how to do this efficiently.
For source files, the ultimate plan would be to read the file into memory,
then check for a BOM.
If no BOM is present, assume the file is UTF-8 and proceed as today.
Permit but do not require a UTF-8 BOM.
If a UTF-16/32 BOM is present, transcode the file into UTF-8.
Pass this UTF-8 encoded buffer into the preprocessor, as today.
What is not clear is whether I should keep the original source file in
memory to help when reporting diagnostics, or whether we simply keep the
UTF-8 buffer knowing we can transcode text back to the original encoding on
demand if necessary.
We can assume a 1-1 correspondence of characters between these encodings, so
I don't foresee a problem working purely with UTF-8 as today, and
transcoding to the wider formats on demand the few times the original
encoding matters.
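
As an illustration of the transcoding step, a minimal UTF-16 to UTF-8 sketch follows. The function names are hypothetical, the input is assumed to be already in host byte order, and replacing unpaired surrogates with U+FFFD is my assumption; a real implementation would more likely issue a diagnostic. It uses the C++0x char16_t/char32_t types purely for convenience.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Assumes cp <= 0x10FFFF and is not a surrogate.
inline void appendUtf8(std::string &out, char32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Transcode UTF-16 code units (host byte order, BOM already removed) to UTF-8.
inline std::string utf16ToUtf8(const std::u16string &in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
        in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      // Valid surrogate pair: combine into one supplementary code point.
      cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
      ++i;
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
      cp = 0xFFFD; // unpaired surrogate: substitute U+FFFD (an assumption)
    }
    appendUtf8(out, cp);
  }
  return out;
}
```

Because well-formed input maps code point for code point, running the inverse conversion over a slice of the UTF-8 buffer should recover the original text whenever a diagnostic needs it.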
However, in order to ensure success the first task must be to audit the
existing code to be sure that it correctly handles source files with UTF-8
characters outside the basic ASCII set. Any issues here should probably be
my first order of business.
Assuming the code audit passes, my provisional project plan is to implement in
the following stages, with a commit after each stage. The first task is the
simplest, so I will use that to learn the protocols of adequate testing,
documentation etc. for a check-in review.
Suggested implementation sequence:
UTF-8 literals u8"tra-la-la"
    do not accidentally invent u8 character literals
    concatenation with wchar_t L"literal" is diagnosable error
    do not break regular narrow literal concatenation with wchar_t

implement native Unicode char types
    char16_t/char32_t for C++
    _Char16_t/_Char32_t for C99 Unicode TR
    define the 'always Unicode' macro specified in Unicode TR
        __STDC_ISO_10646__ == yyyymmL for year/month of latest spec
        supported
        implies wchar_t is a Unicode encoding

implement Unicode character literals
    single characters only
    char16_t must be from basic character plane

implement Unicode string literals
    involves expanding the 'AnyWide' bool to support
    char/char16_t/char32_t/wchar_t, and maybe u8
    Do not drag in raw literals at this point, combinatorial flag
    explosion
    must define our own heterogeneous concatenation rules
        recommend all conditionally supported conversions diagnosable
        errors for initial check-in

define and implement any support for heterogeneous string concatenation
    note rules:
        char -> wchar_t required
        no u8 -> wchar_t
        u8 -> char32_t conditionally supported
        char -> char32_t required
        char32_t -> wide conditionally supported

implement raw string literals
    [Core issue] must any non-basic-source character be treated as 6
    (or 10) d-chars?
        recommend we issue a diagnostic warning, but accept code
        Beman concerned this encourages non-portable code

Finally implement non-UTF-8 file support
    [[pre-requisite - be sure the Unicode transcoding utilities are all
    implemented and validated]]
    Read/map file into memory
    Check for BOM
        Permit but do not require UTF-8 BOM
    If BOM is missing
        assume UTF-8 encoding
    else
        flag source file as that encoding
        If encoding is not supported
            issue a diagnostic
        If encoding is not UTF-8
            transcode file and pass on transcoded buffer
            Q: do we retain original buffer for diagnostics?
            Q: do we transcode diagnostic messages back to source
               encoding on the fly?
                we have UTF-8 buffer so can round-trip just that source

Done!
(apart from processing bug reports)
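
For the heterogeneous concatenation rules noted in the sequence above, a table-style sketch might look like the following. Everything here is hypothetical: the enum and function names are invented, the rules are simply transcribed from my notes rather than taken from the final standard wording, and I have assumed the rules apply regardless of which literal comes first. Pairs my notes do not cover are reported as Unspecified.

```cpp
// Hypothetical names; the rules are transcribed from the notes above.
enum class LiteralKind { Narrow, UTF8, UTF16, UTF32, Wide };
enum class ConcatRule { Same, Required, ConditionallySupported, Error, Unspecified };

// Classify the concatenation of two adjacent string literals.
// Assumption: the rule does not depend on the order of the two literals.
inline ConcatRule classifyConcat(LiteralKind a, LiteralKind b) {
  if (a == b) return ConcatRule::Same;
  auto is = [a, b](LiteralKind x, LiteralKind y) {
    return (a == x && b == y) || (a == y && b == x);
  };
  if (is(LiteralKind::Narrow, LiteralKind::Wide)) return ConcatRule::Required;
  if (is(LiteralKind::UTF8, LiteralKind::Wide)) return ConcatRule::Error;
  if (is(LiteralKind::UTF8, LiteralKind::UTF32)) return ConcatRule::ConditionallySupported;
  if (is(LiteralKind::Narrow, LiteralKind::UTF32)) return ConcatRule::Required;
  if (is(LiteralKind::UTF32, LiteralKind::Wide)) return ConcatRule::ConditionallySupported;
  return ConcatRule::Unspecified; // pair not covered by the notes
}

// For the initial check-in, treat every conditionally supported combination
// as a diagnosable error. Diagnosing Unspecified pairs too is a conservative
// assumption on my part.
inline bool diagnosedInitially(ConcatRule r) {
  return r != ConcatRule::Same && r != ConcatRule::Required;
}
```

Keeping the classification separate from the decision to diagnose would let the conditionally supported cases be relaxed later without touching the table.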
As for a time-table, this is a fair chunk of work now so I don't plan on
anything beyond u8 literal support this side of the next C++ standards
meeting in Frankfurt, as that will claim the lion's share of my attention
until then.
AlisdairM