[cfe-dev] A development plan for enhanced character handling

Sun Jun 7 14:39:32 PDT 2009

Follow-up to my stream of earlier emails!

My initial pilot project to get familiar with clang and its development
process has grown somewhat, so I thought I had better outline a project plan
before I get too far.

So first question: 
  Is there a recognised process for initiating/tracking such projects?
  Or is the community small and informal enough that this mailing list is
sufficient?

The plan for my project is to support:
	New C++0x Unicode character types and literals
	C99 Unicode TR character types
	C++0x raw string literals (C++ only, or would Objective-C/Objective-C++
be interested?)
	Support for UCNs in identifiers
	Support source files encoded as UTF-8/UTF-16/UTF-32

The last of those points is probably the most controversial, and I need clear
guidance on how to do this efficiently.

For source files, the ultimate plan would be to read the file into memory,
then check for a BOM.

If no BOM is present, assume the file is UTF-8 and proceed as today.
	Permit but do not require a UTF-8 BOM
If a UTF-16/32 BOM is present, transcode the file into UTF-8
	Pass this UTF-8 encoded buffer into the pre-processor, as today
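
To make the BOM check concrete, here is a minimal sketch of how it might
look.  All of the names (SourceEncoding, BOMResult, detectBOM, stripBOM) are
hypothetical and are not existing clang APIs.  One known ambiguity: a UTF-16LE
file whose first character after the BOM is U+0000 would be reported as
UTF-32LE, which I assume is acceptable for source files.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names for illustration; none of these exist in clang.
enum class SourceEncoding { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

struct BOMResult {
  SourceEncoding Encoding; // encoding implied by the BOM; UTF-8 when absent
  std::size_t BOMLength;   // bytes occupied by the BOM (0 when absent)
};

inline BOMResult detectBOM(const std::string &Buf) {
  auto StartsWith = [&Buf](const char *Sig, std::size_t N) {
    return Buf.size() >= N && Buf.compare(0, N, Sig, N) == 0;
  };
  // Test UTF-32 before UTF-16: the UTF-32LE BOM begins with the UTF-16LE BOM.
  if (StartsWith("\xFF\xFE\x00\x00", 4)) return {SourceEncoding::UTF32LE, 4};
  if (StartsWith("\x00\x00\xFE\xFF", 4)) return {SourceEncoding::UTF32BE, 4};
  if (StartsWith("\xFF\xFE", 2)) return {SourceEncoding::UTF16LE, 2};
  if (StartsWith("\xFE\xFF", 2)) return {SourceEncoding::UTF16BE, 2};
  if (StartsWith("\xEF\xBB\xBF", 3)) return {SourceEncoding::UTF8, 3};
  return {SourceEncoding::UTF8, 0}; // no BOM: assume UTF-8, as today
}

// Returns the bytes that follow the BOM (the whole buffer if there is none).
inline std::string stripBOM(const std::string &Buf) {
  BOMResult R = detectBOM(Buf);
  assert(R.BOMLength <= Buf.size());
  return Buf.substr(R.BOMLength);
}
```

A UTF-8 BOM is skipped but never required, so a file with no BOM and a file
with a UTF-8 BOM both reach the pre-processor as the same UTF-8 text.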

What is not clear is whether I should keep the original source file in
memory to help when reporting diagnostics, or whether we simply keep the
UTF-8 buffer knowing we can transcode text back to the original encoding on
demand if necessary.

We can assume a 1-1 correspondence of characters between these encodings, so
I don't foresee a problem working purely with UTF-8 as today, and
transcoding to the wider formats on demand the few times the original
encoding matters.
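
As a rough illustration of the transcoding step, the sketch below converts
UTF-16 code units to UTF-8.  The function names (appendUTF8, utf16ToUTF8) are
hypothetical, and it assumes the input has already been byte-swapped into
native order.  Note that the correspondence is 1-1 per character, not per
byte or code unit, so source offsets differ between the two buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to Out, encoded as UTF-8.
inline void appendUTF8(std::string &Out, std::uint32_t CP) {
  assert(CP <= 0x10FFFF && "not a Unicode code point");
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Transcode native-order UTF-16 to UTF-8, appending to Out.
// Returns false on an unpaired surrogate (Out may then be partially filled).
inline bool utf16ToUTF8(const std::u16string &In, std::string &Out) {
  for (std::size_t I = 0; I < In.size(); ++I) {
    std::uint32_t CP = In[I];
    if (CP >= 0xD800 && CP <= 0xDBFF) {        // high surrogate
      if (I + 1 >= In.size()) return false;
      std::uint32_t Lo = In[I + 1];
      if (Lo < 0xDC00 || Lo > 0xDFFF) return false;
      CP = 0x10000 + ((CP - 0xD800) << 10) + (Lo - 0xDC00);
      ++I;                                     // consume the low surrogate
    } else if (CP >= 0xDC00 && CP <= 0xDFFF) { // stray low surrogate
      return false;
    }
    appendUTF8(Out, CP);
  }
  return true;
}
```

The reverse direction is the same arithmetic run backwards, which is what
would make transcoding back to the original encoding on demand cheap.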

However, in order to ensure success the first task must be to audit the
existing code to be sure that it correctly handles source files with UTF-8
characters outside the basic ASCII set.  Any issues here should probably be
my first order of business.

Assuming the code audit passes, my provisional project plan is to implement
the work in the following stages, with a commit after each stage.  The first
task is the
simplest, so I will use that to learn the protocols of adequate testing,
documentation etc. for a check-in review.

Suggested implementation sequence:
	UTF-8 literals u8"tra-la-la"
		do not accidentally invent u8 character literals
		concatenation with wchar_t L"literal" is a diagnosable error
		do not break regular narrow literal concatenation with
wchar_t

	implement native Unicode char types
		char16_t/char32_t for C++
		_Char16_t/_Char32_t for C99 Unicode TR
		define the 'always Unicode' macro specified in Unicode TR
			__STDC_ISO_10646__ == yyyymmL for year/month of
latest spec supported
			implies wchar_t is a Unicode encoding

	implement Unicode character literals
		single characters only
		char16_t must be from the Basic Multilingual Plane (one UTF-16
code unit)

	implement Unicode string literals
		involves expanding the 'AnyWide' bool to support
char/char16_t/char32_t/wchar_t, and maybe u8
		Do not drag in raw literals at this point, combinatorial
flag explosion
		must define our own heterogeneous concatenation rules
			recommend all conditionally supported conversions
diagnosable errors for initial check-in

	define and implement any support for heterogeneous string
concatenation
		note rules:
			char -> wchar_t required
			no u8 -> wchar_t
			u8 -> char32_t conditionally supported
			char -> char32_t required
			char32_t -> wide conditionally supported

	implement raw string literals
		[Core issue] must any non-basic-source character be treated
as 6 (or 10) d-chars?
			recommend we issue a diagnostic warning, but accept
code
			Beman concerned this encourages non-portable code

	Finally implement non-UTF-8 file support
		[[pre-requisite - be sure the Unicode transcoding utilities
are all implemented and validated]]
		Read/map file into memory
		Check for BOM
		Permit but do not require UTF-8 BOM
		If BOM is missing
			assume UTF-8 encoding
		else
			flag source file as that encoding
			If encoding is not supported
				issue a diagnostic
			If encoding is not UTF-8
				transcode file and pass on transcoded buffer
				Q: do we retain original buffer for
diagnostics?
				Q: do we transcode diagnostic messages back
to source encoding on the fly?
					we have UTF-8 buffer so can
round-trip just that source

	Done!
		(apart from processing bug reports)
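
To illustrate the string-literal stages above, here is one way the 'AnyWide'
bool might grow into an enumeration, with the concatenation rules from my
notes encoded as a lookup.  Everything here (LiteralKind, ConcatStatus,
concatKinds, acceptedInFirstCheckIn) is a hypothetical name, not existing
clang code.  It assumes the rules are symmetric, that a narrow literal adopts
the kind of any prefixed neighbour (my notes list only the wchar_t and
char32_t cases), and that combinations not listed are errors.

```cpp
#include <cassert>

// Hypothetical replacement for a single 'AnyWide' flag.
enum class LiteralKind { Narrow, UTF8, UTF16, UTF32, Wide };
enum class ConcatStatus { Required, ConditionallySupported, IllFormed };

struct ConcatResult {
  ConcatStatus Status;
  LiteralKind Kind; // kind of the concatenated literal; unused if IllFormed
};

inline ConcatResult concatKinds(LiteralKind A, LiteralKind B) {
  using K = LiteralKind;
  using S = ConcatStatus;
  if (A == B) return {S::Required, A};
  // Assumption: a narrow literal takes on the kind of its neighbour.
  if (A == K::Narrow) return {S::Required, B};
  if (B == K::Narrow) return {S::Required, A};
  auto IsPair = [A, B](K X, K Y) {
    return (A == X && B == Y) || (A == Y && B == X);
  };
  // u8 -> char32_t and char32_t -> wide are conditionally supported.
  if (IsPair(K::UTF8, K::UTF32)) return {S::ConditionallySupported, K::UTF32};
  if (IsPair(K::UTF32, K::Wide)) return {S::ConditionallySupported, K::Wide};
  // Includes u8 -> wchar_t, which must be diagnosed.
  return {S::IllFormed, K::Narrow};
}

// For the initial check-in, treat conditionally supported cases as errors.
inline bool acceptedInFirstCheckIn(const ConcatResult &R) {
  return R.Status == ConcatStatus::Required;
}
```

Keeping the three-way status separate from the first-check-in policy means
the conditionally supported conversions could be enabled later without
touching the rule table.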

As for a timetable, this is now a fair chunk of work, so I don't plan on
anything beyond u8 literal support this side of the next C++ standards
meeting in Frankfurt, as that will claim the lion's share of my attention
until then.

AlisdairM