[cfe-dev] Handling Unicode : code points vs. code units

Eli Friedman eli.friedman at gmail.com
Mon Jun 15 12:08:29 PDT 2009


On Mon, Jun 15, 2009 at 8:38 AM, AlisdairM(public)<public at alisdairm.net> wrote:
> However, one place it really does matter is reporting column numbers in diagnostics.  We need to report column numbers in terms of characters, or code positions, rather than code units as today.

That's not quite right; people don't consider column numbers in terms
of either code positions or code units; humans consider columns in
terms of the actual rendering.  So, for example, combining characters
shouldn't usually be counted towards the column number.
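
To make the distinction concrete, here is a rough sketch (Python for brevity; none of this is Clang code, and the helper names are made up for illustration). It counts the same short string three ways: UTF-8 code units, code points, and code points with combining marks skipped. The last count is only an approximation of "what a human sees", assuming that a non-zero canonical combining class is a good enough test for a combining character.

```python
import unicodedata


def count_code_units_utf8(text: str) -> int:
    # Number of UTF-8 code units (bytes) needed to encode the text.
    return len(text.encode("utf-8"))


def count_code_points(text: str) -> int:
    # A Python str is a sequence of code points, so len() counts them.
    return len(text)


def count_non_combining(text: str) -> int:
    # Assumption: a code point with a non-zero canonical combining class
    # attaches to the previous character and takes no column of its own.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
sample = "e\u0301x"
print(count_code_units_utf8(sample))  # 4
print(count_code_points(sample))      # 3
print(count_non_combining(sample))    # 2
```

A reader looking at that text sees two characters, which matches neither the code unit count nor the code point count.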

> In order to clarify when we are dealing explicitly with code positions, I propose to introduce a new class to describe such offsets rather than using a simple integer variable.  Thanks to inlining this class should have performance equivalent to a regular integer apart from in a couple of use cases.  The majority of the codebase should be unaffected, continuing to work in terms of byte offsets into a buffer.  However, whenever we need to render a source-file column number we should go via this new type.  The opaque type should catch many issues with code point vs. code unit at compile time rather than runtime, although I don't have an exhaustive list of APIs that should be updated yet, so we must learn to be alert for APIs using the wrong types as well.

This seems like overkill; there aren't enough places in the code that
care about column numbers.  It should be sufficient to audit the
callers to getColumnNumber() and friends in SourceManager.
Essentially, there are three interesting notions of a column number:
the column number in bytes (what that API returns now), the logical
column number used in diagnostics, and the column number in terms of
cells in a monospace font.  I'm not sure if the latter two quantities
should be separate, though.  If we have APIs for the latter two
quantities, I can't think of anything that would need to do
computations on them, or would want the column number in bytes.
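
As a rough illustration of the three notions (again a Python sketch, not a proposal for the actual SourceManager interface; the function names are hypothetical), each helper below takes a UTF-8 encoded line and a byte offset into it and returns a 1-based column. The assumptions are that the offset falls on a character boundary, that combining marks take no column, and that East Asian wide and fullwidth characters occupy two cells in a monospace font.

```python
import unicodedata


def byte_column(line: bytes, byte_offset: int) -> int:
    # Column in bytes: one column per UTF-8 code unit before the offset.
    return byte_offset + 1


def logical_column(line: bytes, byte_offset: int) -> int:
    # Column for diagnostics: one per code point, skipping combining marks.
    # Assumes byte_offset is on a character boundary, else decode() raises.
    prefix = line[:byte_offset].decode("utf-8")
    return 1 + sum(1 for ch in prefix if unicodedata.combining(ch) == 0)


def cell_column(line: bytes, byte_offset: int) -> int:
    # Column in monospace cells: wide/fullwidth characters count as two.
    prefix = line[:byte_offset].decode("utf-8")
    cells = 0
    for ch in prefix:
        if unicodedata.combining(ch) != 0:
            continue  # assumed to render on top of the previous character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        cells += 2 if wide else 1
    return cells + 1


# Source line containing a wide character and a combining accent.
sample_text = 'x = "\u65e5e\u0301";'
sample_line = sample_text.encode("utf-8")
# Byte offset of the closing quote.
sample_offset = len('x = "\u65e5e\u0301'.encode("utf-8"))

print(byte_column(sample_line, sample_offset))     # 12
print(logical_column(sample_line, sample_offset))  # 8
print(cell_column(sample_line, sample_offset))     # 9
```

For pure ASCII lines all three functions agree, so only lines with non-ASCII text would show a difference between the APIs.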

-Eli



