[cfe-dev] Handling Unicode : code points vs. code units

AlisdairM(public) public at alisdairm.net
Mon Jun 15 08:38:16 PDT 2009


Hopefully the last question before I start posting some patches on this!

I think the big problem we face in correctly handling extended characters in UTF-8 (and, to a lesser extent, UCNs in identifiers) is that much, if not all, of the current code assumes that code points and code units are the same thing.

In Unicode terms, a character is identified by a code point: a 21-bit value designating a single character from the Unicode character set.  In UTF-32 each code point maps directly to one code unit, but in UTF-8 and UTF-16 a single code point may require multiple 'code units'.

The effect shows up any time a character outside the basic 7-bit ASCII range appears in a string literal or a comment: the column numbers for any diagnostic on that line will be wrong beyond that character.
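
To make the distinction concrete, here is a minimal standalone example.  The literal spells out the UTF-8 encoding of U+00E9 ('é') as explicit bytes so it does not depend on the encoding of this mail, and is split so the trailing 'b' is not absorbed into the hex escape:

#include <cassert>
#include <cstring>

int main() {
   // "aéb" written with explicit UTF-8 bytes: U+00E9 encodes as 0xC3 0xA9.
   char const * line = "a\xC3\xA9" "b";

   assert( std::strlen( line ) == 4 );  // 4 code units (bytes)...
   // ...but only 3 code points, so a diagnostic pointing at 'b' reports
   // column 4 if we count code units, column 3 if we count characters.
   return 0;
}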

Fundamentally, if we want to get this 'right' we should stop talking about characters and deal exclusively with character sequences.  Once we are dealing with UTF-8 we can no longer assume a 'character' will fit into a single 'char' variable.  I am not yet sure how pervasive such a change would be, as AFAICT most functions are already passing around pointers (in)to text buffers.  The difference may be more in how we approach the code than in the source itself.

However, one place it really does matter is reporting column numbers in diagnostics.  We need to report column numbers in terms of characters, or code positions, rather than code units as we do today.

In order to clarify when we are dealing explicitly with code positions, I propose to introduce a new class to describe such offsets rather than using a simple integer variable.  Thanks to inlining, this class should have performance equivalent to a regular integer except in a couple of use cases.  The majority of the codebase should be unaffected, continuing to work in terms of byte offsets into a buffer.  However, whenever we need to render a source-file column number we should go via this new type.  The opaque type should catch many code point vs. code unit issues at compile time rather than run time, although I don't have an exhaustive list of APIs that should be updated yet, so we must remain alert for APIs using the wrong types as well.


A quick sketch of the class would look a little (or a lot!) like:

#include <cstddef>

struct CharacterPos {
   // We must declare a default constructor as there is another
   // user-declared constructor in this class.
   // Choose to always initialize the position member. This means
   // that CharacterPos is not a POD class.  In C++0x we might
   // consider using = default, which might leave position
   // uninitialized, although 0x is more fine-grained in its
   // usage of PODs and trivial operations, so explicit initialization
   // is probably still the best choice.
   CharacterPos() : position() {}

   // The remaining special members are left implicit in order to
   // preserve triviality.  In C++0x would explicitly default them.
//   CharacterPos( CharacterPos const & rhs) = default;
//   ~CharacterPos() = default;
//   CharacterPos & operator=( CharacterPos const & rhs ) = default;

   // Constructor iterates string from start to offset
   // counting UTF-8 characters, i.e. code points.
   // Throws an exception if str is not a valid UTF-8 encoding.
   CharacterPos( char const * str, std::size_t offset );

   // Iterates str, returning a pointer to the initial code unit
   // of the UTF-8 character at 'position'.
   // Throws an exception if str is not a valid UTF-8 encoding.
   char const * Offset( char const * str ) const;

   CharacterPos & operator+=( CharacterPos rhs ) {
      position += rhs.position;
      return *this;
   }

   CharacterPos & operator-=( CharacterPos rhs ) {
      position -= rhs.position;
      return *this;
   }

   std::ptrdiff_t operator-( CharacterPos rhs ) const {
      return position - rhs.position;
   }

   bool operator==( CharacterPos rhs ) const { return position == rhs.position; }
   bool operator<( CharacterPos rhs ) const { return position < rhs.position; }
   bool operator<=( CharacterPos rhs ) const { return position <= rhs.position; }
   bool operator!=( CharacterPos rhs ) const { return !(*this == rhs); }
   bool operator>( CharacterPos rhs ) const { return rhs < *this; }
   bool operator>=( CharacterPos rhs ) const { return rhs <= *this; }


private:
   std::size_t position;
};

CharacterPos operator+( CharacterPos lhs, CharacterPos rhs ) {
   return lhs += rhs;
}

char const * operator+( char const * lhs, CharacterPos rhs ) {
   return rhs.Offset( lhs );
}
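
For context, a quick usage sketch; the helper function and its parameters are hypothetical, but they show the intended flow of byte offsets going in and character columns coming out:

#include <cstddef>
#include <cstdio>

// Hypothetical diagnostic helper: 'lineStart' points at the first byte
// of the line, 'byteOffset' is the code unit offset of the token being
// diagnosed.  The column is reported in code points, 1-based.
void ReportColumn( char const * lineStart, std::size_t byteOffset ) {
   CharacterPos col( lineStart, byteOffset );     // linear scan over the prefix
   std::ptrdiff_t column = (col - CharacterPos()) + 1;
   std::printf( "column %ld\n", static_cast<long>( column ) );
}

Because CharacterPos has no implicit conversion from an integer, accidentally passing a raw byte offset where a CharacterPos is expected simply fails to compile.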


Note that two operations in here have linear complexity rather than constant:
   CharacterPos( char const * str, std::size_t offset );
   char const * Offset( char const * str ) const;

These are also the important APIs that define why the class exists.
In all other ways it should be a reasonable arithmetic type.
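
For what it's worth, a minimal sketch of how those two operations might be implemented.  It assumes the input is already well-formed UTF-8 and checks only for premature NUL termination, rather than the full validation the comments above call for:

#include <cstddef>
#include <stdexcept>

// A byte is a UTF-8 continuation byte if it has the form 10xxxxxx.
static bool IsContinuation( unsigned char c ) {
   return (c & 0xC0) == 0x80;
}

CharacterPos::CharacterPos( char const * str, std::size_t offset )
   : position() {
   // Count the code points that start within the first 'offset' bytes.
   for (std::size_t i = 0; i != offset; ++i) {
      if (str[i] == '\0')
         throw std::out_of_range( "offset past end of buffer" );
      if (!IsContinuation( static_cast<unsigned char>( str[i] ) ))
         ++position;
   }
}

char const * CharacterPos::Offset( char const * str ) const {
   // Step over 'position' whole code points: one lead byte plus any
   // continuation bytes each time.
   for (std::size_t n = 0; n != position; ++n) {
      if (*str == '\0')
         throw std::out_of_range( "position past end of buffer" );
      ++str;
      while (IsContinuation( static_cast<unsigned char>( *str ) ))
         ++str;
   }
   return str;
}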

I am opting for pass-by-value rather than pass-by-reference-to-const as that is typically more efficient for small data types, although obviously I have no performance measurements to back that up yet.

Also note that these same two APIs have to deal with badly encoded UTF-8 streams, and indicate failure by throwing an exception.  I have informally picked up that LLVM/Clang prefers to avoid exceptions as an error-reporting mechanism.  If this is likely to be an issue I would appreciate guidance on an alternative error-reporting mechanism for those same APIs - especially for the failing constructor.
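
For illustration, one non-throwing shape for the counting step might look like the following.  This is purely a sketch of the bool-plus-out-parameter style, not an existing Clang convention, and the check does not reject overlong encodings or other finer points of UTF-8 validity:

#include <cstddef>

// Count the code points in the first 'offset' bytes of 'str', reporting
// malformed UTF-8 through the return value instead of an exception.
// 'result' is written only on success.
bool CountCodePoints( char const * str, std::size_t offset,
                      std::size_t & result ) {
   std::size_t count = 0;
   std::size_t pending = 0;                 // continuation bytes still expected
   for (std::size_t i = 0; i != offset; ++i) {
      unsigned char c = static_cast<unsigned char>( str[i] );
      if ((c & 0xC0) == 0x80) {             // continuation byte
         if (pending == 0) return false;    // stray continuation byte
         --pending;
      } else {
         if (pending != 0) return false;    // previous sequence truncated
         if (c >= 0xF8)      return false;  // not a legal UTF-8 lead byte
         else if (c >= 0xF0) pending = 3;   // 4-byte lead
         else if (c >= 0xE0) pending = 2;   // 3-byte lead
         else if (c >= 0xC0) pending = 1;   // 2-byte lead
         // otherwise a single-byte (ASCII) character, nothing pending
         ++count;
      }
   }
   if (pending != 0) return false;          // sequence cut off at 'offset'
   result = count;
   return true;
}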

AlisdairM






