[cfe-dev] Handling Unicode : code points vs. code units

AlisdairM(public) public at alisdairm.net
Tue Jun 16 15:16:40 PDT 2009


Thanks - two extremely useful links.

However, I don't think I'm going to go into that much detail just yet.  Specifically, we are writing a compiler, not a text rendering engine - getting Unicode correct to this level comes close to the typography realm, and can almost certainly be left as a display issue.

So my plan is to return a 'column number' in the diagnostic messages that refers to the Xth code point on that line, rather than the Xth code unit as we do today.  This should be a big improvement, but as you note, it will still be off by some columns when viewing our messages in a console window or dumped to a UTF-8 text file.  However, we *do* have correct information for IDEs to parse and display accordingly.  Most importantly, this number will be correct if we are round-tripping with other Unicode encodings, such as reading UTF-16.
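To make the distinction concrete, here is a minimal sketch (my own illustration, not existing Clang code) of deriving a code-point count from a byte offset, relying on the fact that UTF-8 continuation bytes always have the form 10xxxxxx:

```cpp
#include <cstddef>

// Count the UTF-8 code points in the first 'byteOffset' code units of
// 'str'.  Every byte except a continuation byte (top two bits == 10)
// starts a new code point.  Assumes the buffer is already valid UTF-8;
// validation is a separate concern.
std::size_t countCodePoints(char const *str, std::size_t byteOffset) {
  std::size_t points = 0;
  for (std::size_t i = 0; i != byteOffset; ++i)
    if ((static_cast<unsigned char>(str[i]) & 0xC0) != 0x80)
      ++points;
  return points;
}
```

For plain ASCII the two counts agree - which is exactly why the bug hides - but for a line containing, say, 'é' (two code units in UTF-8), the reported column drifts by one per such character.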

The main reason I want to duck this is that it is going to involve some rather large lookup tables covering potentially several hundred thousand Unicode code points.  I'm not prepared to validate the data in such a table, and I'm not sure about implementing an efficient lookup mechanism that avoids impacting performance - this is something much more appropriate for the bit-twiddlers than me (efficiency is something we all like to proclaim, but some folks are actually really good at it too - I'm not one of them!) :(

And that brings me to the second issue I am going to duck.
I had originally planned on validating identifiers against ISO/IEC TR 10176, the technical report listing which code points should be usable in identifiers (as opposed to punctuation and other characters).  Again, this lookup is going to be costly in a performance-critical part of the code, so I will leave that task to someone with the right skill set.
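For whoever picks up the identifier-validation task: the usual shape is a binary search over sorted code-point ranges rather than a flat per-code-point table.  The ranges below are a tiny illustrative sample of my own, not the real TR 10176 Annex A data:

```cpp
#include <algorithm>
#include <cstdint>

// An inclusive range of Unicode code points.
struct CodePointRange { std::uint32_t first, last; };

// A tiny, incomplete sample of identifier-character ranges (ASCII
// letters, underscore, a Latin-1 block, a Greek block), sorted by
// 'first', purely to show the lookup shape.  The real table would be
// generated from the TR 10176 Annex A data.
static const CodePointRange IdentifierRanges[] = {
  {0x0041, 0x005A}, {0x005F, 0x005F}, {0x0061, 0x007A},
  {0x00C0, 0x00D6}, {0x0391, 0x03A9},
};

bool isIdentifierCodePoint(std::uint32_t cp) {
  // Find the first range whose 'last' is >= cp, then check that it
  // also starts at or before cp.
  auto it = std::lower_bound(std::begin(IdentifierRanges),
                             std::end(IdentifierRanges), cp,
                             [](CodePointRange const &r, std::uint32_t c) {
                               return r.last < c;
                             });
  return it != std::end(IdentifierRanges) && it->first <= cp;
}
```

With the full table this is O(log n) in the number of ranges (a few hundred, not a few hundred thousand entries), which is why the data size rather than the lookup is the hard part.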

So I'm self-selecting my problem down to reporting badly encoded UTF-8, and reporting the number of code points, rather than code units, in external messages.  Internally all code will continue to work in code units, as the pointer arithmetic just plain works!
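For the badly-encoded-UTF-8 reporting, a purely structural validity check could look like the following sketch (again my own, not existing code; it rejects stray continuation bytes, the impossible leads 0xC0-0xC1 and 0xF5-0xFF, and truncated sequences, but deliberately skips the finer rejections such as overlong three/four-byte forms and surrogate code points):

```cpp
#include <cstddef>

// Returns true if the 'len' bytes at 'str' have valid UTF-8 structure.
bool isValidUTF8(unsigned char const *str, std::size_t len) {
  for (std::size_t i = 0; i != len; ) {
    unsigned char lead = str[i++];
    std::size_t trail;
    if (lead < 0x80)       trail = 0;      // single byte (ASCII)
    else if (lead < 0xC2)  return false;   // continuation byte, or overlong lead
    else if (lead < 0xE0)  trail = 1;      // two-byte sequence
    else if (lead < 0xF0)  trail = 2;      // three-byte sequence
    else if (lead < 0xF5)  trail = 3;      // four-byte sequence
    else                   return false;   // would encode beyond U+10FFFF
    while (trail--) {
      // Each trailing byte must exist and have the form 10xxxxxx.
      if (i == len || (str[i] & 0xC0) != 0x80)
        return false;
      ++i;
    }
  }
  return true;
}
```

A real diagnostic would of course want to report *where* the encoding breaks rather than a bare yes/no, but the lead-byte classification above is the core of it.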

I shall try to leave // FIXME tags in the appropriate places for someone to follow and provide the more subtle Unicode semantics.

AlisdairM

> -----Original Message-----
> From: Henrique Almeida [mailto:hdante at gmail.com]
> Sent: 15 June 2009 23:47
> To: AlisdairM(public)
> Cc: cfe-dev at cs.uiuc.edu
> Subject: Re: [cfe-dev] Handling Unicode : code points vs. code units
> 
>  Last post (today), I promise: (I didn't fully understand that but) it
> seems that, given any two adjacent code points in the source file,
> it's possible to determine if there's a grapheme break between them:
> 
>  http://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakTest.html
> 
> 2009/6/15 Henrique Almeida <hdante at gmail.com>:
> >  I've found a default algorithm for finding character boundaries
> > (called graphemes in Unicode). The most general version of the
> > algorithm works like this: a grapheme is either "CRLF" or is composed
> > of zero or more "Prepend" characters, followed by either a
> > "Hangul syllable" or any other code point that is not a control,
> > followed by a sequence of "Grapheme_Extend" or "Spacing_Mark" code
> > points. So, every "character" in the editor should have this format.
> > Because of "Prepend" characters, it's not possible to use the
> > increment approach directly. In the worst case, a state machine would
> > do the job.
> >
> >  http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> >
> > 2009/6/15 Henrique Almeida <hdante at gmail.com>:
> >>  (I don't know how the code is structured, I'm answering from a
> >> conceptual point of view)
> >>
> >>  You could still use an integer for keeping track of the column
> >> number if it's possible to change the "increment" operation. IIRC
> >> Unicode has enough information to decide if the next code point will
> >> allocate a new graphical space or not (and combining code points are
> >> stored after the allocating code point). The new class does exactly
> >> the same, but the "increment" approach would avoid creating a new
> >> class.
> >>
> >>  PS: This wouldn't handle text that changes direction in the middle
> >> of the line, but I think editors don't deal with that either (gnome
> >> editor, for example, starts counting in reverse if I write some text
> >> from right to left). So, both line counts would be equally wrong. :-)
> >>
> >> 2009/6/15 AlisdairM(public) <public at alisdairm.net>:
> >>> Hopefully the last question before I start posting some patches on
> >>> this!
> >>>
> >>> I think the big problem we face in correctly handling extended
> >>> characters in UTF-8 (and to a lesser extent UCNs in identifiers) is
> >>> that much/all of the current code assumes that code points and code
> >>> units are the same.
> >>>
> >>> In Unicode terms, a character is called a code point, which
> >>> corresponds to the 21-bit value representing a single character
> >>> from the Unicode character set.  This is a direct mapping in UTF-32
> >>> encodings, but may require multiple 'code units' in UTF-8/UTF-16.
> >>>
> >>> The effect shows up any time a character outside the basic 7-bit
> >>> ASCII range appears in a string literal or a comment.  The column
> >>> numbers for any diagnostic on that line will be wrong beyond that
> >>> character.
> >>>
> >>> Fundamentally, if we want to get this 'right' we should stop
> >>> talking about characters and deal exclusively with character
> >>> sequences.  Once we are dealing with UTF-8 we can no longer assume
> >>> a 'character' will fit into a single 'char' variable.  I am not yet
> >>> sure how pervasive such a change would be, as AFAICT most functions
> >>> are already passing around pointers (in)to text buffers.  The
> >>> difference may be more in how we approach the code than in the
> >>> source itself.
> >>>
> >>> However, one place it really does matter is reporting column
> >>> numbers in diagnostics.  We need to report column numbers in terms
> >>> of characters, or code positions, rather than code units as today.
> >>>
> >>> In order to clarify when we are dealing explicitly with code
> >>> positions, I propose to introduce a new class to describe such
> >>> offsets rather than using a simple integer variable.  Thanks to
> >>> inlining this class should have performance equivalent to a regular
> >>> integer apart from in a couple of use cases.  The majority of the
> >>> codebase should be unaffected, continuing to work in terms of byte
> >>> offsets into a buffer.  However, whenever we need to render a
> >>> source-file column number we should go via this new type.  The
> >>> opaque type should catch many issues with code point vs. code unit
> >>> at compile time rather than at runtime, although I don't have an
> >>> exhaustive list of APIs that should be updated yet, so we must stay
> >>> alert for APIs using the wrong types as well.
> >>>
> >>>
> >>> A quick sketch of the class would look a little (or a lot!) like:
> >>>
> >>> #include <cstddef>
> >>>
> >>> struct CharacterPos {
> >>>   // We must declare a default constructor as there is another
> >>>   // user-declared constructor in this class.
> >>>   // Choose to always initialize the position member.  This means
> >>>   // that CharacterPos is not a POD class.  In C++0x we might
> >>>   // consider using = default, which might leave position
> >>>   // uninitialized, although 0x is more fine-grained in its
> >>>   // usage of PODs and trivial operations, so explicit
> >>>   // initialization is probably still the best choice.
> >>>   CharacterPos() : position() {}
> >>>
> >>>   // The remaining special members are left implicit in order to
> >>>   // preserve triviality.  In C++0x we would explicitly default them.
> >>> //   CharacterPos( CharacterPos const & rhs ) = default;
> >>> //   ~CharacterPos() = default;
> >>> //   CharacterPos & operator=( CharacterPos const & rhs ) = default;
> >>>
> >>>   // Constructor iterates str from its start to 'offset',
> >>>   // counting UTF-8 characters, i.e. code points.
> >>>   // Throws an exception if str is not a valid UTF-8 encoding.
> >>>   CharacterPos( char const * str, std::size_t offset );
> >>>
> >>>   // Iterates str, returning a pointer to the initial code unit
> >>>   // of the UTF-8 character at 'position'.
> >>>   // Throws an exception if str is not a valid UTF-8 encoding.
> >>>   char const * Offset( char const * str ) const;
> >>>
> >>>   CharacterPos & operator+=( CharacterPos rhs ) {
> >>>      position += rhs.position;
> >>>      return *this;
> >>>   }
> >>>
> >>>   CharacterPos & operator-=( CharacterPos rhs ) {
> >>>      position -= rhs.position;
> >>>      return *this;
> >>>   }
> >>>
> >>>   std::ptrdiff_t operator-( CharacterPos rhs ) const {
> >>>      return position - rhs.position;
> >>>   }
> >>>
> >>>  bool operator==( CharacterPos rhs ) const { return position == rhs.position; }
> >>>  bool operator<( CharacterPos rhs ) const { return position < rhs.position; }
> >>>  bool operator<=( CharacterPos rhs ) const { return position <= rhs.position; }
> >>>  bool operator!=( CharacterPos rhs ) const { return !(*this == rhs); }
> >>>  bool operator>( CharacterPos rhs ) const { return rhs < *this; }
> >>>  bool operator>=( CharacterPos rhs ) const { return rhs <= *this; }
> >>>
> >>>
> >>> private:
> >>>   std::size_t position;
> >>> };
> >>>
> >>> CharacterPos operator+( CharacterPos lhs, CharacterPos rhs ) {
> >>>   return lhs += rhs;
> >>> }
> >>>
> >>> char const * operator+( char const * lhs, CharacterPos rhs ) {
> >>>   return rhs.Offset( lhs );
> >>> }
> >>>
> >>>
> >>> Note that two operations in here have linear complexity rather
> >>> than constant:
> >>>   CharacterPos( char const * str, std::size_t offset );
> >>>   char const * Offset( char const * str ) const;
> >>>
> >>> These are also the important APIs that define why the class exists.
> >>> In all other ways it should be a reasonable arithmetic type.
> >>>
> >>> I am opting for pass-by-value rather than pass-by-reference-to-const
> >>> as that is typically more efficient for small data types, although
> >>> obviously I have no performance measurements to back that up yet.
> >>>
> >>> Also note that these same two APIs have to deal with badly encoded
> >>> UTF-8 streams, and indicate failure by throwing an exception.  I
> >>> have informally picked up that LLVM/Clang prefer to avoid exceptions
> >>> as an error reporting mechanism.  If this is likely to be an issue I
> >>> would appreciate guidance on an alternate error reporting mechanism
> >>> for those same APIs - especially the failed constructor.
> >>>
> >>> AlisdairM
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> cfe-dev mailing list
> >>> cfe-dev at cs.uiuc.edu
> >>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
> >>>
> >>
> >>
> >>
> >> --
> >>  Henrique Dante de Almeida
> >>  hdante at gmail.com
> >>
> >
> >
> >
> > --
> >  Henrique Dante de Almeida
> >  hdante at gmail.com
> >
> 
> 
> 
> --
>  Henrique Dante de Almeida
>  hdante at gmail.com





