[PATCH] UTF-8 support for clang-format.

James Dennett jdennett at googlers.com
Tue Jun 4 21:11:41 PDT 2013


On Tue, Jun 4, 2013 at 8:34 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
>
>   Are code points the correct thing to count here?  There are combining characters, there are double-width characters.  I think that the CJK number tests pass for the wrong reason -- the width of characters is counted as 1, not as 2, as it would be displayed on the terminal.

In my experience, there are two reasonable choices, and many bad ones.
The reasonable ones are to count bytes, or to count code points.
Attempting to count characters is doomed, and attempting to determine
the column is something that can only really be done robustly by a
program that renders text.

Of the two reasonable choices (bytes or code points), and assuming
Unicode, code points have the advantage of being independent of the
transformation format (UTF-8, UTF-16, or UTF-32).  Compared to
characters, they have the virtue of being unambiguously defined.
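
To make the difference concrete, here is a rough sketch (not taken from
the patch) of counting code points in two transformation formats.  The
function names countCodePointsUtf8 and countCodePointsUtf16 are
hypothetical, and both assume well-formed input; neither validates the
encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in UTF-8 by counting every byte
// that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePointsUtf8(const std::string &s) {
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xC0) != 0x80)
      ++n;
  return n;
}

// Hypothetical helper: count code points in UTF-16 by skipping the low
// (trailing) surrogate of each surrogate pair.
// Assumes the input is well-formed UTF-16.
std::size_t countCodePointsUtf16(const std::u16string &s) {
  std::size_t n = 0;
  for (char16_t c : s)
    if (c < 0xDC00 || c > 0xDFFF)
      ++n;
  return n;
}
```

For U+4E2D the UTF-8 form "\xE4\xB8\xAD" is three bytes and the UTF-16
form is one code unit, yet both functions report one code point.  For
"e" followed by U+0301 (combining acute accent) both report two code
points, even though a reader would likely see a single character, which
is why I would not try to count characters.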

-- James




More information about the cfe-commits mailing list