[cfe-dev] UTF-8 vs. UTF-16 code locations

Joachim Durchholz via cfe-dev cfe-dev at lists.llvm.org
Sun Jan 24 06:55:46 PST 2016


On 24.01.2016 at 14:37, Milian Wolff via cfe-dev wrote:
> What is the suggested way of handling this situation? Is there maybe prior art
> somewhere to efficiently translate between UTF-8/UTF-16 code locations that I
> could study?

What Jörg said.

You may want to look at ICU, see http://site.icu-project.org/
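
If pulling in ICU is too heavy for the plain offset translation, here is a
minimal sketch of how it could be done by hand. The function names are
hypothetical (they come from neither clang nor ICU), and the code assumes the
line is valid UTF-8 and that the offsets fall on code point boundaries. It
relies on one fact: a code point encoded as four UTF-8 bytes needs two UTF-16
code units (a surrogate pair), every other code point needs exactly one.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: translate a UTF-8 byte offset within `line` into a
// UTF-16 code unit offset. Assumes valid UTF-8 input.
std::size_t utf8ToUtf16Offset(std::string_view line, std::size_t byteOffset)
{
    if (byteOffset > line.size())
        byteOffset = line.size();
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        const unsigned char c = static_cast<unsigned char>(line[i]);
        if ((c & 0xC0) == 0x80)
            continue; // continuation byte, its lead byte was already counted
        // A lead byte >= 0xF0 starts a 4-byte sequence (code point >= U+10000),
        // which becomes a surrogate pair in UTF-16.
        units += (c >= 0xF0) ? 2 : 1;
    }
    return units;
}

// Hypothetical helper: the reverse direction, UTF-16 code unit offset to
// UTF-8 byte offset. Same assumptions as above.
std::size_t utf16ToUtf8Offset(std::string_view line, std::size_t unitOffset)
{
    std::size_t units = 0;
    std::size_t i = 0;
    while (i < line.size() && units < unitOffset) {
        const unsigned char c = static_cast<unsigned char>(line[i]);
        std::size_t len = 1;           // ASCII
        if (c >= 0xF0)      len = 4;   // code point >= U+10000
        else if (c >= 0xE0) len = 3;
        else if (c >= 0xC0) len = 2;
        units += (len == 4) ? 2 : 1;
        i += len;
    }
    return i < line.size() ? i : line.size(); // clamp on truncated input
}
```

Both directions are linear in the line length, so for large files you would
probably want to do this per line and cache the result rather than rescan
from the start of the file each time.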


Keep in mind that Unicode is more than just dealing with UTF-8 encoding. 
There is also:
* Multiple characters that the user expects to count as one. Czechs
consider ch a single letter, so they will expect the cursor to skip
over ch as if it were a single character. Maybe the Czechs don't mind if
you don't get it right for them, I don't know - we're in the murky
waters of cultural expectations here.
* Ligatures, i.e. single glyphs that are just two connected characters.
You want to assume a character break inside the glyph.
* Right-to-left scripts. Particularly nasty if RTL and LTR are mixed:
you will end up having to highlight discontinuous screen areas.
* Scripts where letters are written around the next letter. I forgot 
which script has this, it might be Devanagari.

Line wrapping brings more fun of that kind.

You'll need to decide how much of this is relevant for your user
base.


