[cfe-dev] UTF-8 vs. UTF-16 code locations
Joachim Durchholz via cfe-dev
cfe-dev at lists.llvm.org
Sun Jan 24 06:55:46 PST 2016
On 24.01.2016 at 14:37, Milian Wolff via cfe-dev wrote:
> What is the suggested way of handling this situation? Is there maybe prior art
> somewhere to efficiently translate between UTF-8/UTF-16 code locations that I
> could study?
What Jörg said.
You may want to look at ICU, see http://site.icu-project.org/
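For the plain offset translation itself, here is a minimal sketch in Python (not ICU, and not anything Clang provides). It assumes you have the source line as UTF-8 bytes and a column expressed as a byte offset. The function names are made up for illustration. It re-encodes the prefix on every call, so it is linear in the line length; for anything performance-sensitive you would want to cache or scan incrementally.

```python
def utf8_col_to_utf16_col(line: bytes, utf8_col: int) -> int:
    """Translate a byte offset into a UTF-8 line to a UTF-16 code unit offset.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError if
    utf8_col falls in the middle of a multi-byte sequence.
    """
    prefix = line[:utf8_col].decode("utf-8")
    # Each UTF-16 code unit is two bytes; astral characters take two units.
    return len(prefix.encode("utf-16-le")) // 2


def utf16_col_to_utf8_col(line: bytes, utf16_col: int) -> int:
    """Translate a UTF-16 code unit offset back to a UTF-8 byte offset.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError if
    utf16_col falls between the two halves of a surrogate pair.
    """
    units = line.decode("utf-8").encode("utf-16-le")
    prefix = units[: utf16_col * 2].decode("utf-16-le")
    return len(prefix.encode("utf-8"))
```

For example, in a line containing "a", the euro sign, an emoji outside the BMP, and "b", the "b" sits at byte offset 8 but at UTF-16 offset 4.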
Keep in mind that Unicode is more than just dealing with UTF-8 encoding.
There is also:
* Multiple characters that the user expects to count as one. Czechs treat
ch as a single letter, so they will expect the cursor to skip
over ch as if it were a single character. Maybe the Czechs don't mind if
you don't get it right for them, I don't know - we're in the murky
waters of cultural expectations here.
* Ligatures, i.e. single glyphs that are just two connected characters.
You want to assume a character break inside the glyph.
* Right-to-left scripts. Particularly nasty if RTL and LTR are mixed:
you will end up having to highlight discontinuous screen areas.
* Scripts where letters are written around the next letter. I forgot
which script has this, it might be Devanagari.
Line wrapping has more fun of that kind.
You'll need to decide how many of these things are relevant for your user
base.
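To give a feel for why "one character" is slippery, here is a rough Python sketch that glues combining marks onto the preceding base character. This is only an approximation that I am assuming for illustration: it is not the full Unicode grapheme cluster algorithm (no ZWJ sequences, no Hangul handling) and it knows nothing about locale rules, so it still splits the Czech ch in two. The function name is made up.

```python
import unicodedata


def approx_clusters(text: str) -> list:
    """Group code points into approximate user-perceived characters.

    Hypothetical helper: attaches any code point whose general category
    starts with "M" (Mn, Mc, Me) to the cluster before it. A real
    implementation should use a proper segmentation library such as ICU.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # mark belongs to the previous base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters
```

With this, "e" followed by U+0301 (combining acute) comes out as one cluster, and so does Devanagari KA followed by VOWEL SIGN I, even though the vowel sign is drawn to the left of the consonant. A cursor that steps by code point or by code unit would stop in the middle of both.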