[cfe-dev] UTF-8 vs. UTF-16 code locations

Milian Wolff via cfe-dev cfe-dev at lists.llvm.org
Sun Jan 24 10:54:04 PST 2016


On Sunday, 24 January 2016 15:55:46 CET Joachim Durchholz via cfe-dev wrote:
> On 24.01.2016 at 14:37, Milian Wolff via cfe-dev wrote:
> > What is the suggested way of handling this situation? Is there maybe prior
> > art somewhere to efficiently translate between UTF-8/UTF-16 code
> > locations that I could study?
> 
> What Jörg said.
> 
> You may want to look at ICU, see http://site.icu-project.org/
> 
> 
> Keep in mind that Unicode is more than just dealing with UTF-8 encoding.
> There is also:
> * Multiple characters that the user expects to count as one. Czechs think
> that ch is a single character, so they will expect the cursor to skip
> over ch as if it were a single character. Maybe the Czechs don't mind if
> you don't get it right for them, I don't know - we're in the murky
> waters of cultural expectations here.
> * Ligatures, i.e. single glyphs that are just two connected characters.
> You want to assume a character break inside the glyph.
> * Right-to-left scripts. Particularly nasty if RTL and LTR are mixed:
> you will end up having to highlight discontinuous screen areas.
> * Scripts where letters are written around the next letter. I forgot
> which script has this, it might be Devanagari.
> 
> Line wrapping has more fun of that kind.
> 
> You'll need to decide how many of these things are relevant for your user
> base.

Thanks guys,

I was aware of this. The question is more whether there is prior art for
translating between UTF-8 and UTF-16 code locations specifically. I expect
that most other IDEs/editors that embed clang use UTF-16 because it is used
by Qt, Windows and Java internally. I doubt we are the first to run into
this issue.

If it turns out that we are actually the only ones with this issue so far then 
I'll leave this issue unresolved for now. Too bad, but the effort required to 
fix it from scratch seems to be quite high. I wonder whether the other IDEs 
thought the same ;-)
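
For reference, the per-line translation itself can be sketched as below. This
is only an illustration, not code from clang, Qt or ICU: the function names
utf8ToUtf16Column and utf16ToUtf8Column are made up, offsets are 0-based, and
the sketch assumes the line holds valid UTF-8 and that the byte offset falls
on a code point boundary. It deals only with code units and ignores the
grapheme, ligature and bidi issues mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of UTF-16 code units that correspond to the
// first `byteOffset` bytes of `line`. Assumes valid UTF-8 and an offset
// that lies on a code point boundary.
std::size_t utf8ToUtf16Column(std::string_view line, std::size_t byteOffset)
{
    assert(byteOffset <= line.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        const unsigned char c = static_cast<unsigned char>(line[i]);
        if ((c & 0xC0) == 0x80)
            continue; // continuation byte, already counted via its lead byte
        // Lead bytes >= 0xF0 start a 4-byte sequence, i.e. a code point
        // outside the BMP, which needs a surrogate pair in UTF-16.
        units += (c >= 0xF0) ? 2 : 1;
    }
    return units;
}

// Hypothetical inverse: byte offset that corresponds to `utf16Column` code
// units. A column pointing into the middle of a surrogate pair is rounded
// up to the end of that code point.
std::size_t utf16ToUtf8Column(std::string_view line, std::size_t utf16Column)
{
    std::size_t units = 0;
    std::size_t i = 0;
    while (i < line.size() && units < utf16Column) {
        const unsigned char c = static_cast<unsigned char>(line[i]);
        // Sequence length from the lead byte (valid UTF-8 assumed).
        const std::size_t len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        units += (len == 4) ? 2 : 1;
        i += len;
    }
    return std::min(i, line.size()); // clamp in case the line is truncated
}
```

Both directions are a linear scan over the line, so the cost is proportional
to the line length per lookup. For pure ASCII lines the two offsets are
identical, so a real implementation could probably skip the scan for those.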

Bye
-- 
Milian Wolff
mail at milianw.de
http://milianw.de