<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Sun, Jan 24, 2016 at 7:54 PM Milian Wolff via cfe-dev &lt;<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Sunday, 24 January 2016 15:55:46 CET Joachim Durchholz via cfe-dev wrote:<br>
> On 24.01.2016 at 14:37, Milian Wolff via cfe-dev wrote:<br>
> > What is the suggested way of handling this situation? Is there maybe prior<br>
> > art somewhere to efficiently translate between UTF-8/UTF-16 code<br>
> > locations that I could study?<br>
><br>
> What Jörg said.<br>
><br>
> You may want to look at ICU, see <a href="http://site.icu-project.org/" rel="noreferrer" target="_blank">http://site.icu-project.org/</a><br>
><br>
><br>
> Keep in mind that Unicode is more than just dealing with UTF-8 encoding.<br>
> There is also:<br>
> * Multiple characters that the user expects to count as one. Czechs think<br>
> that ch is a single character, so they will expect the cursor to skip<br>
> over ch as if it were a single character. Maybe the Czech don't mind if<br>
> you don't get it right for them, I don't know - we're in the murky<br>
> waters of cultural expectations here.<br>
> * Ligatures, i.e. single glyphs that are just two connected characters.<br>
> You want to assume a character break inside the glyph.<br>
> * Right-to-left scripts. Particularly nasty if RTL and LTR are mixed:<br>
> you will end up having to highlight discontinuous screen areas.<br>
> * Scripts where letters are written around the next letter. I forgot<br>
> which script has this, it might be Devanagari.<br>
><br>
> Line wrapping has more fun of that kind.<br>
><br>
> You'll need to decide how many of these things are relevant for your user<br>
> base.<br>
<br>
Thanks guys,<br>
<br>
I was aware of this. The question is more whether there is prior art in that<br>
aspect. I expect that most other IDEs/editors that embed clang use UTF-16<br>
because it is used by Qt, Windows and Java internally. I doubt we are the<br>
first to run into this issue.<br>
<br>
If it turns out that we are actually the only ones with this issue so far then<br>
I'll leave this issue unresolved for now. Too bad, but the effort required to<br>
fix it from scratch seems to be quite high. I wonder whether the other IDEs<br>
thought the same ;-)<br></blockquote><div><br></div><div>I'm not sure what you're looking for. If you interface with a tool, you need to convert the data you have into the format the tool expects, and convert its results back.</div><div>You'll need to convert your internal representation to UTF-8 and back anyway, so you'll also need to convert the offsets from clang if it doesn't already give you character offsets.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Bye<br>
--<br>
Milian Wolff<br>
<a href="mailto:mail@milianw.de" target="_blank">mail@milianw.de</a><br>
<a href="http://milianw.de" rel="noreferrer" target="_blank">http://milianw.de</a><br>
</blockquote></div></div>
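<div>For what it's worth, the offset conversion mentioned in the reply above could look roughly like the sketch below. It assumes the tool reports byte offsets into a UTF-8 encoded line and that the editor counts UTF-16 code units; both are assumptions, not something confirmed in this thread. The function names are made up for illustration, and offsets are expected to fall on character boundaries (otherwise decoding raises an error).</div>

```python
def utf8_offset_to_utf16(line, byte_offset):
    """Map a byte offset into UTF-8 `line` (bytes) to a UTF-16 code unit offset.

    Assumes byte_offset lies on a character boundary.
    """
    prefix = line[:byte_offset].decode("utf-8")
    # Each UTF-16 code unit is two bytes; surrogate pairs count as two units.
    return len(prefix.encode("utf-16-le")) // 2


def utf16_offset_to_utf8(line, unit_offset):
    """Map a UTF-16 code unit offset back to a byte offset into UTF-8 `line`.

    Assumes unit_offset does not split a surrogate pair.
    """
    utf16 = line.decode("utf-8").encode("utf-16-le")
    prefix = utf16[:unit_offset * 2].decode("utf-16-le")
    return len(prefix.encode("utf-8"))


# "ä" takes 2 bytes in UTF-8 but 1 UTF-16 unit; the emoji takes 4 bytes but 2 units.
line = "aäb😀c".encode("utf-8")
print(utf8_offset_to_utf16(line, 8))  # offset of "c": prints 5
print(utf16_offset_to_utf8(line, 5))  # and back: prints 8
```

<div>This re-encodes the prefix on every call, which is linear in the line length. An editor that converts many locations per line would probably want to cache a per-line table instead.</div>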