[cfe-dev] UTF-8 vs. UTF-16 code locations

Manuel Klimek via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 25 06:25:43 PST 2016


On Sun, Jan 24, 2016 at 7:54 PM Milian Wolff via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> On Sunday, 24 January 2016 15:55:46 CET Joachim Durchholz via cfe-dev
> wrote:
> > On 24.01.2016 at 14:37, Milian Wolff via cfe-dev wrote:
> > > What is the suggested way of handling this situation? Is there maybe
> > > prior art somewhere to efficiently translate between UTF-8/UTF-16 code
> > > locations that I could study?
> >
> > What Jörg said.
> >
> > You may want to look at ICU, see http://site.icu-project.org/
> >
> >
> > Keep in mind that Unicode is more than just dealing with UTF-8 encoding.
> > There is also:
> > * Multiple characters that users expect to count as one. Czechs think
> > that ch is a single character, so they will expect the cursor to skip
> > over ch as if it were a single character. Maybe the Czechs don't mind if
> > you don't get it right for them, I don't know - we're in the murky
> > waters of cultural expectations here.
> > * Ligatures, i.e. single glyphs that are just two connected characters.
> > You want to assume a character break inside the glyph.
> > * Right-to-left scripts. Particularly nasty if RTL and LTR are mixed:
> > you will end up having to highlight discontinuous screen areas.
> > * Scripts where letters are written around the next letter. I forgot
> > which script has this, it might be Devanagari.
> >
> > Line wrapping has more fun of that kind.
> >
> > You'll need to decide how many of these things are relevant for your user
> > base.
>
> Thanks guys,
>
> I was aware of this. The question is more whether there is prior art in
> that aspect. I expect that most other IDEs/editors that embed clang use
> UTF-16 because it is used by Qt, Windows and Java internally. I doubt we
> are the first to run into this issue.
>
> If it turns out that we are actually the only ones with this issue so far
> then I'll leave this issue unresolved for now. Too bad, but the effort
> required to fix it from scratch seems to be quite high. I wonder whether
> the other IDEs thought the same ;-)
>

I'm not sure what you're looking for - if you interface with a tool, you
need to convert the data you have into the format the tool expects, and back.
You'll need to convert your internal representation to UTF-8 and back
anyway, so you'll also need to convert the offsets from clang if it doesn't
give you character offsets already.
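
To illustrate the kind of offset conversion meant here, the following is a
minimal sketch in Python, not an existing clang or Qt API. It assumes the
tool reports offsets as UTF-8 byte counts within a line, the editor counts
UTF-16 code units, and the input is valid UTF-8. The function names
utf8_to_utf16_offset and utf16_to_utf8_offset are made up for this example.

```python
def utf8_to_utf16_offset(line: bytes, byte_offset: int) -> int:
    """Translate a UTF-8 byte offset in `line` into a UTF-16 code unit offset.

    Hypothetical helper; assumes `line` holds valid UTF-8.
    """
    if not 0 <= byte_offset <= len(line):
        raise ValueError("byte offset out of range")
    units = 0
    for byte in line[:byte_offset]:
        if (byte & 0xC0) == 0x80:
            continue  # continuation byte: its code point was counted at the lead byte
        # A lead byte >= 0xF0 starts a 4-byte sequence, i.e. a code point above
        # U+FFFF, which UTF-16 stores as a surrogate pair (two code units).
        units += 2 if byte >= 0xF0 else 1
    return units


def utf16_to_utf8_offset(line: bytes, unit_offset: int) -> int:
    """Translate a UTF-16 code unit offset into a UTF-8 byte offset in `line`.

    Hypothetical helper; assumes `line` holds valid UTF-8. An offset that
    falls inside a surrogate pair is rounded up to the end of that code point.
    """
    if unit_offset < 0:
        raise ValueError("negative code unit offset")
    units = 0
    i = 0
    while i < len(line) and units < unit_offset:
        byte = line[i]
        if byte >= 0xF0:
            length, width = 4, 2  # 4 bytes in UTF-8, surrogate pair in UTF-16
        elif byte >= 0xE0:
            length, width = 3, 1
        elif byte >= 0xC0:
            length, width = 2, 1
        else:
            length, width = 1, 1  # ASCII (or a stray byte, counted as one unit)
        units += width
        i += length
    return min(i, len(line))
```

Both functions walk the line from its start, so the cost is linear in the
line length. Converting per line rather than per file keeps that cost small;
an editor that needs many lookups in the same line could cache the mapping
instead. Note that this only maps between code units and does not address
the grapheme, ligature or bidirectional text issues mentioned earlier in the
thread.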


>
> Bye
> --
> Milian Wolff
> mail at milianw.de
> http://milianw.de
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>