[cfe-dev] UTF-8 vs. UTF-16 code locations

Mon Jan 25 07:10:56 PST 2016

On 24.01.2016 at 19:54, Milian Wolff wrote:
> I was aware of this. The question is more whether there is prior art in that
> aspect.

Well, there is ICU, if you want to be sure that the algorithms are really correct.
Not sure about prior art for keeping the offsets in sync. This sounds 
like a pretty standard editor task to me, which mostly follows from what 
data structures are already there and whether you

> I expect that most other IDEs/editors that embed clang use UTF16
> because it is used by Qt, Windows and Java internally. I doubt we are the
> first to run into this issue.

I don't know about Qt.
Windows GDI uses code pages, and UTF-16 for file names. No real offset 
stuff in that. The editor components are essentially opaque, you throw 
in the whole text, cr/lf and all, and let the component do its thing. 
Windows editors use their own routines I suppose, so no real Windows 
support anyway.
I don't know what Windows with .NET does.
For Java, you usually slurp in the full file and don't even think about 
what the original encoding was, until you write stuff back.
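
To illustrate the "slurp it in and forget the encoding" approach, here is 
a minimal sketch in Python rather than Java. The function names load_text 
and store_text are made up for this mail, and UTF-8 as the on-disk 
encoding is an assumption:

```python
def load_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the whole file content in one go.

    After this call, all positions are character indices into the
    returned string; byte offsets into `raw` no longer apply.
    """
    return raw.decode(encoding)


def store_text(text: str, encoding: str = "utf-8") -> bytes:
    """Encode the edited text again; only now does the encoding matter."""
    return text.encode(encoding)
```

Everything between load_text and store_text works on the decoded string, 
so no byte offsets are tracked at all. The price is that the whole file 
has to fit in memory.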

> If it turns out that we are actually the only ones with this issue so far then
> I'll leave this issue unresolved for now. Too bad, but the effort required to
> fix it from scratch seems to be quite high. I wonder whether the other IDEs
> thought the same ;-)

Eclipse can handle UTF-8 files just fine. I think it's going the 
standard route for Java code - but then source code files are usually 
small enough that you can easily keep them in the heap.

Editing multi-gigabyte logs is an entirely different issue, and it's 
surprisingly hard to find an editor that can handle this scenario 
without going brick mode.

So... question is: What's your use case actually? Is it feasible to read 
and convert the file in one go, and never bother keeping a relationship 
to file positions until you write it back?
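
If the answer is no and you do need to map positions, doing it per line 
is probably the cheapest route. A rough sketch in Python, assuming the 
clang side hands you a UTF-8 byte offset within a line and the editor 
side wants UTF-16 code units; both function names are hypothetical:

```python
def utf8_to_utf16_offset(line: bytes, byte_offset: int) -> int:
    """Map a UTF-8 byte offset within `line` to a UTF-16 code unit offset.

    Raises UnicodeDecodeError if `byte_offset` falls inside a
    multi-byte sequence.
    """
    prefix = line[:byte_offset].decode("utf-8")
    # Each UTF-16 code unit is two bytes, BMP or surrogate alike.
    return len(prefix.encode("utf-16-le")) // 2


def utf16_to_utf8_offset(line: bytes, unit_offset: int) -> int:
    """Map a UTF-16 code unit offset within `line` to a UTF-8 byte offset.

    Raises UnicodeDecodeError if `unit_offset` splits a surrogate pair.
    """
    units = line.decode("utf-8").encode("utf-16-le")
    prefix = units[:unit_offset * 2].decode("utf-16-le")
    return len(prefix.encode("utf-8"))
```

For a line like auto s = "ä𝄞x"; the x sits at byte offset 16 in UTF-8 
but at code unit offset 13 in UTF-16, because ä takes two bytes and one 
unit, and 𝄞 takes four bytes and two units. This re-encodes the prefix 
on every call, which is fine for single lines but not something to run 
over a whole multi-gigabyte file.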

Regards,
Jo