[cfe-dev] UTF-8 vs. UTF-16 code locations
Joachim Durchholz via cfe-dev
cfe-dev at lists.llvm.org
Mon Jan 25 07:10:56 PST 2016
On 24.01.2016 at 19:54, Milian Wolff wrote:
> I was aware of this. The question is more whether there is prior art in that
> aspect.
Well, there is ICU, if you want to be sure that the algorithms are really correct.
Not sure about prior art for keeping the offsets in sync. This sounds
like a pretty standard editor task to me, which mostly follows from what
data structures are already there and whether you
> I expect that most other IDEs/editors that embed clang use UTF16
> because it is used by Qt, Windows and Java internally. I doubt we are the
> first to run into this issue.
I don't know about Qt.
Windows GDI uses code pages, and UTF-16 for file names. No real offset
stuff in that. The editor components are essentially opaque, you throw
in the whole text, cr/lf and all, and let the component do its thing.
Windows editors use their own routines, I suppose, so there is no real
Windows support anyway.
I don't know what Windows with .NET does.
For Java, you usually slurp in the full file and don't even think about
what the original encoding was until you write stuff back.
> If it turns out that we are actually the only ones with this issue so far then
> I'll leave this issue unresolved for now. Too bad, but the effort required to
> fix it from scratch seems to be quite high. I wonder whether the other IDEs
> thought the same ;-)
Eclipse can handle UTF-8 files just fine. I think it's going the
standard route for Java code - but then source code files are usually
small enough that you can easily keep them in the heap.
Editing multi-gigabyte logs is an entirely different issue, and it's
surprisingly hard to find an editor that can handle this scenario
without going brick mode.
So... question is: What's your use case actually? Is it feasible to read
and convert the file in one go, and never bother keeping a relationship
to file positions until you write it back?
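If reading the whole file in one go is feasible, mapping a position between the two encodings can be sketched roughly as follows. This is only a minimal Python illustration, not something taken from clang or any IDE: the function name utf8_to_utf16_offset is hypothetical, and it assumes the buffer is valid UTF-8 and that the byte offset falls on a character boundary.

```python
def utf8_to_utf16_offset(data: bytes, byte_offset: int) -> int:
    """Map a UTF-8 byte offset to a UTF-16 code unit offset.

    Hypothetical helper for illustration. Assumes `data` is valid UTF-8
    and `byte_offset` lies on a character boundary.
    """
    # Decode only the text in front of the position of interest.
    prefix = data[:byte_offset].decode("utf-8")
    # Code points above U+FFFF need a surrogate pair (two UTF-16 code units);
    # everything else takes exactly one code unit.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in prefix)


# 'a' is 1 byte, 'é' is 2 bytes, the emoji is 4 bytes, so 'b' starts at byte 7.
sample = "a\u00e9\U0001F600b".encode("utf-8")
offset_of_b = utf8_to_utf16_offset(sample, 7)
```

Each call rescans the prefix, so this is linear in the offset. For many lookups on the same buffer one would probably cache per-line offsets instead, which is the kind of bookkeeping an editor's existing data structures tend to dictate.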
Regards,
Jo