[cfe-dev] UTF-8 vs. UTF-16 code locations
Milian Wolff via cfe-dev
cfe-dev at lists.llvm.org
Sun Jan 24 05:37:38 PST 2016
Hey all,
what would be the best way to get UTF-16 code locations from the clang-c API?
As far as I can see it's not currently possible, and I wonder if it would be
possible with the C++ API which I could then wrap in a new C function.
The reason I'm asking is that we in KDevelop work with QString offsets in the
editor, which is internally UTF-16 encoded. Now imagine we parse a UTF-8
encoded text file with the following contents:
void foo() {
    int c = 0;
    /* ümlaut */ c++;
}
Any API in clang-c that takes or returns a column will be off by one from what
we expect from an editor/UTF-16 column pov for everything after the 'ü', which
takes up two UTF-8 code units (bytes) but just one UTF-16 code unit. This breaks
our highlighting and code browsing features, but thankfully such input is rare.
I'd still like to fix it though, if possible and if it doesn't cost too much
runtime performance.
What is the suggested way of handling this situation? Is there maybe prior art
somewhere to efficiently translate between UTF-8/UTF-16 code locations that I
could study?
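To make the question concrete, here is a rough sketch of the kind of per-line
translation I have in mind. It assumes that the column clang reports is a
1-based byte offset into the line, that the line is valid UTF-8, and that the
column points at the start of a character. The function name
utf16ColumnFromUtf8Column is made up for illustration; it is not part of
clang-c or Qt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts a 1-based UTF-8 byte column into a 1-based
// UTF-16 code unit column for a single line of text.
// Assumes `line` holds valid UTF-8 without the trailing newline.
std::size_t utf16ColumnFromUtf8Column(const std::string& line, std::size_t byteColumn)
{
    assert(byteColumn >= 1); // columns are assumed to be 1-based

    // Number of bytes that precede the requested column, clamped to the line.
    std::size_t bytesBefore = byteColumn - 1;
    if (bytesBefore > line.size())
        bytesBefore = line.size();

    std::size_t units = 0;
    for (std::size_t i = 0; i < bytesBefore; ++i) {
        const unsigned char byte = static_cast<unsigned char>(line[i]);
        if ((byte & 0xC0) == 0x80)
            continue; // continuation byte, already counted with its lead byte
        // Lead bytes of 4-byte sequences encode characters outside the BMP,
        // which need a surrogate pair (two code units) in UTF-16.
        units += (byte >= 0xF0) ? 2 : 1;
    }
    return units + 1; // back to a 1-based column
}
```

This is linear in the length of the line for every lookup, so a real
implementation would probably want to skip the scan for lines that are pure
ASCII, where both columns are identical.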
Thanks
--
Milian Wolff
mail at milianw.de
http://milianw.de