[cfe-dev] UTF-8 vs. UTF-16 code locations
Milian Wolff via cfe-dev
cfe-dev at lists.llvm.org
Mon Jan 25 08:45:00 PST 2016
On Monday, January 25, 2016 11:25:22 AM CET Halfdan Ingvarsson wrote:
> If you ignore the existence of UTF16 surrogate pairs, then the mapping
> is quite trivial and can be done very quickly.
>
> E.g. Certain range blocks of UTF16 code units map to a certain number of
> UTF8 code units:
>
> 0x0000 - 0x007F -> 1 code unit
> 0x0080 - 0x07FF -> 2 code units
> 0x0800 - 0xFFFF -> 3 code units
>
> This allows you to quickly walk a line of UTF16 code units and get a
> corresponding UTF8 code unit location.
>
> The converse is to check the high-order bits of the leading UTF8 code
> unit to see how many to skip over to walk across a single UTF16 code unit.
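
A rough sketch of the range table quoted above, for reference. The names utf8UnitsFor and utf16ColumnToUtf8 are made up for illustration and are not part of clang or Qt. As in the quoted mail, surrogate pairs are ignored, and columns are treated as 0-based offsets within a single line:

```cpp
#include <cstddef>
#include <string>

// Number of UTF-8 code units needed to encode one UTF-16 code unit.
// Assumption: surrogate pairs are ignored, so every UTF-16 code unit is
// treated as a complete BMP character.
inline std::size_t utf8UnitsFor(char16_t unit) {
    if (unit <= 0x007F) return 1;
    if (unit <= 0x07FF) return 2;
    return 3;
}

// Hypothetical helper: translate a 0-based UTF-16 column within `line`
// into the corresponding 0-based UTF-8 byte offset by walking the line.
inline std::size_t utf16ColumnToUtf8(const std::u16string& line,
                                     std::size_t utf16Column) {
    std::size_t utf8Column = 0;
    for (std::size_t i = 0; i < utf16Column && i < line.size(); ++i)
        utf8Column += utf8UnitsFor(line[i]);
    return utf8Column;
}
```

The walk is linear in the column, so a caller translating many positions on the same line would probably want to cache intermediate results.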
Thanks for the input!
The missing step then for me is an efficient way to access the contents of a
line. With clang-c, the only way I see is a costly clang_tokenize call. Is
there an alternative on the C++ side of clang? I see SourceManager::getCharacterData -
would that be the right API to use? If so, I'll whip up a patch to make this
accessible via clang-c, such that we can build a somewhat efficient mapping
procedure on top of that.
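
To make the idea concrete, here is a minimal sketch of the converse direction (UTF-8 byte column to UTF-16 column), assuming the raw bytes of the line are already in hand. How to obtain them from clang is exactly the open question above. The helper names are hypothetical, columns are 0-based offsets within the line, and the input is assumed to be well-formed UTF-8. Unlike the simplified table, it counts a 4-byte UTF-8 sequence as two UTF-16 code units (a surrogate pair):

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by lead byte `b`,
// determined from its high-order bits.
inline std::size_t utf8SequenceLength(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;  // stray continuation or invalid byte: step over one byte
}

// Hypothetical helper: translate a 0-based UTF-8 byte offset within `line`
// into the corresponding 0-based UTF-16 column.
inline std::size_t utf8ColumnToUtf16(const std::string& line,
                                     std::size_t utf8Column) {
    std::size_t utf16Column = 0;
    std::size_t i = 0;
    while (i < utf8Column && i < line.size()) {
        const std::size_t len =
            utf8SequenceLength(static_cast<unsigned char>(line[i]));
        // Characters outside the BMP need a surrogate pair in UTF-16.
        utf16Column += (len == 4) ? 2 : 1;
        i += len;
    }
    return utf16Column;
}
```

For the "ümlaut" line from the original mail, byte offset 14 (the 'c' before "++") comes out as UTF-16 column 13, which is the off-by-one in question.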
Thanks
> On 2016-01-24 08:37 AM, Milian Wolff via cfe-dev wrote:
> > Hey all,
> >
> > what would be the best way to get UTF-16 code locations from the clang-c
> > API?
> >
> > As far as I can see it's not currently possible, and I wonder if it would
> > be possible with the C++ API which I could then wrap in a new C function.
> >
> > The reason I'm asking is that we in KDevelop work with QString offsets in
> > the editor, which is internally UTF-16 encoded. Now imagine we parse a
> > UTF-8 encoded text file with the following contents:
> >
> > void foo() {
> >
> > int c = 0;
> > /* ümlaut */ c++;
> >
> > }
> >
> > Any API in clang-c that takes or returns a column will be off-by-one from
> > what we expect from an editor/UTF-16 column pov, due to the 'ü' which
> > takes up two UTF-8 code units but just one UTF-16 code unit. This
> > breaks our highlighting and code browsing features, but thankfully such
> > input is rare. I'd still like to fix it though if possible and if it
> > doesn't cost too much runtime performance.
> >
> > What is the suggested way of handling this situation? Is there maybe prior
> > art somewhere to efficiently translate between UTF-8/UTF-16 code
> > locations that I could study?
> >
> > Thanks
--
Milian Wolff
mail at milianw.de
http://milianw.de