[cfe-dev] UTF-8 vs. UTF-16 code locations

Mon Jan 25 08:45:00 PST 2016

On Monday, January 25, 2016 11:25:22 AM CET Halfdan Ingvarsson wrote:
> If you ignore the existence of UTF16 surrogate pairs, then the mapping
> is quite trivial and can be done very quickly.
> 
> E.g. Certain range blocks of UTF16 code units map to a certain number of
> UTF8 code units:
> 
> 0x0000 - 0x007F -> 1 code unit
> 0x0080 - 0x07FF -> 2 code units
> 0x0800 - 0xFFFF -> 3 code units
> 
> This allows you to quickly walk a line of UTF16 code units and get a
> corresponding UTF8 code unit location.
> 
> The converse is to check the high-order bits of the leading UTF8 code
> unit to see how many to skip over to walk across a single UTF16 code unit.
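
To make sure I read the table correctly, here is a minimal sketch of the
UTF-16 to UTF-8 direction. The function names are made up for illustration
and are not part of clang or Qt. Columns are 0-based here, and surrogate
pairs are ignored exactly as suggested above; since a surrogate pair encodes
to 4 UTF-8 code units in total, this sketch would over-count on such input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-8 code units needed for one UTF-16 code unit.
// Assumption: surrogate pairs are ignored, as in the table above.
inline std::size_t utf8LengthOfUtf16Unit(char16_t unit) {
    if (unit <= 0x007F) return 1;
    if (unit <= 0x07FF) return 2;
    return 3;
}

// Map a 0-based UTF-16 column to the matching 0-based UTF-8 column by
// walking the UTF-16 code units of one line (hypothetical helper).
inline std::size_t utf16ColumnToUtf8Column(const std::u16string &line,
                                           std::size_t utf16Column) {
    assert(utf16Column <= line.size());
    std::size_t utf8Column = 0;
    for (std::size_t i = 0; i < utf16Column; ++i)
        utf8Column += utf8LengthOfUtf16Unit(line[i]);
    return utf8Column;
}
```

The cost is linear in the column, which should be fine for typical source
lines.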

Thanks for the input!

The missing step for me, then, is an efficient way to access the contents of a
line. With clang-c, the only way I see is a costly clang_tokenize call. Is
there an alternative on the C++ side of clang? I see
SourceManager::getCharacterData - would that be the right API to use? If so,
I'll whip up a patch to make this accessible via clang-c, such that we can
build a somewhat efficient mapping procedure on top of that.
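
Roughly, the mapping procedure I have in mind for the UTF-8 to UTF-16
direction would look like the sketch below. It assumes I already have the
UTF-8 bytes of the line, however they end up being obtained. The function
name is hypothetical, columns are 0-based (clang-c reports 1-based columns,
so the caller would have to adjust), and unlike the simplified table it
counts a 4-byte sequence as two UTF-16 code units, since that is what a
surrogate pair occupies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Map a 0-based UTF-8 column (byte offset in the line) to the matching
// 0-based UTF-16 column by inspecting the high-order bits of each leading
// byte. Hypothetical helper, not an existing clang or Qt function.
inline std::size_t utf8ColumnToUtf16Column(const std::string &line,
                                           std::size_t utf8Column) {
    assert(utf8Column <= line.size());
    std::size_t utf16Column = 0;
    std::size_t i = 0;
    while (i < utf8Column) {
        const unsigned char lead = static_cast<unsigned char>(line[i]);
        if (lead < 0x80) {                 // 0xxxxxxx: 1 byte, 1 unit
            i += 1; utf16Column += 1;
        } else if ((lead >> 5) == 0x06) {  // 110xxxxx: 2 bytes, 1 unit
            i += 2; utf16Column += 1;
        } else if ((lead >> 4) == 0x0E) {  // 1110xxxx: 3 bytes, 1 unit
            i += 3; utf16Column += 1;
        } else if ((lead >> 3) == 0x1E) {  // 11110xxx: 4 bytes, 2 units
            i += 4; utf16Column += 2;
        } else {                           // stray or invalid byte:
            i += 1; utf16Column += 1;      // count it as a single unit
        }
    }
    return utf16Column;
}
```

For the example from my first mail, the 'c' of "c++" sits at UTF-8 column 17
but UTF-16 column 16 (both 0-based), which is the off-by-one we see.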

Thanks

> On 2016-01-24 08:37 AM, Milian Wolff via cfe-dev wrote:
> > Hey all,
> > 
> > what would be the best way to get UTF-16 code locations from the clang-c
> > API?
> > 
> > As far as I can see it's not currently possible, and I wonder if it would
> > be possible with the C++ API which I could then wrap in a new C function.
> > 
> > The reason I'm asking is that we in KDevelop work with QString offsets in
> > the editor, and QString is internally UTF-16 encoded. Now imagine we
> > parse a UTF-8 encoded text file with the following contents:
> > 
> > void foo() {
> > 
> >    int c = 0;
> >    /* ümlaut */ c++;
> > 
> > }
> > 
> > Any API in clang-c that takes or returns a column will be off-by-one from
> > what we expect from an editor/UTF-16 column pov, due to the 'ü' which
> > takes up two UTF-8 code units but just one UTF-16 code unit. This
> > breaks our highlighting and code browsing features, but thankfully such
> > input is rare. I'd still like to fix it though if possible and if it
> > doesn't cost too much runtime performance.
> > 
> > What is the suggested way of handling this situation? Is there maybe prior
> > art somewhere to efficiently translate between UTF-8/UTF-16 code
> > locations that I could study?
> > 
> > Thanks

-- 
Milian Wolff
mail at milianw.de
http://milianw.de