[cfe-dev] UTF-8 vs. UTF-16 code locations

Mon Jan 25 08:49:39 PST 2016

On Mon, Jan 25, 2016 at 5:45 PM Milian Wolff via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> On Monday, January 25, 2016 11:25:22 AM CET Halfdan Ingvarsson wrote:
> > If you ignore the existence of UTF16 surrogate pairs, then the mapping
> > is quite trivial and can be done very quickly.
> >
> > E.g. Certain range blocks of UTF16 code units map to a certain number of
> > UTF8 code units:
> >
> > 0x0000 - 0x007F -> 1 code unit
> > 0x0080 - 0x07FF -> 2 code units
> > 0x0800 - 0xFFFF -> 3 code units
> >
> > This allows you to quickly walk a line of UTF16 code units and get a
> > corresponding UTF8 code unit location.
> >
> > The converse is to check the high-order bits of the leading UTF8 code
> > unit to see how many code units to skip over to walk across a single
> > UTF16 code unit.
>
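
The range table quoted above can be sketched in C++ roughly as follows. This is a minimal illustration, not code from clang or Qt: the function name `utf8OffsetFromUtf16` and its signature are hypothetical. Unlike the simplified table, it also treats a surrogate pair as one 4-byte UTF-8 sequence, and it assumes 0-based offsets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of clang or Qt): returns how many UTF-8
// code units encode the first `utf16Column` UTF-16 code units of `line`.
// Offsets are 0-based. If `utf16Column` points into the middle of a
// surrogate pair, the whole pair is counted.
std::size_t utf8OffsetFromUtf16(const std::u16string& line,
                                std::size_t utf16Column) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < utf16Column && i < line.size(); ++i) {
        const char16_t u = line[i];
        if (u < 0x0080) {
            bytes += 1;  // 0x0000 - 0x007F -> 1 code unit
        } else if (u < 0x0800) {
            bytes += 2;  // 0x0080 - 0x07FF -> 2 code units
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < line.size() &&
                   line[i + 1] >= 0xDC00 && line[i + 1] <= 0xDFFF) {
            bytes += 4;  // surrogate pair -> one 4-byte UTF-8 sequence
            ++i;         // the pair occupies two UTF-16 code units
        } else {
            bytes += 3;  // rest of 0x0800 - 0xFFFF -> 3 code units
        }
    }
    return bytes;
}
```

For the `/* ümlaut */ c++;` line from the original question, the `c` of `c++` sits at UTF-16 offset 16 and this sketch maps it to UTF-8 offset 17.
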
> Thanks for the input!
>
> The missing step then for me is an efficient way to access the contents of
> a line. With clang-c, the only way I see is a costly clang_tokenize call. Is
> there an alternative on the C++ side of clang? I see
> SourceManager::getCharacterData - would that be the right API to use? If so,
> I'll whip up a patch to make this accessible via clang-c, such that we can
> build a somewhat efficient mapping procedure on top of that.
>

Don't you already have the file as utf-8 so you can hand it into clang? Is
there a reason not to get the line out of that format?
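
As a rough sketch of that idea: given the UTF-8 bytes of one line from the buffer the client already holds, a byte offset can be translated to a UTF-16 offset by inspecting only the leading byte of each sequence. The name `utf16OffsetFromUtf8` is hypothetical and not an existing clang or clang-c function. The code assumes well-formed UTF-8 and 0-based offsets; to my knowledge the columns libclang reports are 1-based, so a caller would subtract one before the call and add one to the result.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of clang-c): maps a 0-based UTF-8 byte
// offset within `line` to a 0-based UTF-16 code unit offset. Assumes
// well-formed UTF-8. An offset that points into the middle of a multi-byte
// sequence is rounded up to the end of that sequence.
std::size_t utf16OffsetFromUtf8(const std::string& line,
                                std::size_t utf8Offset) {
    std::size_t units = 0;
    std::size_t i = 0;
    while (i < utf8Offset && i < line.size()) {
        const unsigned char lead = static_cast<unsigned char>(line[i]);
        if (lead < 0x80) {                 // 0xxxxxxx: 1 byte, 1 unit
            i += 1; units += 1;
        } else if ((lead >> 5) == 0x06) {  // 110xxxxx: 2 bytes, 1 unit
            i += 2; units += 1;
        } else if ((lead >> 4) == 0x0E) {  // 1110xxxx: 3 bytes, 1 unit
            i += 3; units += 1;
        } else if ((lead >> 3) == 0x1E) {  // 11110xxx: 4 bytes, surrogate pair
            i += 4; units += 2;
        } else {                           // stray byte: skip it, count 1 unit
            i += 1; units += 1;
        }
    }
    return units;
}
```

With the example line from the original question, byte offset 17 (the `c` of `c++`) comes out as UTF-16 offset 16, which is the off-by-one the editor needs corrected. The cost is a single linear pass over the line up to the requested column.
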

>
> Thanks
>
> > On 2016-01-24 08:37 AM, Milian Wolff via cfe-dev wrote:
> > > Hey all,
> > >
> > > what would be the best way to get UTF-16 code locations from the
> > > clang-c API?
> > >
> > > As far as I can see it's not currently possible, and I wonder if it
> > > would be possible with the C++ API which I could then wrap in a new C
> > > function.
> > >
> > > The reason I'm asking is that we in KDevelop work with QString offsets
> > > in the editor, which is internally UTF-16 encoded. Now imagine we parse
> > > a UTF-8 encoded text file with the following contents:
> > >
> > > void foo() {
> > >
> > >    int c = 0;
> > >    /* ümlaut */ c++;
> > >
> > > }
> > >
> > > Any API in clang-c that takes or returns a column will be off-by-one
> > > from what we expect from an editor/UTF-16 column pov, due to the 'ü'
> > > which takes up two UTF-8 code units but just one UTF-16 code unit. This
> > > breaks our highlighting and code browsing features, but thankfully such
> > > input is rare. I'd still like to fix it though if possible and if it
> > > doesn't cost too much runtime performance.
> > >
> > > What is the suggested way of handling this situation? Is there maybe
> > > prior art somewhere to efficiently translate between UTF-8/UTF-16 code
> > > locations that I could study?
> > >
> > > Thanks
> > >
> > >
> > > _______________________________________________
> > > cfe-dev mailing list
> > > cfe-dev at lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
> --
> Milian Wolff
> mail at milianw.de
> http://milianw.de