<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Mon, Jan 25, 2016 at 5:45 PM Milian Wolff via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Monday, January 25, 2016 11:25:22 AM CET Halfdan Ingvarsson wrote:<br>

> If you ignore the existence of UTF16 surrogate pairs, then the mapping<br>

> is quite trivial and can be done very quickly.<br>

><br>

> E.g. Certain range blocks of UTF16 code units map to a certain number of<br>

> UTF8 code units:<br>

><br>

> 0x0000 - 0x007F -> 1 code unit<br>

> 0x0080 - 0x07FF -> 2 code units<br>

> 0x0800 - 0xFFFF -> 3 code units<br>

><br>

> This allows you to quickly walk a line of UTF16 code units and get a<br>

> corresponding UTF8 code unit location.<br>
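A minimal sketch of that walk, assuming the line is already available as a `std::u16string` and columns are 0-based counts of code units. The names `utf8UnitsFor` and `utf16ColumnToUtf8Offset` are hypothetical, not part of clang or Qt. The surrogate branch is an addition beyond the table above: counting each surrogate half as 2 makes a full pair add up to the 4 UTF-8 code units its code point needs.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-8 code units contributed by one UTF-16 code unit,
// following the ranges listed in the message above.
inline std::size_t utf8UnitsFor(char16_t u) {
    if (u < 0x0080) return 1;
    if (u < 0x0800) return 2;
    // Assumption beyond the quoted table: a surrogate half counts as 2,
    // so a complete pair sums to the 4 bytes of its UTF-8 encoding.
    if (u >= 0xD800 && u <= 0xDFFF) return 2;
    return 3;
}

// Hypothetical helper: UTF-8 code unit offset of a 0-based UTF-16 column.
// Columns past the end of the line are clamped to the line length.
inline std::size_t utf16ColumnToUtf8Offset(const std::u16string& line,
                                           std::size_t utf16Column) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < utf16Column && i < line.size(); ++i)
        offset += utf8UnitsFor(line[i]);
    return offset;
}
```

A column that falls between the two halves of a surrogate pair would map to the middle of a 4-byte sequence, so callers that can see such input would need to round to a pair boundary first.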

><br>

> The converse is to check the high-order bits of the leading UTF8 code<br>

> unit to see how many to skip over to walk across a single UTF16 code unit.<br>
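The converse direction could look roughly like this, assuming the line is held as UTF-8 bytes in a `std::string` and the offset is a 0-based byte count that lands on a sequence boundary. The name `utf8OffsetToUtf16Column` is hypothetical. Treating a 4-byte sequence as two UTF-16 code units (a surrogate pair) and stepping over unexpected bytes one at a time are assumptions of this sketch, not something stated in the thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: 0-based UTF-16 column for a UTF-8 byte offset.
// The high-order bits of each leading byte give the sequence length.
inline std::size_t utf8OffsetToUtf16Column(const std::string& line,
                                           std::size_t utf8Offset) {
    std::size_t column = 0;
    std::size_t i = 0;
    while (i < utf8Offset && i < line.size()) {
        const unsigned char lead = static_cast<unsigned char>(line[i]);
        if (lead < 0x80) {                 // 0xxxxxxx: 1 byte, 1 UTF-16 unit
            i += 1; column += 1;
        } else if ((lead & 0xE0) == 0xC0) { // 110xxxxx: 2 bytes, 1 UTF-16 unit
            i += 2; column += 1;
        } else if ((lead & 0xF0) == 0xE0) { // 1110xxxx: 3 bytes, 1 UTF-16 unit
            i += 3; column += 1;
        } else if ((lead & 0xF8) == 0xF0) { // 11110xxx: 4 bytes, surrogate pair
            i += 4; column += 2;
        } else {                           // stray byte: step over it (assumption)
            i += 1; column += 1;
        }
    }
    return column;
}
```

For the comment line in the original example, the byte offset of the 'm' after 'ü' is 8, and this walk maps it back to UTF-16 column 7.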

<br>

Thanks for the input!<br>

<br>

The missing step then for me is an efficient way to access the contents of a<br>

line. With clang-c, the only way I see is a costly clang_tokenize call. Is<br>

there an API on the C++ side of clang? I see SourceManager::getCharacterData -<br>

would that be the right API to use? If so, I'll whip up a patch to make this<br>

accessible via clang-c, such that we can build a somewhat efficient mapping<br>

procedure on top of that.<br></blockquote><div><br></div><div>Don't you already have the file as utf-8 so you can hand it into clang? Is there a reason not to get the line out of that format?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Thanks<br>

<br>

> On 2016-01-24 08:37 AM, Milian Wolff via cfe-dev wrote:<br>

> > Hey all,<br>

> ><br>

> > what would be the best way to get UTF-16 code locations from the clang-c<br>

> > API?<br>

> ><br>

> > As far as I can see it's not currently possible, and I wonder if it would<br>

> > be possible with the C++ API which I could then wrap in a new C function.<br>

> ><br>

> > The reason I'm asking is that we in KDevelop work with QString offsets in<br>

> > the editor, which is internally UTF-16 encoded. Now imagine we parse a<br>

> > UTF-8 encoded text file with the following contents:<br>

> ><br>

> > void foo() {<br>

> ><br>

> >    int c = 0;<br>

> >    /* ümlaut */ c++;<br>

> ><br>

> > }<br>

> ><br>

> > Any API in clang-c that takes or returns a column will be off-by-one from<br>

> > what we expect from an editor/UTF-16 column pov, due to the 'ü' which<br>

> > takes up two UTF-8 code units but just one UTF-16 code unit. This<br>

> > breaks our highlighting and code browsing features, but thankfully such<br>

> > input is rare. I'd still like to fix it though if possible and if it<br>

> > doesn't cost too much runtime performance.<br>

> ><br>

> > What is the suggested way of handling this situation? Is there maybe prior<br>

> > art somewhere to efficiently translate between UTF-8/UTF-16 code<br>

> > locations that I could study?<br>

> ><br>

> > Thanks<br>

> ><br>

> ><br>

> > _______________________________________________<br>

> > cfe-dev mailing list<br>

> > <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>

> > <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>

<br>

<br>

--<br>

Milian Wolff<br>

<a href="mailto:mail@milianw.de" target="_blank">mail@milianw.de</a><br>

<a href="http://milianw.de" rel="noreferrer" target="_blank">http://milianw.de</a><br>


</blockquote></div></div>