<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Mon, Jan 25, 2016 at 5:45 PM Milian Wolff via cfe-dev &lt;<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Monday, January 25, 2016 11:25:22 AM CET Halfdan Ingvarsson wrote:<br>
> If you ignore the existence of UTF16 surrogate pairs, then the mapping<br>
> is quite trivial and can be done very quickly.<br>
><br>
> E.g., certain ranges of UTF16 code units each map to a fixed number of<br>
> UTF8 code units:<br>
><br>
> 0x0000 - 0x007F -> 1 code unit<br>
> 0x0080 - 0x07FF -> 2 code units<br>
> 0x0800 - 0xFFFF -> 3 code units<br>
><br>
> This allows you to quickly walk a line of UTF16 code units and get a<br>
> corresponding UTF8 code unit location.<br>
><br>
> The converse is to check the high-order bits of the leading UTF8 code<br>
> unit to see how many to skip over to walk across a single UTF16 code unit.<br>
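<br>
As a rough illustration of the range table above, a walk over UTF-16 code units could look like the following Python sketch. The function names are made up for this example, columns are assumed to be 0-based, and surrogate pairs are ignored exactly as described above (each half would be counted as 3 UTF-8 code units, whereas a real pair encodes to 4).<br>
<pre>
```python
def utf8_len_of_utf16_unit(unit: int) -> int:
    """Number of UTF-8 code units for one UTF-16 code unit (BMP only)."""
    if unit < 0x80:
        return 1
    if unit < 0x800:
        return 2
    # 0x0800 - 0xFFFF; surrogates are deliberately not special-cased here
    return 3


def utf16_col_to_utf8_col(line: str, utf16_col: int) -> int:
    """Map a 0-based UTF-16 column to a 0-based UTF-8 byte column."""
    units = line.encode("utf-16-le")  # two bytes per UTF-16 code unit
    count = min(utf16_col, len(units) // 2)  # clamp to the end of the line
    offset = 0
    for i in range(count):
        unit = int.from_bytes(units[2 * i:2 * i + 2], "little")
        offset += utf8_len_of_utf16_unit(unit)
    return offset
```
</pre>
For a line such as "/* ümlaut */ c++;", utf16_col_to_utf8_col(line, 13) walks past the 'ü' once and so lands one byte further along than the UTF-16 column.<br>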
<br>
Thanks for the input!<br>
<br>
The missing step then for me is an efficient way to access the contents of a<br>
line. With clang-c, the only way I see is a costly clang_tokenize call. Is<br>
there an alternative on the C++ side of clang? I see SourceManager::getCharacterData -<br>
would that be the right API to use? If so, I'll whip up a patch to make this<br>
accessible via clang-c, such that we can build a somewhat efficient mapping<br>
procedure on top of that.<br></blockquote><div><br></div><div>Don't you already have the file as utf-8 so you can hand it into clang? Is there a reason not to get the line out of that format?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
Thanks<br>
<br>
> On 2016-01-24 08:37 AM, Milian Wolff via cfe-dev wrote:<br>
> > Hey all,<br>
> ><br>
> > what would be the best way to get UTF-16 code locations from the clang-c<br>
> > API?<br>
> ><br>
> > As far as I can see it's not currently possible, and I wonder if it would<br>
> > be possible with the C++ API which I could then wrap in a new C function.<br>
> ><br>
> > The reason I'm asking is that we in KDevelop work with QString offsets in<br>
> > the editor, which is internally UTF-16 encoded. Now imagine we parse a<br>
> > UTF-8 encoded text file with the following contents:<br>
> ><br>
> > void foo() {<br>
> ><br>
> > int c = 0;<br>
> > /* ümlaut */ c++;<br>
> ><br>
> > }<br>
> ><br>
> > Any API in clang-c that takes or returns a column will be off-by-one from<br>
> > what we expect from an editor/UTF-16 column point of view, due to the 'ü', which<br>
> > takes up two UTF-8 code units but just one UTF-16 code unit. This<br>
> > breaks our highlighting and code browsing features, but thankfully such<br>
> > input is rare. I'd still like to fix it though if possible and if it<br>
> > doesn't cost too much runtime performance.<br>
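<br>
To make the off-by-one concrete, here is a small Python sketch of the UTF-8 to UTF-16 direction, applied to the sample line above. It is only an illustration: the function name is hypothetical, the column is assumed to be a 0-based byte offset (a 1-based column would need one subtracted first), and the input is assumed to be valid UTF-8.<br>
<pre>
```python
def utf8_col_to_utf16_col(line_utf8: bytes, utf8_col: int) -> int:
    """Map a 0-based UTF-8 byte column to a 0-based UTF-16 column."""
    col16 = 0
    i = 0
    end = min(utf8_col, len(line_utf8))  # clamp to the end of the line
    while i < end:
        lead = line_utf8[i]
        if lead < 0x80:            # 0xxxxxxx: 1 byte, 1 UTF-16 unit
            step, units = 1, 1
        elif lead >> 5 == 0b110:   # 110xxxxx: 2 bytes, 1 UTF-16 unit
            step, units = 2, 1
        elif lead >> 4 == 0b1110:  # 1110xxxx: 3 bytes, 1 UTF-16 unit
            step, units = 3, 1
        else:                      # assumed 4-byte lead: surrogate pair
            step, units = 4, 2
        i += step
        col16 += units
    return col16


line = "/* ümlaut */ c++;".encode("utf-8")
byte_col = line.index(b"c++")  # byte offset of the 'c' after the comment
utf16_col = utf8_col_to_utf16_col(line, byte_col)
```
</pre>
Here byte_col is 14 while utf16_col is 13, which is the one-column shift caused by the two-byte 'ü'.<br>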
> ><br>
> > What is the suggested way of handling this situation? Is there maybe prior<br>
> > art somewhere to efficiently translate between UTF-8/UTF-16 code<br>
> > locations that I could study?<br>
> ><br>
> > Thanks<br>
> ><br>
> ><br>
> > _______________________________________________<br>
> > cfe-dev mailing list<br>
> > <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
> > <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
<br>
<br>
--<br>
Milian Wolff<br>
<a href="mailto:mail@milianw.de" target="_blank">mail@milianw.de</a><br>
<a href="http://milianw.de" rel="noreferrer" target="_blank">http://milianw.de</a><br>
</blockquote></div></div>