<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    If you ignore the existence of UTF-16 surrogate pairs, the

    mapping is trivial and can be computed very quickly. <br>

    <br>

    Each range of UTF-16 code unit values maps to a fixed

    number of UTF-8 code units:<br>

    <br>

    0x0000 - 0x007F -> 1 code unit<br>

    0x0080 - 0x07FF -> 2 code units<br>

    0x0800 - 0xFFFF -> 3 code units<br>

    <br>

    This lets you quickly walk a line of UTF-16 code units and accumulate the

    corresponding UTF-8 code unit offset.<br>
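    <br>
    A minimal sketch of that walk in C++, under the same assumption that no surrogate pairs occur. The helper names utf8LengthOf and utf16ToUtf8Offset are made up for illustration; they are not part of clang-c or Qt:<br>

```cpp
#include <cstddef>
#include <string>

// Number of UTF-8 code units needed for one UTF-16 code unit,
// following the three ranges listed above. Surrogates are deliberately
// not special-cased, matching the simplifying assumption in the text.
std::size_t utf8LengthOf(char16_t cu)
{
    if (cu <= 0x007F) return 1;
    if (cu <= 0x07FF) return 2;
    return 3;
}

// Hypothetical helper: maps a 0-based UTF-16 column (counted in code
// units) to the 0-based UTF-8 offset within the same line.
std::size_t utf16ToUtf8Offset(const std::u16string& line, std::size_t utf16Column)
{
    std::size_t utf8Offset = 0;
    for (std::size_t i = 0; i < utf16Column && i < line.size(); ++i)
        utf8Offset += utf8LengthOf(line[i]);
    return utf8Offset;
}
```

    The cost is linear in the column, so it is probably cheap enough to run per line on demand rather than caching a table.<br>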

    <br>

    For the converse, check the high-order bits of each leading UTF-8

    code unit to see how many code units to skip in order to advance by a single UTF-16

    code unit.<br>
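    <br>
    The reverse direction could look roughly like this, again only a sketch. utf8SequenceLength and utf8ToUtf16Column are illustrative names, and 4-byte UTF-8 sequences (which would become surrogate pairs in UTF-16) are assumed not to occur:<br>

```cpp
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence, judged from the high-order bits of its
// leading code unit.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    // Anything else (a 4-byte lead or a stray continuation byte) is
    // outside this sketch's assumption; step one byte so the caller's
    // loop always advances.
    return 1;
}

// Hypothetical helper: maps a 0-based UTF-8 offset within a line to
// the 0-based UTF-16 column, counting one column per UTF-8 sequence.
std::size_t utf8ToUtf16Column(const std::string& line, std::size_t utf8Offset)
{
    std::size_t column = 0;
    std::size_t i = 0;
    while (i < utf8Offset && i < line.size()) {
        i += utf8SequenceLength(static_cast<unsigned char>(line[i]));
        ++column;
    }
    return column;
}
```

    An offset that points into the middle of a multi-byte sequence is rounded up to the column after that character in this version; whether that is the right policy depends on the caller.<br>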

    <br>

     - ½<br>

    <br>

    <div class="moz-cite-prefix">On 2016-01-24 08:37 AM, Milian Wolff

      via cfe-dev wrote:<br>

    </div>

    <blockquote cite="mid:39300928.avIgn78Ya0@agathebauer" type="cite">

      <pre wrap="">Hey all,

what would be the best way to get UTF-16 code locations from the clang-c API?

As far as I can see it's not currently possible, and I wonder if it would be 

possible with the C++ API which I could then wrap in a new C function.

The reason I'm asking is that we in KDevelop work with QString offsets in the 

editor, which is internally UTF-16 encoded. Now imagine we parse a UTF-8 

encoded text file with the following contents:

void foo() {

  int c = 0;

  /* ümlaut */ c++;

}

Any API in clang-c that takes or returns a column will be off-by-one from what 

we expect from an editor/UTF-16 column pov, due to the 'ü' which takes up two 

UTF-8 code units but just one UTF-16 code unit. This breaks our highlighting 

and code browsing features, but thankfully such input is rare. I'd still like 

to fix it though if possible and if it doesn't cost too much runtime 

performance.

What is the suggested way of handling this situation? Is there maybe prior art 

somewhere to efficiently translate between UTF-8/UTF-16 code locations that I 

could study?

Thanks

</pre>


    </blockquote>

    <br>

  </body>

</html>