[cfe-dev] UTF-8 vs. UTF-16 code locations

Joerg Sonnenberger via cfe-dev cfe-dev at lists.llvm.org
Sun Jan 24 06:22:44 PST 2016


On Sun, Jan 24, 2016 at 02:37:38PM +0100, Milian Wolff via cfe-dev wrote:
> The reason I'm asking is that we in KDevelop work with QString offsets in the 
> editor, which is internally UTF-16 encoded. Now imagine we parse an UTF-8 
> encoded text file with the following contents:

If your input is UTF-8 and you handle it internally as UTF-16, you will
need to keep a mapping table. Both UTF-8 and UTF-16 are variable-width
encodings (i.e. a given Unicode code point can map to a varying number
of UTF-8 or UTF-16 code units), so no fixed formula converts an offset
in one into an offset in the other. You don't necessarily have to map
every character, though. Since both encodings are essentially stateless
(the start of a code point can be recognized without earlier context),
it is enough to record the mapping at an earlier starting point and
decode forward from there.
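
To illustrate the idea, here is a minimal sketch of such a sparse mapping
table. It assumes the input is valid UTF-8 and that offsets fall on code
point boundaries. The names OffsetMap, to_utf16 and stride are made up for
this example and are not part of clang or Qt.

```python
import bisect


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def utf16_units(lead_byte: int) -> int:
    """UTF-16 code units needed for the code point starting with lead_byte."""
    # A 4-byte UTF-8 sequence (lead byte 0xF0 or above) encodes a code point
    # beyond U+FFFF, which takes a surrogate pair in UTF-16.
    return 2 if lead_byte >= 0xF0 else 1


class OffsetMap:
    """Hypothetical sparse table mapping UTF-8 byte offsets to UTF-16 offsets."""

    def __init__(self, data: bytes, stride: int = 64):
        self.data = data
        # Parallel lists of checkpoints, both sorted ascending.
        self.byte_offsets = [0]
        self.utf16_offsets = [0]
        utf16 = 0
        next_checkpoint = stride
        for i, b in enumerate(data):
            if is_continuation(b):
                continue
            # Only record checkpoints at the start of a code point.
            if i >= next_checkpoint:
                self.byte_offsets.append(i)
                self.utf16_offsets.append(utf16)
                next_checkpoint = i + stride
            utf16 += utf16_units(b)

    def to_utf16(self, byte_offset: int) -> int:
        """Map a byte offset on a code point boundary to a UTF-16 offset."""
        # Find the last checkpoint at or before byte_offset ...
        k = bisect.bisect_right(self.byte_offsets, byte_offset) - 1
        utf16 = self.utf16_offsets[k]
        # ... and decode forward from there, counting code point starts.
        for b in self.data[self.byte_offsets[k]:byte_offset]:
            if not is_continuation(b):
                utf16 += utf16_units(b)
        return utf16
```

The stride trades memory for lookup time: a larger stride stores fewer
checkpoints, but each lookup has to scan more bytes.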

Joerg


