[cfe-dev] UTF-8 vs. UTF-16 code locations

Milian Wolff via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 25 12:18:47 PST 2016


On Monday, 25 January 2016 20:39:44 CET Joachim Durchholz via cfe-dev wrote:
> On 25.01.2016 at 19:23, Milian Wolff via cfe-dev wrote:
> > This is done without ever loading any file in an editor. But we do run a
> > lot of clang_parseTranslationUnit2 calls which will internally open files
> > from disk. Then we visit the AST and get e.g. the position for a class
> > declaration. In order to convert that position, assuming the file is
> > UTF-8 encoded, I want to translate it to a UTF-16 position.
> 
> Can't you convert to UTF-16 during load? Then you don't need to
> translate at all.
> I'm under the impression that you are keeping a UTF-8 data blob in an
> environment that mostly talks UTF-16; in that case, the cleanest
> solution would be to have the data blob in UTF-16, too. Of course I
> don't know how much of your code base you'd have to touch to change
> that, this could be quite nasty or surprisingly easy.

What data blob are you referring to? I have the feeling we are talking past 
each other in this discussion ;-)

What I have is:

for every file in given directory
	call clang_parseTranslationUnit
	traverse resulting AST
		for every interesting cursor
			store range of this cursor

The data blob we cache is a range[start(line, column), end(line, column)]. The
large code base expects these to be UTF-16 column offsets. If the file is
encoded in UTF-8 on disk, however, clang-c gives me UTF-8 column offsets,
which is why I'd like to convert them at this point. An API in clang-c for
efficient access to the underlying UTF-8 buffer of a given CXFile would help
a lot for that purpose (in other scenarios we currently (ab)use
clang_tokenize to stringify a range).
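
To make the conversion step concrete, here is a minimal sketch of what I would
do once the UTF-8 bytes of a line are available. It assumes well-formed UTF-8,
1-based columns counted in bytes on input, and 1-based columns counted in
UTF-16 code units on output. The function name utf8_col_to_utf16_col is
hypothetical and not part of clang-c.

```c
#include <stddef.h>

/* Convert a 1-based UTF-8 byte column into a 1-based UTF-16 code-unit column.
 * `line` points at the first byte of the line and `len` is its length in
 * bytes. Assumption: the input is well-formed UTF-8; no validation is done. */
static unsigned utf8_col_to_utf16_col(const char *line, size_t len,
                                      unsigned byte_col)
{
    unsigned utf16_col = 1;
    /* Walk over every byte that lies before the requested column. */
    for (size_t i = 0; i < len && i + 1 < byte_col; ++i) {
        unsigned char c = (unsigned char)line[i];
        if ((c & 0xC0) == 0x80)
            continue; /* continuation byte: belongs to the previous character */
        /* A 4-byte sequence (lead byte >= 0xF0) encodes a code point outside
         * the BMP, which takes a surrogate pair (two code units) in UTF-16. */
        utf16_col += (c >= 0xF0) ? 2 : 1;
    }
    return utf16_col;
}
```

For pure ASCII lines the two columns are identical, so the walk only changes
the result on lines that contain multi-byte sequences.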

So what I'm asking, again, is whether an API such as the following would be 
acceptable:

CXString clang_getRangeSpelling(CXSourceRange range);

Thanks

-- 
Milian Wolff
mail at milianw.de
http://milianw.de