[cfe-dev] UTF-8 vs. UTF-16 code locations

Joachim Durchholz via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 25 13:54:06 PST 2016


On 25.01.2016 at 21:18, Milian Wolff wrote:
> On Monday, 25 January 2016 20:39:44 CET Joachim Durchholz via cfe-dev wrote:
>> On 25.01.2016 at 19:23, Milian Wolff via cfe-dev wrote:
>>> This is done without ever loading any file in an editor. But we do run a
>>> lot of clang_parseTranslationUnit2 calls which will internally open files
>>> from disk. Then we visit the AST and get e.g. the position for a class
>>> declaration. In order to convert that position, assuming the file is
>>> UTF-8 encoded, I want to translate it to a UTF-16 position.
>>
>> Can't you convert to UTF-16 during load? Then you don't need to
>> translate at all.
>> I'm under the impression that you are keeping a UTF-8 data blob in an
>> environment that mostly talks UTF-16; in that case, the cleanest
>> solution would be to have the data blob in UTF-16, too. Of course, I
>> don't know how much of your code base you'd have to touch to change
>> that; this could be quite nasty or surprisingly easy.
>
> What data blob are you referring to? I have the feeling we are talking past
> each other in this discussion ;-)

Then I'm either not seeing, or I missed, where you have UTF-8 data.

> on one hand I have:
>
> for every file in given directory
> 	call clang_parseTranslationUnit
> 	traverse resulting AST
> 		for every interesting cursor
> 			store range of this cursor
>
> The data blob we cache is a range[start(line, column), end(line, column)].

I understand that the column index values here may be wrong.
Is that correct?

> The large code base expects these to be UTF-16 column offsets.

Okay, then we're in the same boat with this.

> Assuming the file is
> encoded in UTF-8 on disk, then this is what I'll get from clang-c.

Does clang-c treat each byte of a multibyte UTF-8 sequence as a
character taking up its own column?
In that case, all you need to do is file a bug :-)
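
To make the conversion under discussion concrete, here is a minimal sketch
in C. It assumes that the column coming out of clang-c is a 1-based byte
offset into a well-formed UTF-8 line (I have not verified that), and that
you have the bytes of that line at hand. The function name
utf8_col_to_utf16_col is made up for illustration; it is not part of
libclang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a libclang function.
 * Maps a 1-based byte column in a UTF-8 encoded line to the 1-based
 * UTF-16 column of the same position.
 * Assumptions: `line` is well-formed UTF-8, and byte_col points at the
 * first byte of a code point (or one past the end of the line). */
unsigned utf8_col_to_utf16_col(const char *line, size_t line_len,
                               unsigned byte_col)
{
    assert(line != NULL && byte_col >= 1);

    /* Number of bytes that precede the requested column. */
    size_t limit = (size_t)byte_col - 1;
    if (limit > line_len)
        limit = line_len;

    unsigned units = 0;
    for (size_t i = 0; i < limit; ++i) {
        unsigned char c = (unsigned char)line[i];
        /* Every byte that is not a continuation byte (10xxxxxx)
         * starts a code point, which needs at least one UTF-16 unit. */
        if ((c & 0xC0) != 0x80)
            units += 1;
        /* A lead byte of a 4-byte sequence encodes a code point above
         * U+FFFF, which needs a surrogate pair: one extra unit. */
        if (c >= 0xF0)
            units += 1;
    }
    return units + 1; /* back to a 1-based column */
}
```

For a line containing "a", "ä", "€", an emoji outside the BMP, and "x",
the byte columns 1, 2, 4, 7, 11 map to the UTF-16 columns 1, 2, 3, 4, 6.
The catch is that this needs the raw bytes of the line, which is exactly
the buffer access being asked about.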

I just stumbled upon clang::SourceManager::createExpansionLoc. I don't
know how this will interact with macro expansion, though.

> For that
> reason I'd like to convert it at this point. An API in clang-c for efficient
> access to the underlying UTF-8 buffer of a given CXFile would help a lot for
> that purpose (and in other scenarios we currently (ab)use clang_tokenize to
> stringify a range).

Hmm... I guess that would be exposing data structures that are currently
internal. I have no idea how much the clang folks will like that; maybe
there were plans to open that up anyway, or maybe they don't want to
because there are other plans.
Oh. There is already clang::SourceManager::getMemoryBufferForFile.

> So what I'm asking, again, is whether an API such as the following would be
> acceptable:
>
> CXString clang_getRangeSpelling(CXSourceRange range);

Somebody with more knowledge will have to answer that.
(I generally know a lot more about Unicode than about clang; my clang 
knowledge is very, very basic.)


