[cfe-dev] UTF-8 vs. UTF-16 code locations

Mon Jan 25 10:23:26 PST 2016

On Monday, January 25, 2016 6:08:21 PM CET Manuel Klimek wrote:
> On Mon, Jan 25, 2016 at 6:52 PM Milian Wolff <mail at milianw.de> wrote:
> > On Monday, January 25, 2016 5:40:01 PM CET Manuel Klimek wrote:
> > > On Mon, Jan 25, 2016 at 5:58 PM Milian Wolff <mail at milianw.de> wrote:
> > > > On Monday, January 25, 2016 4:49:39 PM CET Manuel Klimek wrote:
> > > > > On Mon, Jan 25, 2016 at 5:45 PM Milian Wolff via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> > > > > > On Monday, January 25, 2016 11:25:22 AM CET Halfdan Ingvarsson wrote:
> > > > > > > If you ignore the existence of UTF16 surrogate pairs, then the mapping
> > > > > > > is quite trivial and can be done very quickly.
> > > > > > >
> > > > > > > E.g. Certain range blocks of UTF16 code units map to a certain number of
> > > > > > > UTF8 code units:
> > > > > > >
> > > > > > > 0x0000 - 0x007F -> 1 code unit
> > > > > > > 0x0080 - 0x07FF -> 2 code units
> > > > > > > 0x0800 - 0xFFFF -> 3 code units
> > > > > > >
> > > > > > > This allows you to quickly walk a line of UTF16 code units and get a
> > > > > > > corresponding UTF8 code unit location.
> > > > > > >
> > > > > > > The converse is to check the high-order bits of the leading UTF8 code
> > > > > > > unit to see how many to skip over to walk across a single UTF16 code
> > > > > > > unit.
> > > > > > 
> > > > > > Thanks for the input!
> > > > > > 
> > > > > > The missing step then for me is an efficient way to access the contents
> > > > > > of a line. With clang-c, the only way I see is a costly clang_tokenize
> > > > > > call. Is there an alternative on the C++ side of clang? I see
> > > > > > SourceManager::getCharacterData - would that be the right API to use?
> > > > > > If so, I'll whip up a patch to make this accessible via clang-c, such
> > > > > > that we can build a somewhat efficient mapping procedure on top of that.
> > > > > 
> > > > > Don't you already have the file as utf-8 so you can hand it into clang?
> > > > > Is there a reason not to get the line out of that format?
> > > > 
> > > > I only have access to those files I pass in via CXUnsavedFile. All
> > > > others are opened directly by clang. Considering that Clang already has
> > > > access to the string contents of any file in the TU, that seems like the
> > > > best approach for me to access it, no?
> > > >
> > > > From my quick glance over
> > > > https://code.woboq.org/llvm/clang/include/clang/Basic/SourceManager.h.html#clang::SourceManager
> > > > I see the following potential candidates:
> > > >   SourceManager::getBufferData
> > > >   SourceManager::getCharacterData
> > > >   SourceManager::getBuffer + MemoryBuffer API
> > > >
> > > > Wouldn't those fill the gap? Or do you think I (and any other IDE) should
> > > > duplicate the code to find the contents of a given CXFile inside the TU,
> > > > based on either the CXUnsavedFile or an mmapped file from disk?
> > > 
> > > I'd have expected that you just read the files from disk yourself. I'd
> > > expect that to give fewer different code paths to do the same thing, so
> > > I'd hope it reduces complexity. But in reality I have no idea what I'm
> > > talking about as I don't know your codebase :)  I don't think that those
> > > design decisions can or should be made for all IDEs, so I'm not sure what
> > > other IDEs do is really relevant.
> > 
> > We do read the file ourselves when the user opens a file in the editor,
> > but that is only a small fraction of those files that get parsed via
> > clang_parseTranslationUnit2. The majority of files will be read directly
> > from disk by clang itself. The results we obtain from traversing the AST
> > are then cached; most notably, the cache stores ranges that need to be
> > highlighted if a file gets opened eventually.
> >
> > So that said, would you object to making any of the SourceManager::* API
> > public via a new clang-c function? Assuming of course they do what I
> > expect them to do, i.e. give me access to the file buffer (at a given
> > position) that clang saw while parsing the TU? It would certainly make
> > this task more efficient to implement for us.
> 
> I'm probably still missing something: don't you only need to load the file
> if there's a result mentioning it and you want the user to open it?

Yes, but in KDevelop we parse all files of a project and cache the results.
Once you open a file, it will show the results from that cache. We also use
the locations in our cache for code browsing through the whole project, e.g.
to jump to where classes are defined.

This is done without ever loading any file in an editor. But we do run a lot
of clang_parseTranslationUnit2 calls, which will internally open files from
disk. Then we visit the AST and get e.g. the position for a class declaration.
Assuming the file is UTF-8 encoded, I want to translate that position to a
UTF-16 position. For that I'd need efficient access to either the full file
buffer or, nicer even, the line buffer for this position.
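
To make the mapping concrete, here is a minimal sketch of the column
conversion in both directions, assuming the line buffer is already available
and holds well-formed UTF-8. The helper names (utf8SequenceLength,
utf8ToUtf16Column, utf16ToUtf8Column) are made up for illustration and are not
part of clang or KDevelop. Unlike the simplified table quoted above, it counts
a 4-byte UTF-8 sequence as two UTF-16 code units (a surrogate pair).

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Assumption: the input is well-formed; a stray continuation byte is
// treated as a sequence of length 1 so that the walk always advances.
inline std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;           // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;
}

// Hypothetical helper: map a byte column in a UTF-8 line to the
// corresponding UTF-16 code unit column.
inline std::size_t utf8ToUtf16Column(const std::string &line,
                                     std::size_t byteColumn) {
    std::size_t utf16Column = 0;
    std::size_t i = 0;
    while (i < byteColumn && i < line.size()) {
        const std::size_t len =
            utf8SequenceLength(static_cast<unsigned char>(line[i]));
        // 1- to 3-byte sequences are one UTF-16 code unit,
        // 4-byte sequences become a surrogate pair (two code units).
        utf16Column += (len == 4) ? 2 : 1;
        i += len;
    }
    return utf16Column;
}

// Hypothetical helper: map a UTF-16 code unit column to the corresponding
// UTF-8 byte column, using the range table from the quoted mail.
inline std::size_t utf16ToUtf8Column(const std::u16string &line,
                                     std::size_t utf16Column) {
    std::size_t byteColumn = 0;
    for (std::size_t i = 0; i < utf16Column && i < line.size(); ++i) {
        const char16_t c = line[i];
        if (c < 0x0080) {
            byteColumn += 1;
        } else if (c < 0x0800) {
            byteColumn += 2;
        } else if (c >= 0xD800 && c <= 0xDFFF) {
            // Each half of a surrogate pair accounts for 2 of the
            // 4 UTF-8 bytes of the encoded code point.
            byteColumn += 2;
        } else {
            byteColumn += 3;
        }
    }
    return byteColumn;
}
```

Both walks are linear in the length of the line, which is why access to just
the line buffer, rather than a tokenized range, would be sufficient here.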

Such an API that gives us direct access to the file/line buffer would also 
allow us to remove some other places where we currently have to use 
clang_tokenize for manual stringification of a range.
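
As a rough illustration of what such buffer access would enable, the sketch
below slices a range and a line out of a file buffer by byte offset. It
assumes the buffer is the file contents as clang saw them and that the offsets
are byte offsets such as the one clang_getSpellingLocation reports; the names
stringifyRange and lineContaining are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the text in [beginOffset, endOffset) of
// `buffer`. Out-of-range input yields an empty or clamped result.
inline std::string stringifyRange(const std::string &buffer,
                                  std::size_t beginOffset,
                                  std::size_t endOffset) {
    if (beginOffset > buffer.size() || endOffset < beginOffset)
        return {};
    if (endOffset > buffer.size())
        endOffset = buffer.size();
    return buffer.substr(beginOffset, endOffset - beginOffset);
}

// Hypothetical helper: return the line (without the trailing newline)
// that contains the byte at `offset`.
inline std::string lineContaining(const std::string &buffer,
                                  std::size_t offset) {
    if (offset > buffer.size())
        return {};
    // Find the newline that ends the previous line, if there is one.
    const std::size_t prev =
        offset == 0 ? std::string::npos : buffer.rfind('\n', offset - 1);
    const std::size_t begin = prev == std::string::npos ? 0 : prev + 1;
    // Find the newline that ends the current line, if there is one.
    std::size_t end = buffer.find('\n', offset);
    if (end == std::string::npos)
        end = buffer.size();
    return buffer.substr(begin, end - begin);
}
```

With something like this, stringifying a cursor extent becomes a substring
operation instead of a clang_tokenize call.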

Thanks
-- 
Milian Wolff
mail at milianw.de
http://milianw.de