[cfe-dev] UTF-8 vs. UTF-16 code locations

Manuel Klimek via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 25 10:08:21 PST 2016


On Mon, Jan 25, 2016 at 6:52 PM Milian Wolff <mail at milianw.de> wrote:

> On Monday, January 25, 2016 5:40:01 PM CET Manuel Klimek wrote:
> > On Mon, Jan 25, 2016 at 5:58 PM Milian Wolff <mail at milianw.de> wrote:
> > > On Monday, January 25, 2016 4:49:39 PM CET Manuel Klimek wrote:
> > > > On Mon, Jan 25, 2016 at 5:45 PM Milian Wolff via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> > > > > > On Monday, January 25, 2016 11:25:22 AM CET Halfdan Ingvarsson wrote:
> > > > > > > If you ignore the existence of UTF16 surrogate pairs, then the
> > > > > > > mapping is quite trivial and can be done very quickly.
> > > > > > >
> > > > > > > E.g. Certain range blocks of UTF16 code units map to a certain
> > > > > > > number of UTF8 code units:
> > > > > > >
> > > > > > > 0x0000 - 0x007F -> 1 code unit
> > > > > > > 0x0080 - 0x07FF -> 2 code units
> > > > > > > 0x0800 - 0xFFFF -> 3 code units
> > > > > > >
> > > > > > > This allows you to quickly walk a line of UTF16 code units and
> > > > > > > get a corresponding UTF8 code unit location.
> > > > > > >
> > > > > > > The converse is to check the high-order bits of the leading UTF8
> > > > > > > code unit to see how many to skip over to walk across a single
> > > > > > > UTF16 code unit.
> > > > >
> > > > > Thanks for the input!
> > > > >
> > > > > > The missing step then for me is an efficient way to access the
> > > > > > contents of a line. With clang-c, the only way I see is a costly
> > > > > > clang_tokenize call. Is there one on the C++ side of clang? I see
> > > > > > SourceManager::getCharacterData - would that be the right API to
> > > > > > use? If so, I'll whip up a patch to make this accessible via
> > > > > > clang-c, such that we can build a somewhat efficient mapping
> > > > > > procedure on top of that.
> > > >
> > > > Don't you already have the file as utf-8 so you can hand it into
> > > > clang? Is there a reason not to get the line out of that format?
> > >
> > > Only those files I pass in via CXUnsavedFile I have access to. All
> > > others are opened directly by clang. Considering that Clang already
> > > has access to the string contents of any file in the TU, that seems
> > > like the best approach for me to access it, no?
> > >
> > > From my quick glance over
> > > https://code.woboq.org/llvm/clang/include/clang/Basic/SourceManager.h.html#clang::SourceManager
> > > I see the following potential candidates:
> > >   SourceManager::getBufferData
> > >   SourceManager::getCharacterData
> > >   SourceManager::getBuffer + MemoryBuffer API
> > >
> > > Wouldn't those fill the gap? Or do you think I (and any other IDE)
> > > should duplicate the code to find the contents of a given CXFile
> > > inside the TU, based on either the CXUnsavedFile or an mmapped file
> > > from disk?
> >
> > I'd have expected that you just read the files from disk yourself. I'd
> > expect that to give fewer different code paths to do the same thing, so
> > I'd hope it reduces complexity. But in reality I have no idea what I'm
> > talking about as I don't know your codebase :)  I don't think that those
> > design decisions can or should be made for all IDEs, so I'm not sure what
> > other IDEs do is really relevant.
>
> We do read the file ourselves when the user opens a file in the editor, but
> that is only a small fraction of those files that get parsed via
> clang_parseTranslationUnit2. The majority of files will be read directly
> from disk by clang itself. The results we obtain from traversing the AST
> are then cached; most notably this stores ranges that need to be
> highlighted if a file gets opened eventually.
>
> So that said, would you object to making any of the SourceManager::* API
> public via a new clang-c function? Assuming of course they do what I expect
> them to do, i.e. give me access to the file buffer (at a given position)
> that clang saw while parsing the TU? It would certainly make this task more
> efficient to implement for us.
>

I'm probably still missing something: don't you only need to load the file
if there's a result mentioning it and you want the user to open it?
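
For reference, once you have the text of a line in hand, the block mapping
Halfdan describes further up could look roughly like the sketch below. The
function names are made up for illustration and are not part of clang or
clang-c. The sketch assumes well-formed input and offsets that fall on
character boundaries. As in the quoted description, the UTF-16 -> UTF-8
direction ignores surrogate pairs, so positions after a non-BMP character
will come out too large.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 code units for one UTF-16 code unit,
// following the range table quoted above. Surrogates (0xD800-0xDFFF) are
// deliberately not special-cased and are counted as 3 units each.
inline std::size_t utf8UnitsFor(char16_t unit) {
  if (unit <= 0x007F) return 1;
  if (unit <= 0x07FF) return 2;
  return 3;
}

// Hypothetical helper: UTF-16 column (code-unit index within the line)
// to UTF-8 offset (byte index within the same line).
inline std::size_t utf16ColumnToUtf8Offset(const std::u16string &line,
                                           std::size_t column) {
  std::size_t offset = 0;
  for (std::size_t i = 0; i < column && i < line.size(); ++i)
    offset += utf8UnitsFor(line[i]);
  return offset;
}

// Hypothetical helper: UTF-8 offset to UTF-16 column. The high-order bits of
// each leading byte say how many bytes to skip. A 4-byte sequence encodes a
// non-BMP character, which occupies two UTF-16 code units.
inline std::size_t utf8OffsetToUtf16Column(const std::string &line,
                                           std::size_t offset) {
  std::size_t column = 0;
  std::size_t i = 0;
  while (i < offset && i < line.size()) {
    const unsigned char lead = static_cast<unsigned char>(line[i]);
    std::size_t length = 1;   // ASCII, or a stray byte in malformed input
    std::size_t utf16Units = 1;
    if ((lead & 0xE0) == 0xC0) {
      length = 2;
    } else if ((lead & 0xF0) == 0xE0) {
      length = 3;
    } else if ((lead & 0xF8) == 0xF0) {
      length = 4;
      utf16Units = 2;
    }
    column += utf16Units;
    i += length;
  }
  return column;
}
```

Both walks are linear in the length of the line, so they are cheap as long as
you only run them for lines that actually contain a result.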



>
> Thanks
> --
> Milian Wolff
> mail at milianw.de
> http://milianw.de
>