[cfe-dev] tabs in the input and their effect on the column position

Tue Mar 12 18:33:58 PDT 2013

Adding an option to include whitespace tokens in clang_tokenize is a valid feature request. Since all we're doing is lexing (not parsing), Clang certainly has this capability; it's just not exposed. Please file a feature request at http://llvm.org/bugs/.

Just to warn you, if output is misaligned for tabs, it could potentially be misaligned for wide characters as well. But perhaps that's not important in your current use case; I'm glad you found a solution that works.
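
To make the tab and multibyte caveat concrete, here is a rough sketch in C of mapping a byte offset within a line to a display column. The name display_column and the tab_width parameter are invented for illustration and are not part of libclang. The sketch assumes valid UTF-8 and one cell per code point, so it still gets double-width and combining characters wrong. Clang's column numbers start at 1, so a caller would pass column - 1 as the byte offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a libclang function): map a 0-based byte offset
 * within one line of UTF-8 text to a 0-based display column.
 * Tabs advance to the next multiple of tab_width; UTF-8 continuation bytes
 * (bit pattern 10xxxxxx) add no width. Every code point is assumed to be
 * one cell wide, which is not true for wide or combining characters. */
size_t display_column(const char *line, size_t byte_offset, size_t tab_width)
{
    size_t col = 0;

    assert(tab_width > 0);
    for (size_t i = 0; i < byte_offset && line[i] != '\0' && line[i] != '\n'; ++i) {
        unsigned char c = (unsigned char)line[i];
        if (c == '\t')
            col += tab_width - (col % tab_width);  /* jump to next tab stop */
        else if ((c & 0xC0) != 0x80)
            col += 1;                              /* first byte of a code point */
    }
    return col;
}
```

For example, with a tab width of 8, the byte at offset 1 in "\tx" lands in display column 8, while the byte after a two-byte character such as U+00E9 lands in display column 1 rather than 2.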

Jordan

On Mar 12, 2013, at 13:57 , Dimitri van Heesch <dimitri at stack.nl> wrote:

> Hi Jordan,
> 
> Thanks for the info.
> 
> It would have been nice if clang_tokenize had included whitespace as tokens (if only as 
> an option). Then one could reconstruct the output purely by looking at the tokens. Without it one 
> needs to extract the whitespace between the tokens from the original file and analyse it to 
> see how it should be rendered in the output. That sounds neither efficient nor convenient.
> 
> Meanwhile I found out how to use CXUnsavedFile to pass a de-tabbed representation of the 
> original file, which works for me. Note that in my case both input and output are UTF-8 
> encoded, so the multi-byte characters are just passed through fine.
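
The in-memory de-tabbing described here could look roughly like the following C sketch. The function name detab and its tab_width parameter are hypothetical, and this is not the code doxygen actually uses. It assumes a NUL-terminated UTF-8 buffer and counts one column per code point when locating tab stops.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of src with every tab
 * expanded to spaces up to the next multiple of tab_width.
 * Returns NULL if allocation fails; the caller frees the result. */
char *detab(const char *src, size_t tab_width)
{
    assert(tab_width > 0);

    size_t len = strlen(src);
    /* Worst case: every input byte is a tab that expands to tab_width spaces. */
    char *dst = malloc(len * tab_width + 1);
    if (dst == NULL)
        return NULL;

    size_t col = 0, out = 0;
    for (size_t i = 0; i < len; ++i) {
        unsigned char c = (unsigned char)src[i];
        if (c == '\t') {
            size_t n = tab_width - (col % tab_width);
            memset(dst + out, ' ', n);
            out += n;
            col += n;
        } else {
            dst[out++] = (char)c;
            if (c == '\n')
                col = 0;                  /* new line: restart column count */
            else if ((c & 0xC0) != 0x80)
                col += 1;                 /* skip UTF-8 continuation bytes */
        }
    }
    dst[out] = '\0';
    return dst;
}
```

The resulting buffer can then be handed to libclang without touching the disk by filling in a CXUnsavedFile (its Filename, Contents, and Length fields) and passing it through the unsaved_files argument of clang_parseTranslationUnit. Token locations then refer to the de-tabbed text, not to the file on disk.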
> 
> Regards,
>  Dimitri
> 
> On Mar 10, 2013, at 8:42 , Jordan Rose <jordan_rose at apple.com> wrote:
> 
>> I want to jump in before anyone starts suggesting solutions, and point out that all libclang output is in terms of bytes from the start of a line. That means that tabs show up as one byte, but it also means that if the source contains multibyte characters, you may have a range of three bytes referring to a single character (and a single column). 
>> 
>> Clang and libclang expect you to interpret their output appropriately for your use case. Perhaps you process tabs to mean "new table cell".
>> 
>> All that said, making the byte/column map machinery in TextDiagnostic more reusable would probably be a good thing all around. LLVM's diagnostics don't handle multibyte characters at all.
>> 
>> Jordan
>> 
>> 
>> On Mar 9, 2013, at 13:06 , Dimitri van Heesch <dimitri at stack.nl> wrote:
>> 
>>> Hi All,
>>> 
>>> I'm currently experimenting with improving doxygen's parsing capabilities by using the information from clang. 
>>> I'm using the libclang functions clang_tokenize and clang_annotateTokens to create hyperlinked and syntax highlighted 
>>> output for the source files processed by doxygen.
>>> 
>>> So far it works quite well, but when the source file contains a tab character this seems to be counted as one character, causing
>>> the output to be misaligned.
>>> 
>>> Is there some way to configure the number of spaces in a tab? Or is there a way to replace tabs with spaces before sending the
>>> contents of a file to libclang, without first having to write the detabbed file to disk?
>>> 
>>> Any help is appreciated.
>>> 
>>> Regards,
>>> Dimitri
>>> 
>>> 