[cfe-dev] tabs in the input and their effect on the column position
Jordan Rose
jordan_rose at apple.com
Tue Mar 12 18:33:58 PDT 2013
Adding an option to include whitespace tokens in clang_tokenize is a valid feature request. Since all we're doing is lexing (not parsing), Clang certainly has this capability; it's just not exposed. Please file a feature request at http://llvm.org/bugs/.
Just to warn you, if output is misaligned for tabs, it could potentially be misaligned for wide characters as well. But perhaps that's not important in your current use case; I'm glad you found a solution that works.
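To make the tab and multibyte caveat concrete, here is a minimal sketch of turning a byte-based column (such as the one libclang returns from clang_getSpellingLocation) into a display column. The helper name display_column and the tab width of 8 are assumptions for illustration, not part of libclang. It treats every code point as one cell, so double-width (e.g. East Asian) and combining characters would still be off.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not a libclang API: map a 1-based byte column to a
// 1-based display column. Assumes `line` is valid UTF-8 without its trailing
// newline, one cell per code point, and tab stops every `tab_width` cells.
std::size_t display_column(const std::string& line, std::size_t byte_column,
                           std::size_t tab_width = 8) {
    assert(tab_width > 0 && byte_column >= 1);
    std::size_t cells = 0;  // cells used by everything before the target byte
    for (std::size_t i = 0; i + 1 < byte_column && i < line.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(line[i]);
        if (c == '\t') {
            cells += tab_width - (cells % tab_width);  // jump to next tab stop
        } else if ((c & 0xC0) != 0x80) {
            ++cells;  // count lead bytes only; skip UTF-8 continuation bytes
        }
    }
    return cells + 1;
}
```

For a line consisting of a tab followed by `x`, byte column 2 maps to display column 9 with the default tab width.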
Jordan
On Mar 12, 2013, at 13:57 , Dimitri van Heesch <dimitri at stack.nl> wrote:
> Hi Jordan,
>
> Thanks for the info.
>
> It would have been nice if clang_tokenize had included whitespace as tokens (if only as
> an option). Then one could reconstruct the output purely by looking at the tokens. Without it, one
> needs to extract the whitespace between the tokens in the original file and analyse it to
> see how it should be rendered in the output. That sounds neither efficient nor convenient.
>
> Meanwhile I found out how to use CXUnsavedFile to pass a de-tabbed representation of the
> original file, which works for me. Note that in my case both input and output are UTF-8
> encoded, so the multi-byte characters are just passed through fine.
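The de-tabbing step described above could look roughly like the sketch below. The function name expand_tabs and the default tab width are assumptions; the original message does not show how the de-tabbed buffer was produced. The code assumes valid UTF-8 input and counts one cell per code point when locating tab stops.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copy `text`, replacing each tab with enough spaces to
// reach the next tab stop. Multibyte UTF-8 sequences are passed through
// unchanged and counted as a single cell.
std::string expand_tabs(const std::string& text, std::size_t tab_width = 8) {
    assert(tab_width > 0);
    std::string out;
    out.reserve(text.size());
    std::size_t cells = 0;  // display cells used so far on the current line
    for (char ch : text) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c == '\t') {
            std::size_t pad = tab_width - (cells % tab_width);
            out.append(pad, ' ');
            cells += pad;
        } else {
            out.push_back(ch);
            if (c == '\n') {
                cells = 0;  // a new line restarts the tab stops
            } else if ((c & 0xC0) != 0x80) {
                ++cells;  // skip UTF-8 continuation bytes
            }
        }
    }
    return out;
}
```

The resulting buffer can be handed to libclang without touching the disk by filling a CXUnsavedFile (Filename set to the original path, Contents and Length taken from the string's data() and size()) and passing it to clang_parseTranslationUnit. Columns reported by libclang then refer to the de-tabbed text, still counted in bytes.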
>
> Regards,
> Dimitri
>
> On Mar 10, 2013, at 8:42 , Jordan Rose <jordan_rose at apple.com> wrote:
>
>> I want to jump in before anyone starts suggesting solutions, and point out that all libclang output is in terms of bytes from the start of a line. That means that tabs show up as one byte, but it also means that if the source contains multibyte characters, you may have a range of three bytes referring to a single character (and a single column).
>>
>> Clang and libclang expect you to interpret their output appropriately for your use case. Perhaps you process tabs to mean "new table cell".
>>
>> All that said, making the byte/column map machinery in TextDiagnostic more reusable would probably be a good thing all around. LLVM's diagnostics don't handle multibyte characters at all.
>>
>> Jordan
>>
>>
>> On Mar 9, 2013, at 13:06 , Dimitri van Heesch <dimitri at stack.nl> wrote:
>>
>>> Hi All,
>>>
>>> I'm currently experimenting with improving doxygen's parsing capabilities by using the information from clang.
>>> I'm using the libclang functions clang_tokenize and clang_annotateTokens to create hyperlinked and syntax highlighted
>>> output for the source files processed by doxygen.
>>>
>>> So far it works quite well, but when the source file contains a tab character this seems to be counted as one character, causing
>>> the output to be misaligned.
>>>
>>> Is there some way to configure the number of spaces in a tab? Or is there a way to replace tabs with spaces before sending the
>>> contents of a file to libclang, without first having to write the detabbed file to disk?
>>>
>>> Any help is appreciated.
>>>
>>> Regards,
>>> Dimitri
>>>
>>>
>>
>