[PATCH] [2/6] Convert non-printing characters to their octal sequence before emitting #line directive or __FILE__ macro
Arthur O'Dwyer
arthur.j.odwyer at gmail.com
Wed Sep 11 13:35:02 PDT 2013
On Wed, Sep 11, 2013 at 12:07 PM, Gao, Yunzhong
<yunzhong_gao at playstation.sony.com> wrote:
>> > This is part of a bigger effort to support foreign characters in file names.
>
> This sentence is merely to give a background of this patch, which is only the second of six patches...
> I am particularly interested in supporting Japanese shift-jis encoding (windows code page 932)
> on Windows. On these systems, #include directives will use UTF-8 encoding but file names on
> command prompt will use shift-jis encoding. Both will be translated to UTF-16/unicode before
> making system calls to the underlying file system.
If #include directives will use UTF-8, then __FILE__ must also use
UTF-8, so that this will work:
#include __FILE__
And I would expect #line directives also to use UTF-8. The only good
rationale I can imagine is that you're dealing with badly behaved
third-party generators such as lex/yacc which dump malformed #line
directives into the source file.
The patch looks good to me, but the stated rationale is misleading; I
don't think this patch helps with anything on a well-behaved system
(even one where the filesystem charset is Shift-JIS). It merely helps
Clang not-barf on malformed input (such as that produced by a badly
behaved lex/yacc).
my $.02,
-Arthur
>> > When clang generates a preprocessed file, it saves current file name
>> > into a #line directive or __FILE__ macro. But if the file name
>> > contains escaped octal sequences, clang tries to convert the
>> > characters, which may be non-utf8, which then triggers diagnostics like this:
>> > ```
>> > /* test.c */
>> > #line 5 "\202\261\202\361\202\311\202\277\202\315.c"
>> > /* end of test.c */
>> > ```
>> > $ clang -S -save-temps test.c
>> > test.c:1:6: warning: illegal character encoding in string literal
>> > [-Winvalid-source-encoding]
>> >
>> > Clang does not really have to convert the characters; it could have
>> > just saved the escaped sequence in the preprocessed output. The
>> > proposed patch attempts to convert any non-printing characters to
>> > their corresponding escaped octal sequence before printing out. This
>> > is part of a bigger effort to support foreign characters in file names.
>> >
>> > Could someone take a look whether the proposed patch is good to go in?
>> >
>> > Many thanks,
>> > - Gao
>> >
>> > http://llvm-reviews.chandlerc.com/D1291
>> >
>> > Files:
>> > include/clang/Lex/Lexer.h
>> > lib/Frontend/PrintPreprocessedOutput.cpp
>> > lib/Lex/Lexer.cpp
>> > lib/Lex/PPMacroExpansion.cpp
>> > test/Preprocessor/line-directive-output.c
>> >
>> > Index: include/clang/Lex/Lexer.h
>> >
>> ==========================================================
>> =========
>> > --- include/clang/Lex/Lexer.h
>> > +++ include/clang/Lex/Lexer.h
>> > @@ -236,6 +236,11 @@
>> > /// and " characters. This does not add surrounding ""'s to the string.
>> > static void Stringify(SmallVectorImpl<char> &Str);
>> >
>> > + /// StringifyWithAddedEscape - Convert the specified string into a
>> > + C string /// and convert any non-printable characters to escaped
>> > + octal sequence. This /// does not add surrounding quotes to the string.
>> > + static void StringifyWithAddedEscape(SmallVectorImpl<char> &Str);
>> > +
>> >
>> > /// getSpelling - This method is used to get the spelling of a token into a
>> > /// preallocated buffer, instead of as an std::string. The caller
>> > is required
>> > Index: lib/Frontend/PrintPreprocessedOutput.cpp
>> >
>> ==========================================================
>> =========
>> > --- lib/Frontend/PrintPreprocessedOutput.cpp
>> > +++ lib/Frontend/PrintPreprocessedOutput.cpp
>> > @@ -285,7 +285,7 @@
>> >
>> > CurFilename.clear();
>> > CurFilename += UserLoc.getFilename();
>> > - Lexer::Stringify(CurFilename);
>> > + Lexer::StringifyWithAddedEscape(CurFilename);
>> > FileType = NewFileType;
>> >
>> > if (DisableLineMarkers) {
>> > Index: lib/Lex/Lexer.cpp
>> >
>> ==========================================================
>> =========
>> > --- lib/Lex/Lexer.cpp
>> > +++ lib/Lex/Lexer.cpp
>> > @@ -236,6 +236,41 @@
>> > }
>> > }
>> >
>> > +// Convert non-printing characters to escaped octal sequence.
>> > +static inline char toOctal(int X) { return (X&7)+'0'; }
>> > +
>> > +/// StringifyWithAddedEscape - Convert the specified string into a C
>> > +string and /// convert any non-printing characters to escaped octal
>> > +sequence. This does /// not add surrounding quotes to the string.
>> > +void Lexer::StringifyWithAddedEscape(SmallVectorImpl<char> &Str) {
>> > + for (unsigned i = 0, e = Str.size(); i != e; ++i) {
>> > + unsigned char C = Str[i];
>> > + if (C == '\\' || C == '"') {
>> > + Str.insert(Str.begin()+i, '\\');
>> > + ++i; ++e;
>> > + continue;
>> > + }
>> > +
>> > + if (isprint((unsigned)C))
>> > + continue;
>> > +
>> > + switch (C) {
>> > + case '\b': Str[i]='b'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > + case '\f': Str[i]='f'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > + case '\n': Str[i]='n'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > + case '\r': Str[i]='r'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > + case '\t': Str[i]='t'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > + default:
>> > + Str[i] = '\\';
>> > + Str.insert(Str.begin() + i + 1, toOctal(C >> 6));
>> > + Str.insert(Str.begin() + i + 2, toOctal(C >> 3));
>> > + Str.insert(Str.begin() + i + 3, toOctal(C >> 0));
>> > + i += 3; e += 3;
>> > + break;
>> > + }
>> > + }
>> > +}
>> > +
>> >
>> > //===-----------------------------------------------------------------
>> > -----===//
>> > // Token Spelling
>> >
>> > //===-----------------------------------------------------------------
>> > -----===//
>> > Index: lib/Lex/PPMacroExpansion.cpp
>> >
>> ==========================================================
>> =========
>> > --- lib/Lex/PPMacroExpansion.cpp
>> > +++ lib/Lex/PPMacroExpansion.cpp
>> > @@ -1316,7 +1316,7 @@
>> > SmallString<128> FN;
>> > if (PLoc.isValid()) {
>> > FN += PLoc.getFilename();
>> > - Lexer::Stringify(FN);
>> > + Lexer::StringifyWithAddedEscape(FN);
>> > OS << '"' << FN.str() << '"';
>> > }
>> > Tok.setKind(tok::string_literal);
>> > Index: test/Preprocessor/line-directive-output.c
>> >
>> ==========================================================
>> =========
>> > --- test/Preprocessor/line-directive-output.c
>> > +++ test/Preprocessor/line-directive-output.c
>> > @@ -73,3 +73,8 @@
>> > # 42 "A.c"
>> > # 44 "A.c"
>> > # 49 "A.c"
>> > +
>> > +// CHECK: # 100 "\202\261\202\361\202\311\202\277\202\315.c"
>> > +// CHECK: filename = "\202\261\202\361\202\311\202\277\202\315.c";
>> > +# 100 "\202\261\202\361\202\311\202\277\202\315.c"
>> > +const char *filename = __FILE__;
>> >
>> > _______________________________________________
>> > cfe-commits mailing list
>> > cfe-commits at cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>> >
More information about the cfe-commits
mailing list