[PATCH] [2/6] Convert non-printing characters to their octal sequence before emitting #line directive or __FILE__ macro

Wed Sep 11 13:35:02 PDT 2013

On Wed, Sep 11, 2013 at 12:07 PM, Gao, Yunzhong
<yunzhong_gao at playstation.sony.com> wrote:
>> > This is part of a bigger effort to support foreign characters in file names.
>
> That sentence was only meant to give some background for this patch, which is just the second of six patches...
> I am particularly interested in supporting the Japanese Shift-JIS encoding (Windows code page 932)
> on Windows. On these systems, #include directives will use UTF-8 encoding, but file names on
> the command prompt will use Shift-JIS encoding. Both will be translated to UTF-16/Unicode before
> making system calls to the underlying file system.

If #include directives will use UTF-8, then __FILE__ must also use
UTF-8, so that this will work:

    #include __FILE__

And I would expect #line directives also to use UTF-8. The only good
rationale I can imagine is that you're dealing with badly behaved
third-party generators such as lex/yacc which dump malformed #line
directives into the source file.

The patch looks good to me, but the stated rationale is misleading; I
don't think this patch helps with anything on a well-behaved system
(even one where the filesystem charset is Shift-JIS). It merely helps
Clang not-barf on malformed input (such as that produced by a badly
behaved lex/yacc).

my $.02,
-Arthur
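
For readers who want to try the escaping outside Clang, here is a minimal standalone sketch of the step the quoted patch performs. It is not the patch's code: `escapeForLineDirective` and `checkEscapeExamples` are hypothetical names, the sketch works on `std::string` rather than `SmallVectorImpl<char>`, it uses an explicit ASCII range check in place of `isprint`, and it writes every non-printing byte as a three-digit octal escape instead of special-casing `\b`, `\f`, `\n`, `\r`, and `\t`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the patch): returns a copy of `in` that
// can be placed between double quotes in a #line directive.
std::string escapeForLineDirective(const std::string &in) {
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    if (c == '\\' || c == '"') {
      // Backslash and double quote get a leading backslash.
      out += '\\';
      out += static_cast<char>(c);
    } else if (c >= 0x20 && c <= 0x7e) {
      // Printable ASCII passes through unchanged.
      out += static_cast<char>(c);
    } else {
      // Always emit three octal digits so that a digit following the
      // escaped byte cannot be absorbed into the escape sequence.
      out += '\\';
      out += static_cast<char>('0' + ((c >> 6) & 7));
      out += static_cast<char>('0' + ((c >> 3) & 7));
      out += static_cast<char>('0' + (c & 7));
    }
  }
  return out;
}

// Small self-check using the first two bytes of the file name in the test case.
inline void checkEscapeExamples() {
  assert(escapeForLineDirective("\202\261.c") == "\\202\\261.c");
  assert(escapeForLineDirective("a\"b") == "a\\\"b");
}
```

Building a new string avoids the index bookkeeping (`++i; ++e;`) that in-place insertion needs, at the cost of one extra allocation.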

>> > When clang generates a preprocessed file, it saves the current file name
>> > into a #line directive or __FILE__ macro. But if the file name
>> > contains escaped octal sequences, clang tries to convert them back to
>> > characters, which may not be valid UTF-8, and that triggers diagnostics like this:
>> > ```
>> > /* test.c */
>> > #line 5 "\202\261\202\361\202\311\202\277\202\315.c"
>> > /* end of test.c */
>> > ```
>> > $ clang -S -save-temps test.c
>> > test.c:1:6: warning: illegal character encoding in string literal
>> > [-Winvalid-source-encoding]
>> >
>> > Clang does not really have to convert the characters; it could have
>> > just saved the escaped sequence in the preprocessed output. The
>> > proposed patch attempts to convert any non-printing characters to
>> > their corresponding escaped octal sequence before printing out. This
>> > is part of a bigger effort to support foreign characters in file names.
>> >
>> > Could someone take a look whether the proposed patch is good to go in?
>> >
>> > Many thanks,
>> > - Gao
>> >
>> > http://llvm-reviews.chandlerc.com/D1291
>> >
>> > Files:
>> >   include/clang/Lex/Lexer.h
>> >   lib/Frontend/PrintPreprocessedOutput.cpp
>> >   lib/Lex/Lexer.cpp
>> >   lib/Lex/PPMacroExpansion.cpp
>> >   test/Preprocessor/line-directive-output.c
>> >
>> > Index: include/clang/Lex/Lexer.h
>> > ===================================================================
>> > --- include/clang/Lex/Lexer.h
>> > +++ include/clang/Lex/Lexer.h
>> > @@ -236,6 +236,11 @@
>> >    /// and " characters.  This does not add surrounding ""'s to the string.
>> >    static void Stringify(SmallVectorImpl<char> &Str);
>> >
>> > +  /// StringifyWithAddedEscape - Convert the specified string into a C string
>> > +  /// and convert any non-printable characters to escaped octal sequence. This
>> > +  /// does not add surrounding quotes to the string.
>> > +  static void StringifyWithAddedEscape(SmallVectorImpl<char> &Str);
>> > +
>> >
>> >    /// getSpelling - This method is used to get the spelling of a token into a
>> >    /// preallocated buffer, instead of as an std::string.  The caller is required
>> > Index: lib/Frontend/PrintPreprocessedOutput.cpp
>> > ===================================================================
>> > --- lib/Frontend/PrintPreprocessedOutput.cpp
>> > +++ lib/Frontend/PrintPreprocessedOutput.cpp
>> > @@ -285,7 +285,7 @@
>> >
>> >    CurFilename.clear();
>> >    CurFilename += UserLoc.getFilename();
>> > -  Lexer::Stringify(CurFilename);
>> > +  Lexer::StringifyWithAddedEscape(CurFilename);
>> >    FileType = NewFileType;
>> >
>> >    if (DisableLineMarkers) {
>> > Index: lib/Lex/Lexer.cpp
>> > ===================================================================
>> > --- lib/Lex/Lexer.cpp
>> > +++ lib/Lex/Lexer.cpp
>> > @@ -236,6 +236,41 @@
>> >    }
>> >  }
>> >
>> > +// Convert non-printing characters to escaped octal sequence.
>> > +static inline char toOctal(int X) { return (X&7)+'0'; }
>> > +
>> > +/// StringifyWithAddedEscape - Convert the specified string into a C string and
>> > +/// convert any non-printing characters to escaped octal sequence. This does
>> > +/// not add surrounding quotes to the string.
>> > +void Lexer::StringifyWithAddedEscape(SmallVectorImpl<char> &Str) {
>> > +  for (unsigned i = 0, e = Str.size(); i != e; ++i) {
>> > +    unsigned char C = Str[i];
>> > +    if (C == '\\' || C == '"') {
>> > +      Str.insert(Str.begin()+i, '\\');
>> > +      ++i; ++e;
>> > +      continue;
>> > +    }
>> > +
>> > +    if (isprint((unsigned)C))
>> > +      continue;
>> > +
>> > +    switch (C) {
>> > +      case '\b': Str[i]='b'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > +      case '\f': Str[i]='f'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > +      case '\n': Str[i]='n'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > +      case '\r': Str[i]='r'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > +      case '\t': Str[i]='t'; Str.insert(Str.begin()+i, '\\'); ++i; ++e; break;
>> > +      default:
>> > +        Str[i] = '\\';
>> > +        Str.insert(Str.begin() + i + 1, toOctal(C >> 6));
>> > +        Str.insert(Str.begin() + i + 2, toOctal(C >> 3));
>> > +        Str.insert(Str.begin() + i + 3, toOctal(C >> 0));
>> > +        i += 3; e += 3;
>> > +        break;
>> > +    }
>> > +  }
>> > +}
>> > +
>> >  //===----------------------------------------------------------------------===//
>> >  // Token Spelling
>> >  //===----------------------------------------------------------------------===//
>> > Index: lib/Lex/PPMacroExpansion.cpp
>> > ===================================================================
>> > --- lib/Lex/PPMacroExpansion.cpp
>> > +++ lib/Lex/PPMacroExpansion.cpp
>> > @@ -1316,7 +1316,7 @@
>> >      SmallString<128> FN;
>> >      if (PLoc.isValid()) {
>> >        FN += PLoc.getFilename();
>> > -      Lexer::Stringify(FN);
>> > +      Lexer::StringifyWithAddedEscape(FN);
>> >        OS << '"' << FN.str() << '"';
>> >      }
>> >      Tok.setKind(tok::string_literal);
>> > Index: test/Preprocessor/line-directive-output.c
>> > ===================================================================
>> > --- test/Preprocessor/line-directive-output.c
>> > +++ test/Preprocessor/line-directive-output.c
>> > @@ -73,3 +73,8 @@
>> >  # 42 "A.c"
>> >  # 44 "A.c"
>> >  # 49 "A.c"
>> > +
>> > +// CHECK: # 100 "\202\261\202\361\202\311\202\277\202\315.c"
>> > +// CHECK: filename = "\202\261\202\361\202\311\202\277\202\315.c";
>> > +# 100 "\202\261\202\361\202\311\202\277\202\315.c"
>> > +const char *filename = __FILE__;
>> >
>> > _______________________________________________
>> > cfe-commits mailing list
>> > cfe-commits at cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>> >
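
The test case in the quoted patch expects the octal escapes to survive into the preprocessed output. As a rough illustration of why that loses no information, the sketch below decodes such escapes back into the original bytes. The names `decodeOctalEscapes` and `checkDecodeExamples` are hypothetical, and the decoder is a simplification: it handles only octal escapes of up to three digits plus `\\` and `\"`, and does not interpret other escapes such as `\n` or `\x..`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: turns the text between the quotes of a #line
// directive back into raw bytes. Only octal escapes, \\ and \" are handled.
std::string decodeOctalEscapes(const std::string &in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    // Ordinary characters (and a lone trailing backslash) are copied as-is.
    if (in[i] != '\\' || i + 1 == in.size()) {
      out += in[i];
      continue;
    }
    char next = in[i + 1];
    if (next >= '0' && next <= '7') {
      // An octal escape consumes at most three octal digits.
      unsigned value = 0;
      std::size_t digits = 0;
      while (digits < 3 && i + 1 < in.size() &&
             in[i + 1] >= '0' && in[i + 1] <= '7') {
        value = value * 8 + static_cast<unsigned>(in[i + 1] - '0');
        ++i;
        ++digits;
      }
      out += static_cast<char>(value & 0xff);
    } else {
      // For \\ and \" drop the backslash and keep the escaped character.
      out += next;
      ++i;
    }
  }
  return out;
}

// Small self-check: the escaped text decodes to the two original bytes.
inline void checkDecodeExamples() {
  assert(decodeOctalEscapes("\\202\\261.c") == "\202\261.c");
}
```

The three-digit limit matters when a digit follows the escape: the text `\2021` decodes to the byte 0202 followed by the character `1`, which is why an encoder should always emit all three digits.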
