[PATCH] [2/6] Convert non-printing characters to their octal sequence before emitting #line directive or __FILE__ macro

Arthur O'Dwyer arthur.j.odwyer at gmail.com
Wed Sep 11 15:21:31 PDT 2013


On Wed, Sep 11, 2013 at 2:22 PM, Yunzhong Gao
<Yunzhong_Gao at playstation.sony.com> wrote:
> Arthur wrote:
>   > If #include directives will use UTF-8, then __FILE__ must also use UTF-8, so
>   > that this will work:
>   >
>   >     #include __FILE__
>   >
>   > And I would expect #line directives also to use UTF-8. The only good rationale
>   > I can imagine is that you're dealing with badly behaved third-party generators
>   > such as lex/yacc which dump malformed #line directives into the source file.
>   >
>   > The patch looks good to me, but the stated rationale is misleading; I don't
>   > think this patch helps with anything on a well-behaved system (even one
>   > where the filesystem charset is Shift-JIS). It merely helps Clang not-barf on
>   > malformed input (such as that produced by a badly behaved lex/yacc).
>
>   For some reason, your replies just won't appear in Phabricator while Eli's went
>   in just fine. Weird.

Phabricator requires you to sign in with your Facebook account, which
I don't particularly want to do, so all my replies are sent as email
messages instead of Phabricator comments.

>   I think, a UTF-8 encoded source file should not contain shift-jis encoded lines like this:
>   #include "こんにちは.c"

That's UTF-8! :D  But I take it you mean that the user's source file
should not look like this in "od -t x1":

0000000    23  69  6e  63  6c  75  64  65  20  22  82  b1  82  f1  82  c9
0000020    82  bf  82  cd  2e  63  22  0a
0000030

Instead, it should look like this:

0000000    23  69  6e  63  6c  75  64  65  20  22  e3  81  93  e3  82  93
0000020    e3  81  ab  e3  81  a1  e3  81  af  2e  63  22  0a
0000035

(That's the same Japanese text, simply encoded in UTF-8 instead of
Shift-JIS. We've already agreed that Clang expects all #include
directives to consist of UTF-8-encoded text.)
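
For anyone who wants to reproduce the two dumps above without "od", here is a minimal Python sketch. The helper names (include_line, hex_bytes) are made up for illustration and have nothing to do with Clang or the patch; the only assumption is Python's built-in "shift_jis" and "utf-8" codecs.

```python
# Illustrative only: rebuild the two #include lines and show their bytes.
# The helper names are hypothetical; nothing here comes from Clang.
text = "こんにちは"


def include_line(encoding: str) -> bytes:
    """Return the #include line with the filename encoded in `encoding`."""
    return b'#include "' + text.encode(encoding) + b'.c"\n'


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex, like `od -t x1`."""
    return " ".join(f"{b:02x}" for b in data)


line_sjis = include_line("shift_jis")
line_utf8 = include_line("utf-8")

print(hex_bytes(line_sjis))
print(hex_bytes(line_utf8))
# The lengths match the final od offsets (octal 30 and octal 35).
print(len(line_sjis), len(line_utf8))
```

Both lines decode to the same text; only the byte-level encoding of the filename differs.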

>   But it is okay to have lines like this:
>   #include "\202\261\202\361\202\311\202\277\202\315.c"

It's *okay* to have that line, but it doesn't mean what you think it
means. First, the backslashes are problematic (at least according to
the C++ standard); I don't actually know off the top of my head
whether this would try to open the "202" directory on Windows.
Second, that's not a valid filename according to the rules of
#include, which (as we've already agreed) expects all #include
directives to consist of UTF-8-encoded text.
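
To make the second point concrete, here is a rough Python sketch that expands the octal escapes from the quoted line into raw bytes and checks them against UTF-8. It deliberately treats each \ooo as an escape, which is what the quoted example intends; whether the preprocessor would do so inside a header name is a separate question (the first point above). The function names are hypothetical.

```python
import re

# Illustrative only: expand \ooo octal escapes into raw bytes.
# Assumes every other character in the string is plain ASCII.
OCTAL_ESCAPE = re.compile(r"\\([0-3][0-7]{2})")


def unescape_octal(s: str) -> bytes:
    out = bytearray()
    pos = 0
    for m in OCTAL_ESCAPE.finditer(s):
        out += s[pos:m.start()].encode("ascii")  # literal text before the escape
        out.append(int(m.group(1), 8))           # the escaped byte itself
        pos = m.end()
    out += s[pos:].encode("ascii")               # trailing literal text
    return bytes(out)


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = unescape_octal(r"\202\261\202\361\202\311\202\277\202\315.c")
print(is_valid_utf8(raw))       # the bytes are not well-formed UTF-8
print(raw.decode("shift_jis"))  # but they are a Shift-JIS filename
```

The byte 0x82 can only be a continuation byte in UTF-8, so a name starting with it cannot be UTF-8 text.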

>   You might be right that the current patch does not help the compiler find the included file

Well, then it shouldn't be pushed. Only patches that help should be pushed. :P

>   The equivalent UTF-8 encoded file name like the following might help the compiler find the file:
>   #include "\343\203\231\343\203\274\343\202\267\343\203\203\343\202\257.c"

If I were the programmer, I would simply write

    #include "こんにちは.c"

This should work fine on all filesystems whose native character sets
encode those particular glyphs. UTF-8, UTF-16, Shift-JIS, EUC... all
should work fine. Translating between UTF-8 and local filesystem
encodings is a solved problem.
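
As a sketch of what that translation looks like, assuming the filesystem encoding is known and passed in explicitly (a real compiler would ask the platform; the function name and parameter below are hypothetical):

```python
from typing import Optional


def to_filesystem_bytes(name_utf8: bytes, fs_encoding: str) -> Optional[bytes]:
    """Re-encode a UTF-8 filename for a filesystem using `fs_encoding`.

    Returns None when the target character set cannot represent the name.
    Hypothetical helper for illustration; not part of Clang.
    """
    name = name_utf8.decode("utf-8")
    try:
        return name.encode(fs_encoding)
    except UnicodeEncodeError:
        return None


name_utf8 = "こんにちは.c".encode("utf-8")
results = {
    enc: to_filesystem_bytes(name_utf8, enc)
    for enc in ("utf-8", "utf-16-le", "shift_jis", "euc_jp", "latin-1")
}
for enc, converted in results.items():
    print(enc, converted.hex(" ") if converted is not None else "not representable")
```

The latin-1 case is there to show the caveat: the conversion only succeeds when the target character set can represent the characters in the name.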

–Arthur



