[PATCH] [2/6] Convert non-printing characters to their octal sequence before emitting #line directive or __FILE__ macro
Arthur O'Dwyer
arthur.j.odwyer at gmail.com
Wed Sep 11 15:21:31 PDT 2013
On Wed, Sep 11, 2013 at 2:22 PM, Yunzhong Gao
<Yunzhong_Gao at playstation.sony.com> wrote:
> Arthur wrote:
> > If #include directives will use UTF-8, then __FILE__ must also use UTF-8, so
> > that this will work:
> >
> > #include __FILE__
> >
> > And I would expect #line directives also to use UTF-8. The only good rationale
> > I can imagine is that you're dealing with badly behaved third-party generators
> > such as lex/yacc which dump malformed #line directives into the source file.
> >
> > The patch looks good to me, but the stated rationale is misleading; I don't
> > think this patch helps with anything on a well-behaved system (even one
> > where the filesystem charset is Shift-JIS). It merely helps Clang not-barf on
> > malformed input (such as that produced by a badly behaved lex/yacc).
>
> For some reason, your replies just won't appear in Phabricator while Eli's went
> in just fine. Weird.
Phabricator requires you to sign in with your Facebook account, which
I don't particularly want to do, so all my replies are sent as email
messages instead of Phabricator comments.
> I think a UTF-8 encoded source file should not contain Shift-JIS encoded lines like this:
> #include "こんにちは.c"
That's UTF-8! :D But I take it you mean that the user's source file
should not look like this in "od -t x1":
0000000 23 69 6e 63 6c 75 64 65 20 22 82 b1 82 f1 82 c9
0000020 82 bf 82 cd 2e 63 22 0a
0000030
Instead, it should look like this:
0000000 23 69 6e 63 6c 75 64 65 20 22 e3 81 93 e3 82 93
0000020 e3 81 ab e3 81 a1 e3 81 af 2e 63 22 0a
0000035
(That's the same Japanese text, simply encoded in UTF-8 instead of
Shift-JIS. We've already agreed that Clang expects all #include
directives to consist of UTF-8-encoded text.)
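To make the difference between the two dumps concrete, here is a minimal sketch of a UTF-8 well-formedness check. It is only an illustration, not Clang's code; the name is_valid_utf8 is made up for this example. The Shift-JIS dump fails at its very first filename byte, because 0x82 is a UTF-8 continuation byte and cannot start a sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of Clang): returns true if the `len` bytes
 * at `s` form well-formed UTF-8. Rejects stray continuation bytes, truncated
 * sequences, overlong encodings, surrogates, and values above U+10FFFF. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long min;  /* smallest code point legal at this length */
        unsigned long cp;   /* code point being decoded */

        if (b < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((b & 0xe0) == 0xc0) { extra = 1; min = 0x80;    cp = b & 0x1f; }
        else if ((b & 0xf0) == 0xe0) { extra = 2; min = 0x800;   cp = b & 0x0f; }
        else if ((b & 0xf8) == 0xf0) { extra = 3; min = 0x10000; cp = b & 0x07; }
        else return false;  /* continuation byte or invalid lead byte */

        if (len - i <= extra) return false;  /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xc0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3f);
        }
        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return false;
        i += extra + 1;
    }
    return true;
}
```

Fed the filename bytes from the first dump (82 b1 82 f1 ...), this returns false; fed the bytes from the second dump (e3 81 93 ...), it returns true.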
> But it is okay to have lines like this:
> #include "\202\261\202\361\202\311\202\277\202\315.c"
It's *okay* to have that line, but it doesn't mean what you think it
means. First, the backslashes are problematic (at least according to
the C++ standard, which does not treat a header name as a string
literal, so "\202" is not guaranteed to be an octal escape there); I
don't actually know off the top of my head whether this would try to
open the "202" directory on Windows. Second, that's not a valid
filename according to the rules of #include, which (as we've already
agreed) expects all #include directives to consist of UTF-8-encoded
text.
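For reference, the octal form quoted above is just the Shift-JIS bytes from the first dump written out as three-digit octal escapes. Here is a minimal sketch of that conversion, assuming "non-printing" means anything outside ASCII 0x20..0x7e. The function escape_octal is hypothetical and is not the code in the patch; it also leaves backslashes and quotes alone, which a real implementation would have to decide about.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not the patch's code): copy the NUL-terminated string
 * `in` to `out`, rewriting every byte outside printable ASCII (0x20..0x7e)
 * as a backslash followed by three octal digits.
 * Returns the number of characters written (excluding the terminating NUL),
 * or (size_t)-1 if `out_size` is too small. */
size_t escape_octal(const char *in, char *out, size_t out_size)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p != '\0'; ++p) {
        if (*p >= 0x20 && *p <= 0x7e) {
            if (n + 1 >= out_size) return (size_t)-1;  /* need room for NUL */
            out[n++] = (char)*p;
        } else {
            if (n + 4 >= out_size) return (size_t)-1;  /* 4 chars plus NUL */
            n += (size_t)snprintf(out + n, out_size - n, "\\%03o",
                                  (unsigned)*p);
        }
    }
    if (n >= out_size) return (size_t)-1;  /* covers out_size == 0 */
    out[n] = '\0';
    return n;
}
```

Given the bytes 82 b1 82 f1 82 c9 82 bf 82 cd 2e 63, this produces the text \202\261\202\361\202\311\202\277\202\315.c, which only shows where the quoted line comes from; it does not make that line a valid UTF-8 header name.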
> You might be right that the current patch does not help the compiler find the included file
Well, then it shouldn't be pushed. Only patches that help should be pushed. :P
> The equivalent UTF-8 encoded file name like the following might help the compiler find the file:
> #include "\343\203\231\343\203\274\343\202\267\343\203\203\343\202\257.c"
If I were the programmer, I would simply write
> #include "こんにちは.c"
This should work fine on all filesystems whose native character sets
encode those particular glyphs. UTF-8, UTF-16, Shift-JIS, EUC... all
should work fine. Translating between UTF-8 and local filesystem
encodings is a solved problem.
–Arthur