<div dir="ltr">Outside of Windows and macOS, a filename on most filesystems can be an arbitrary NUL-terminated sequence of bytes: a given filesystem may hold names in one encoding, in many encodings, or in no valid encoding at all.<div><br></div><div>I'm supportive of requiring these filenames to be encoded in UTF-8, but I believe Clang currently accepts non-UTF-8 filenames.</div><div><br></div><div>As part of this work, I'd advocate making it an error to try to compile (or preprocess, or link, etc.) any file whose name is not encoded in UTF-8.</div><div><br></div><div>Ben</div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, Jan 30, 2018 at 8:33 AM Tom Honermann via cfe-dev &lt;<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 1/29/2018 10:19 PM, Friedman, Eli via cfe-dev wrote:<br>

> File names do use a different charset, but LLVM has a file system layer<br>

> which abstracts that, so clang should pretend the filesystem is UTF-8.<br>

> (On Windows, we convert to UTF-16 before we call into the OS.  On other<br>

> systems, we currently just assume everything is UTF-8, but we could<br>

> change that if you need to run the compiler on a system where that<br>

> doesn't hold.)<br>

<br>

Any translation of file names is risky due to round-trip issues and<br>

well-formedness requirements.  For example, Shift-JIS defines many code<br>

points that don't round-trip through Unicode [1].  And most POSIX<br>

systems don't require file names to adhere to any particular encoding,<br>

leaving proper interpretation dependent on locale settings.<br>
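<br>
To make the round-trip and well-formedness concerns concrete, here is a minimal Python sketch. It is an illustration only, not a description of what Clang or LLVM does; the helper name round_trips is hypothetical, and it relies only on the codecs that ship with Python.<br>
<pre>
```python
def round_trips(raw: bytes, encoding: str) -> bool:
    """Hypothetical helper: True if raw is well-formed in encoding and
    survives bytes -> Unicode -> bytes unchanged."""
    try:
        text = raw.decode(encoding, errors="strict")
        return text.encode(encoding, errors="strict") == raw
    except UnicodeError:
        # Ill-formed input, or a character the codec cannot encode back.
        return False


# File names are handled as raw bytes here, the way a POSIX kernel stores them.
for name in (b"caf\xc3\xa9.h", b"caf\xe9.h"):
    print(name, round_trips(name, "utf-8"), round_trips(name, "latin-1"))
```
</pre>
The second name is ill-formed as UTF-8 but acceptable as Latin-1, so the same bytes pass or fail depending on which encoding is assumed.<br>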

<br>

I don't know of particularly good solutions for these issues.  Things to<br>

think about:<br>

- Should file names in #include directives be transcoded with the source<br>

   file?  (I think so, though this will break attempts to compile, for<br>

   example, EBCDIC source files on systems where referenced file names<br>

   are stored as UTF-8; I'm not sure how gcc handles this).<br>

- How should file names on command lines be interpreted?  (I think<br>

   either no translation, or according to the current locale; on Windows,<br>

   the wide command line should be processed).<br>

- How should file names in environment variables be interpreted?  (I<br>

   think either no translation, or according to the current locale; on<br>

   Windows, the wide environment variable values should be processed).<br>
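<br>
The two options named in the bullets above (no translation, or interpretation according to the locale) can be sketched in Python. This is a sketch under stated assumptions, not how Clang processes arguments: interpret_argument and original_bytes are hypothetical names, and the "opaque" policy borrows Python's surrogateescape error handler (PEP 383) as one possible way to carry uninterpreted bytes.<br>
<pre>
```python
import locale


def interpret_argument(raw: bytes, policy: str = "opaque", encoding: str = "") -> str:
    """Hypothetical helper: turn raw argument bytes into a name."""
    if policy == "opaque":
        # No translation: undecodable bytes become lone surrogates, so the
        # original byte string can be recovered exactly later on.
        return raw.decode("utf-8", errors="surrogateescape")
    if policy == "locale":
        # Strict interpretation: raises UnicodeDecodeError on ill-formed input.
        chosen = encoding or locale.getpreferredencoding(False)
        return raw.decode(chosen, errors="strict")
    raise ValueError("unknown policy: " + policy)


def original_bytes(name: str) -> bytes:
    """Recover the bytes behind a name produced by the opaque policy."""
    return name.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9.c"  # Latin-1 bytes, not well-formed UTF-8
print(original_bytes(interpret_argument(raw)) == raw)
print(interpret_argument(raw, "locale", "latin-1"))
```
</pre>
The opaque policy never fails and never changes the bytes that reach the filesystem, while the locale policy yields readable text but rejects names that are ill-formed for the assumed encoding.<br>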

<br>

I think it is reasonable to not support all file names, but if so,<br>

limitations should be documented.  For starters, how file names are<br>

interpreted in various contexts (command line, env vars, #include,<br>

config files, etc.) should be documented.  Next, limitations such as<br>

no support for code points that don't round-trip through Unicode, or no<br>

support for file names that are not well-formed for the encoding they<br>

are interpreted with, should be documented.<br>

<br>

Tom.<br>

<br>

[1]:<br>

<a href="https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode" rel="noreferrer" target="_blank">https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode</a><br>


</blockquote></div>