[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)

Tue Jan 30 08:16:38 PST 2018

Outside of Windows and macOS, filenames on most filesystems can be an
arbitrary sequence of non-NUL bytes (path components also exclude '/').
A given filesystem may hold names in one encoding, in many encodings, or
names whose bytes are not valid in any encoding.

I'm supportive of requiring these filenames to be encoded in UTF-8, but I
believe Clang currently accepts non-UTF-8 filenames.

As part of this work, I'd advocate making it an error to try to compile (or
preprocess, or link, etc.) any file whose name is not encoded in UTF-8.
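
As a rough illustration of what such a check might look like, here is a
minimal C++ sketch. The name isWellFormedUTF8 is hypothetical and is not an
existing Clang or LLVM API. It only tests whether the bytes are well-formed
UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF); it says
nothing about normalization or about what a particular filesystem accepts.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not a Clang/LLVM API):
// returns true if `name` is a well-formed UTF-8 byte sequence.
bool isWellFormedUTF8(const std::string &name) {
  const auto *p = reinterpret_cast<const unsigned char *>(name.data());
  const std::size_t n = name.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = p[i];
    if (lead < 0x80) { // ASCII, one byte
      ++i;
      continue;
    }
    std::size_t len = 0;
    unsigned long cp = 0;
    if ((lead & 0xE0) == 0xC0) {
      len = 2;
      cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3;
      cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4;
      cp = lead & 0x07;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (n - i < len)
      return false; // sequence truncated at end of name
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = p[i + k];
      if ((c & 0xC0) != 0x80)
        return false; // expected a continuation byte (10xxxxxx)
      cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings: each length has a minimum code point.
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (len == 4 && cp < 0x10000))
      return false;
    // Reject UTF-16 surrogates and values beyond the Unicode range.
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
      return false;
    i += len;
  }
  return true;
}
```

With this sketch, a name such as "caf\xC3\xA9.c" (UTF-8) would pass, while
"caf\xE9.c" (the same name in Latin-1) would be rejected, which is the kind
of input a driver could diagnose before opening the file.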

Ben

On Tue, Jan 30, 2018 at 8:33 AM Tom Honermann via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> On 1/29/2018 10:19 PM, Friedman, Eli via cfe-dev wrote:
> > File names do use a different charset, but LLVM has a file system layer
> > which abstracts that, so clang should pretend the filesystem is UTF-8.
> > (On Windows, we convert to UTF-16 before we call into the OS.  On other
> > systems, we currently just assume everything is UTF-8, but we could
> > change that if you need to run the compiler on a system where that
> > doesn't hold.)
>
> Any translation of file names is risky due to round-trip issues and
> well-formedness requirements.  For example, Shift-JIS defines many code
> points that don't round-trip through Unicode [1].  And most POSIX
> systems don't require file names to adhere to any particular encoding,
> leaving proper interpretation dependent on locale settings.
>
> I don't know of particularly good solutions for these issues.  Things to
> think about:
> - Should file names in #include directives be transcoded with the source
>    file?  (I think so, though this will break attempts to compile, for
>    example, EBCDIC source files on systems where referenced file names
>    are stored as UTF-8; I'm not sure how gcc handles this).
> - How should file names on command lines be interpreted?  (I think
>    either no translation, or according to the current locale; on Windows,
>    the wide command line should be processed).
> - How should file names in environment variables be interpreted?  (I
>    think either no translation, or according to the current locale; on
>    Windows, the wide environment variable values should be processed).
>
> I think it is reasonable to not support all file names, but if so,
> limitations should be documented.  For starters, how file names are
> interpreted in various contexts (command line, env vars, #include,
> config files, etc...) should be documented.  Next, limitations such as
> no support for code points that don't round-trip through Unicode, or no
> support for file names that are not well-formed for the encoding they
> are interpreted with should be documented.
>
> Tom.
>
> [1]: https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode