[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)
Tom Honermann via cfe-dev
cfe-dev at lists.llvm.org
Tue Jan 30 07:32:52 PST 2018
On 1/29/2018 10:19 PM, Friedman, Eli via cfe-dev wrote:
> File names do use a different charset, but LLVM has a file system layer
> which abstracts that, so clang should pretend the filesystem is UTF-8.
> (On Windows, we convert to UTF-16 before we call into the OS. On other
> systems, we currently just assume everything is UTF-8, but we could
> change that if you need to run the compiler on a system where that
> doesn't hold.)
Any translation of file names is risky due to round-trip issues and
well-formedness requirements. For example, Shift-JIS defines many code
points that don't round-trip through Unicode [1]. Also, most POSIX
systems don't require file names to adhere to any particular encoding,
leaving proper interpretation dependent on locale settings.
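To make the round-trip concern concrete, here is a minimal Python sketch,
not a statement about how clang or LLVM behaves. The helper name
round_trips is hypothetical, "cp932" is Python's codec name for the
Microsoft variant of Shift-JIS, and the two double-byte candidates are
values I picked from the vendor extension area for illustration; whether
they survive depends on the codec's tables.

```python
def round_trips(raw: bytes, encoding: str) -> bool:
    """Return True if raw survives decode-then-encode in the given encoding."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # Either raw is not well-formed in this encoding, or the decoded
        # text cannot be encoded back.
        return False


# Illustrative candidates (assumptions, not taken from the original post):
# a plain ASCII name and two double-byte values from the extension area.
candidates = [b"foo.h", b"\xee\xf9", b"\xfa\x54"]
for name in candidates:
    print(name, round_trips(name, "cp932"))
```

A file name for which this check returns False cannot be passed through a
Unicode-based file system layer and back without changing its bytes.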
I don't know of particularly good solutions for these issues. Things to
think about:
- Should file names in #include directives be transcoded with the source
file? (I think so, though this will break attempts to compile, for
example, EBCDIC source files on systems where referenced file names
are stored as UTF-8; I'm not sure how gcc handles this).
- How should file names on command lines be interpreted? (I think
either no translation, or according to the current locale; on Windows,
the wide command line should be processed).
- How should file names in environment variables be interpreted? (I
think either no translation, or according to the current locale; on
Windows, the wide environment variable values should be processed).
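As a rough illustration of why the #include question needs an explicit
decision, the Python sketch below compares the bytes of a file name as
they would appear in an EBCDIC source file with the bytes of the same
name stored as UTF-8. The charsets chosen ("cp037" and "utf-8") and the
helper include_name_for_fs are assumptions for the example; this says
nothing about what clang or gcc actually do.

```python
SOURCE_CHARSET = "cp037"  # assumption: an EBCDIC code page used by the source file
FS_CHARSET = "utf-8"      # assumption: file names on disk are stored as UTF-8


def include_name_for_fs(name_in_source: bytes) -> bytes:
    """Transcode an #include file name from the source charset to the file system charset."""
    return name_in_source.decode(SOURCE_CHARSET).encode(FS_CHARSET)


# The bytes of "foo.h" as they would appear inside the EBCDIC source file.
raw = "foo.h".encode(SOURCE_CHARSET)

# Used without translation, the raw bytes differ from the UTF-8 name on disk.
print(raw == b"foo.h")

# After transcoding, they match the UTF-8 name.
print(include_name_for_fs(raw) == b"foo.h")
```

The same comparison applies to names arriving on the command line or in
environment variables, except that there the candidate charset is the
current locale rather than the source charset.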
I think it is reasonable to not support all file names, but if so,
limitations should be documented. For starters, how file names are
interpreted in various contexts (command line, env vars, #include,
config files, etc.) should be documented. Next, the limitations
themselves should be documented, such as no support for code points
that don't round-trip through Unicode, or no support for file names
that are not well-formed in the encoding used to interpret them.
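For the well-formedness limitation, a small Python sketch of the two
options (reject ill-formed names, or carry the raw bytes through
untouched) might look like the following. The function names are
hypothetical, and the "surrogateescape" error handler is Python's own
mechanism for preserving undecodable bytes; I am not suggesting LLVM has
an equivalent.

```python
def is_well_formed(name: bytes, encoding: str = "utf-8") -> bool:
    """Return True if name decodes cleanly in the given encoding."""
    try:
        name.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def to_text_lossless(name: bytes) -> str:
    # surrogateescape maps each undecodable byte to a lone surrogate
    # code point so the original byte can be restored later.
    return name.decode("utf-8", errors="surrogateescape")


def to_bytes_lossless(text: str) -> bytes:
    # Inverse of to_text_lossless: lone surrogates become the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


bad = b"caf\xe9.h"  # Latin-1 bytes for a name with e-acute; not valid UTF-8
print(is_well_formed(bad))
print(to_bytes_lossless(to_text_lossless(bad)) == bad)
```

Either policy is defensible; the point is that whichever one is chosen
should be written down per context.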
Tom.
[1]:
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode