[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)

Tue Jan 30 07:32:52 PST 2018

On 1/29/2018 10:19 PM, Friedman, Eli via cfe-dev wrote:
> File names do use a different charset, but LLVM has a file system layer 
> which abstracts that, so clang should pretend the filesystem is UTF-8.  
> (On Windows, we convert to UTF-16 before we call into the OS.  On other 
> systems, we currently just assume everything is UTF-8, but we could 
> change that if you need to run the compiler on a system where that 
> doesn't hold.)

Any translation of file names is risky due to round-trip issues and 
well-formedness requirements.  For example, Shift-JIS defines many code 
points that don't round-trip through Unicode [1].  Also, most POSIX
systems don't require file names to adhere to any particular encoding,
leaving proper interpretation dependent on locale settings.
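
As a rough illustration of both hazards, here is a small Python sketch.
Python is used only because its codec tables are easy to reach; the helper
names round_trips and lossy_double_bytes are made up for this example, and
Python's "cp932" codec is assumed as a stand-in for whatever Shift-JIS
table a given platform actually uses.

```python
def round_trips(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes under `encoding` and re-encodes to the same bytes."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # Not well-formed for this encoding, so there is no Unicode form at all.
        return False
    try:
        return text.encode(encoding) == raw
    except UnicodeEncodeError:
        return False


def lossy_double_bytes(encoding: str = "cp932"):
    """Yield two-byte sequences that decode fine but do not re-encode to themselves."""
    for lead in range(0x81, 0xFD):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError:
                continue  # ill-formed under this table; a different problem
            try:
                if text.encode(encoding) != raw:
                    yield raw
            except UnicodeEncodeError:
                yield raw


if __name__ == "__main__":
    print(round_trips(b"foo.h", "utf-8"))     # plain ASCII name
    print(round_trips(b"\xff.h", "utf-8"))    # not well-formed UTF-8
    print(sum(1 for _ in lossy_double_bytes("cp932")))
```

The count printed by the last line depends entirely on the codec table in
use, so it should be read as a property of that table rather than of
Shift-JIS in general.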

I don't know of any particularly good solutions for these issues.  Things
to think about:
- Should file names in #include directives be transcoded with the source
   file?  (I think so, though this will break attempts to compile, for
   example, EBCDIC source files on systems where referenced file names
   are stored as UTF-8; I'm not sure how gcc handles this).
- How should file names on command lines be interpreted?  (I think
   either no translation, or according to the current locale; on Windows,
   the wide command line should be processed).
- How should file names in environment variables be interpreted?  (I
   think either no translation, or according to the current locale; on
   Windows, the wide environment variable values should be processed).
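
For the command line and environment variable cases, one existing approach
that sits between "no translation" and "according to the current locale"
is Python's "surrogateescape" error handler (PEP 383): names are decoded
with the locale's encoding, but bytes that fail to decode are carried
through as lone surrogates so the original bytes can be recovered.  The
sketch below only illustrates that idea; it is not what clang or LLVM do,
the function names are hypothetical, and UTF-8 is assumed as the locale
encoding.

```python
def name_from_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a file name, mapping each undecodable byte 0xNN to U+DCNN."""
    return raw.decode(encoding, errors="surrogateescape")


def name_to_bytes(name: str, encoding: str = "utf-8") -> bytes:
    """Inverse of name_from_bytes: escaped surrogates become the original bytes."""
    return name.encode(encoding, errors="surrogateescape")


if __name__ == "__main__":
    raw = b"\xff\xfe.h"            # not well-formed UTF-8
    name = name_from_bytes(raw)
    print(name_to_bytes(name) == raw)
```

This only addresses names that are ill-formed in the assumed encoding.
It does nothing for well-formed names whose code points map to the same
Unicode character, as in the Shift-JIS case.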

I think it is reasonable to not support all file names, but if so, the
limitations should be documented.  For starters, how file names are
interpreted in each context (command line, environment variables,
#include directives, config files, etc.) should be documented.  Next,
known limitations should be documented, such as no support for code
points that don't round-trip through Unicode, or no support for file
names that are not well-formed in the encoding used to interpret them.

Tom.

[1]: 
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode