<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 1/29/2018 5:21 PM, Sean Perry via

      cfe-dev wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:OF852B0509.2E8DCF9B-ON00258225.0001AC30-85258225.000779F2@notes.na.collabserv.com">

      <p><font size="2">Hi, I have been investigating how to

          implement -finput-charset and -fexec-charset (including

          -fexec-wide-charset too). I noticed some discussions a couple

          years ago and was planning on taking the same approach. In a

          nutshell, change the source manager to convert the input

          charset into UTF-8 and do the parsing using UTF-8 (e.g. in

          Lexer::InitLexer()). I would convert strings and character

          constants into the exec charset when a consumer asks for the

          string literal. This seems like a sound concept but there are

          many details that need to be ironed out. The clang data

          structure is filled with all kinds of strings (i.e. file names,

          identifiers, literals). What charset should be used when

          creating the clang ASTs? Should getName() return the name in

          UTF-8 or an output charset? </font><br>

      </p>

    </blockquote>

    <br>

    UTF-8; introducing another charset into the AST seems confusing for

    no benefit, given we're converting the input source code to UTF-8

    anyway.  The only place we need to translate symbol names to the

    execution charset is IR generation.<br>
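    <br>
    As a rough illustration of that flow (not clang code), the sketch
    below decodes the source once into UTF-8 and converts a literal to
    the execution charset only when it is emitted.  The function names
    and the choice of Latin-1 and cp037 (EBCDIC) are assumptions made
    for the example.<br>
    <pre>
```python
# Hypothetical charsets: Latin-1 input and EBCDIC (cp037) execution charset.
INPUT_CHARSET = "latin-1"
EXEC_CHARSET = "cp037"


def load_source(raw_bytes, input_charset=INPUT_CHARSET):
    """Decode raw source bytes once; return the UTF-8 buffer the parser sees."""
    return raw_bytes.decode(input_charset).encode("utf-8")


def literal_for_codegen(utf8_literal, exec_charset=EXEC_CHARSET):
    """Convert a UTF-8 literal to the execution charset only when emitting it."""
    return utf8_literal.decode("utf-8").encode(exec_charset)


# Source text as it might arrive on disk in the input charset.
raw = 'const char *s = "caf\u00e9";'.encode(INPUT_CHARSET)

# Everything internal (identifiers, literals, AST strings) stays UTF-8.
internal = load_source(raw)
literal = "caf\u00e9".encode("utf-8")

# Conversion to the execution charset happens at the very end.
emitted = literal_for_codegen(literal)
```
</pre>
    Keeping a single internal encoding means only the two boundary
    functions need to know about any other charset.<br>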

    <br>

    <blockquote type="cite"

cite="mid:OF852B0509.2E8DCF9B-ON00258225.0001AC30-85258225.000779F2@notes.na.collabserv.com">

      <p><br>

        <font size="2">While looking into this I realized that we need

          one more charset. We have the input charset for the source

          code and exec charset for the string literals. And we have

          an internal charset (UTF-8) for parsing. But we also need to

          have a charset for things like symbol names and file names.</font></p>

    </blockquote>

    <br>

    The charset for symbol names has to be the same as -fexec-charset

    for any target that has an API like dlsym().<br>
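    <br>
    To see why, consider a toy model of a dlsym()-style lookup.  This
    is only a sketch: the symbol table is a plain dictionary, cp037
    stands in for a non-ASCII execution charset, and emit_symbol is a
    made-up helper, not an LLVM function.<br>
    <pre>
```python
EXEC_CHARSET = "cp037"  # assumed EBCDIC execution charset, for illustration


def emit_symbol(utf8_name, exec_charset=EXEC_CHARSET):
    """Translate an internal UTF-8 symbol name for the object file."""
    return utf8_name.decode("utf-8").encode(exec_charset)


# Toy symbol table keyed by the bytes a (hypothetical) loader would compare.
symbol_table = {emit_symbol(b"my_func"): 0x1000}

# A program calling dlsym(handle, "my_func") passes the literal as
# execution-charset bytes, because that is how string literals are emitted.
runtime_literal = "my_func".encode(EXEC_CHARSET)
found = symbol_table.get(runtime_literal)

# If symbols had been left in UTF-8, the same lookup would miss.
utf8_table = {b"my_func": 0x1000}
missed = utf8_table.get(runtime_literal)
```
</pre>
    The lookup only succeeds when the symbol name and the string
    literal went through the same conversion.<br>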

    <br>

    File names do use a different charset, but LLVM has a file system

    layer which abstracts that, so clang should pretend the filesystem

    is UTF-8.  (On Windows, we convert to UTF-16 before we call into the

    OS.  On other systems, we currently just assume everything is UTF-8,

    but we could change that if you need to run the compiler on a system

    where that doesn't hold.)<br>
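    <br>
    A minimal sketch of that kind of boundary conversion follows.  It
    does not reproduce LLVM's file system layer; to_native_path and
    its platform argument are hypothetical names used only to show
    where the conversion would sit.<br>
    <pre>
```python
def to_native_path(utf8_path, platform):
    """Convert a UTF-8 path to the bytes a (hypothetical) OS call would take."""
    text = utf8_path.decode("utf-8")
    if platform == "windows":
        # Wide-character Windows APIs take UTF-16; little-endian is assumed here.
        return text.encode("utf-16-le")
    # Elsewhere the bytes pass through unchanged, mirroring the
    # "assume everything is UTF-8" behaviour described above.
    return utf8_path


windows_bytes = to_native_path(b"a.h", "windows")
posix_bytes = to_native_path("caf\u00e9.h".encode("utf-8"), "posix")
```
</pre>
    Supporting a system with a different native filename charset would
    mean adding one more branch here, with the rest of the compiler
    still seeing UTF-8.<br>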

    <blockquote type="cite"

cite="mid:OF852B0509.2E8DCF9B-ON00258225.0001AC30-85258225.000779F2@notes.na.collabserv.com">

      <p><font size="2">We also need to consider messages. The messages

          may not be in the same charset as the input charset or

          internal. We will need to consider translation for messages

          and the substituted text. </font><br>

      </p>

    </blockquote>

    <br>

    Messages should be UTF-8 until we have to convert them for output. 

    (An IDE always wants UTF-8.  For a console/stderr, we probably need

    some conversion, but IIRC that isn't implemented at the moment.)<br>
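    <br>
    One possible shape for that late conversion is sketched below.
    The render_diagnostic helper and the escaping policy are
    assumptions for illustration; nothing like this is claimed to
    exist in clang today.<br>
    <pre>
```python
def render_diagnostic(utf8_message, sink_encoding):
    """Convert a UTF-8 diagnostic to the sink's encoding at the last moment."""
    text = utf8_message.decode("utf-8")
    # backslashreplace keeps unencodable characters visible as escapes
    # instead of raising an error or silently dropping them.
    return text.encode(sink_encoding, errors="backslashreplace")


msg = "unused variable '\u5909\u6570'".encode("utf-8")

# An IDE consuming UTF-8 gets the message unchanged.
for_ide = render_diagnostic(msg, "utf-8")

# A console limited to ASCII gets escapes for the identifier.
for_console = render_diagnostic(msg, "ascii")
```
</pre>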

    <br>

    Preprocessed output needs to be in the input charset; otherwise the

    compiler can't consume the result (for example, -save-temps would

    break).<br>
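    <br>
    The round trip can be sketched as follows, assuming cp037 as the
    input charset and a made-up preprocess_roundtrip helper that does
    no real preprocessing.  It only shows that writing the result back
    in the input charset lets the same decoder read it again.<br>
    <pre>
```python
def preprocess_roundtrip(raw_source, input_charset):
    """Decode to UTF-8 for processing, then write back in the input charset."""
    internal = raw_source.decode(input_charset).encode("utf-8")
    # ... macro expansion would operate on `internal` here ...
    return internal.decode("utf-8").encode(input_charset)


text = 'char c = "\u00e9";'
src = text.encode("cp037")

# Output in the input charset can be fed straight back to the compiler.
out = preprocess_roundtrip(src, "cp037")

# Emitting the internal UTF-8 bytes instead would be misread on re-input.
misread = text.encode("utf-8").decode("cp037")
```
</pre>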

    <br>

    -Eli<br>

    <pre class="moz-signature" cols="72">-- 

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</pre>

  </body>

</html>