[LLVMdev] Mangling of UTF-8 characters in symbol names

Eli Friedman eli.friedman at gmail.com
Fri Mar 30 18:22:51 PDT 2012


On Fri, Mar 30, 2012 at 6:17 PM, Sean Hunt <scshunt at csclub.uwaterloo.ca> wrote:
> On Fri, Mar 30, 2012 at 15:22, Eli Friedman <eli.friedman at gmail.com> wrote:
>>
>> On Fri, Mar 30, 2012 at 12:12 PM, Sean Hunt <scshunt at csclub.uwaterloo.ca>
>> wrote:
>> > Why do high (>127) bytes in symbol names get mangled by LLVM into
>> > _XX_, where XX is the hex representation of the byte? Is this
>> > required by ELF or some similar standard? This behavior is
>> > inconsistent with GCC.
>>
>> I think it's just so that we have a way to actually write out the
>> symbol into the assembly file.  What does gcc do?
>>
>> -Eli
>>
>
> It emits the high bytes literally. The consequence is that UTF-8-encoded
> identifiers come out in the symbol table as UTF-8:
>
> scshunt at natural-flavours:~$ gcc -fextended-identifiers -std=c99 -x c -c -o test.o -
> int i\u03bb;
> scshunt at natural-flavours:~$ nm test.o
> 00000004 C iλ
> scshunt at natural-flavours:~$
>
> As you can see, the nm output includes the literal lambda.

Okay... then we should probably support that as well.  Might need to
be a bit careful to make sure the assembly files work correctly.
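
For reference, the escaping being discussed amounts to roughly the
following -- a sketch of the transformation only, not LLVM's actual
mangling code (the helper name and buffer handling here are made up
for illustration):

#include <stdio.h>

/* Sketch of the _XX_ escaping described above: each byte > 127 in a
 * symbol name is replaced by an underscore-delimited hex pair.  This
 * is illustrative only; LLVM's real mangler may differ in details
 * such as hex case. */
static void escape_symbol(const char *name, char *out, size_t outlen)
{
    size_t pos = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; ++p) {
        if (*p > 127) {
            if (pos + 5 > outlen)          /* room for "_XX_" plus NUL */
                break;
            pos += (size_t)snprintf(out + pos, outlen - pos, "_%02X_", *p);
        } else {
            if (pos + 2 > outlen)          /* room for char plus NUL */
                break;
            out[pos++] = (char)*p;
        }
    }
    out[pos] = '\0';
}

int main(void)
{
    char buf[64];
    /* "i\u03bb" encodes as 'i' 0xCE 0xBB in UTF-8 */
    escape_symbol("i\xCE\xBB", buf, sizeof buf);
    printf("%s\n", buf);   /* prints i_CE__BB_ */
    return 0;
}

Under gcc's scheme the 0xCE 0xBB bytes would instead pass through
untouched, which is why nm shows the literal lambda above.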

-Eli



