[llvm-bugs] [Bug 45924] New: [MC] 32-bit ELF binaries end up with tons of unnamed symbols in the .debug_str section

via llvm-bugs llvm-bugs at lists.llvm.org
Thu May 14 00:42:08 PDT 2020


https://bugs.llvm.org/show_bug.cgi?id=45924

            Bug ID: 45924
           Summary: [MC] 32-bit ELF binaries end up with tons of unnamed
                    symbols in the .debug_str section
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: MC
          Assignee: unassignedbugs at nondot.org
          Reporter: mh+llvm at glandium.org
                CC: dmajor at mozilla.com, froydnj at gmail.com,
                    llvm-bugs at lists.llvm.org

Small STR (steps to reproduce):
- Create a `test.c` file containing `int main() { return 0; }`
- Compile with `clang -g -o test test.c -m32`
- Run `objdump -t test | grep debug_str`

The output looks like:
```
00000000 l    d  .debug_str     00000000              .debug_str
00000000 l       .debug_str     00000000              
00000015 l       .debug_str     00000000              
0000001c l       .debug_str     00000000              
00000021 l       .debug_str     00000000              
00000026 l       .debug_str     00000000              
```

For a library like libxul.so in Firefox, that yields more than 20 million of
these, whereas all the other symbols amount to about 360 thousand. This sadly
makes a check script take 2 minutes instead of a few seconds because of the
sheer number of symbols.
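
To quantify this on a given binary, a check script could count the unnamed
`.debug_str` symbols directly from the `objdump -t` output. The following is a
minimal Python sketch, assuming the column layout shown above (value, a
7-character flags field, section, size, optional name); the helper names
`parse_symbol_line` and `count_unnamed` are made up for illustration.

```python
def parse_symbol_line(line):
    """Parse one `objdump -t` line into (value, flags, section, name), or None."""
    addr, sep, rest = line.rstrip("\n").partition(" ")
    if not sep:
        return None
    try:
        value = int(addr, 16)
    except ValueError:
        return None  # header lines such as "SYMBOL TABLE:"
    flags = rest[:7]           # fixed-width flags column, e.g. "l    d "
    fields = rest[7:].split()  # section, size, and possibly a name
    if len(fields) < 2:
        return None
    return value, flags, fields[0], " ".join(fields[2:])


def count_unnamed(lines, section=".debug_str"):
    """Count symbols in `section` with an empty name, skipping 'd' entries."""
    count = 0
    for line in lines:
        parsed = parse_symbol_line(line)
        if parsed is None:
            continue
        _value, flags, sec, name = parsed
        # The section symbol itself is flagged 'd' and carries a name.
        if sec == section and not name and "d" not in flags:
            count += 1
    return count


SAMPLE = """\
00000000 l    d  .debug_str     00000000              .debug_str
00000000 l       .debug_str     00000000
00000015 l       .debug_str     00000000
0000001c l       .debug_str     00000000
00000021 l       .debug_str     00000000
00000026 l       .debug_str     00000000
"""

print(count_unnamed(SAMPLE.splitlines()))  # prints 5
```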

This doesn't happen when not using the integrated assembler:
```
$ clang -g -o test test.c -m32 -fno-integrated-as
$ objdump -t test | grep debug_str
00000000 l    d  .debug_str     00000000              .debug_str
```

It doesn't happen with lld either:
```
$ clang -g -o test test.c -m32 -fuse-ld=lld
$ objdump -t test | grep debug_str
```

But the latter is because lld filters out symbols somehow (there are fewer
symbols overall than when linking with BFD ld or gold).

Other than the linker, there are actually two things at play here.

The first one is the difference between how the integrated assembler handles
.debug_str relocations compared to GNU as.
```
$ clang -g -o test.mc.o -c test.c -m32
$ clang -g -o test.gas.o -c test.c -m32 -fno-integrated-as
$ objdump -r -j .debug_info -s test.mc.o

test.mc.o:     file format elf32-i386

RELOCATION RECORDS FOR [.debug_info]:
OFFSET   TYPE              VALUE 
00000006 R_386_32          .debug_abbrev
0000000c R_386_32          
00000012 R_386_32          
00000016 R_386_32          .debug_line
0000001a R_386_32          
0000001e R_386_32          .text
00000027 R_386_32          .text
00000031 R_386_32          
0000003c R_386_32          


Contents of section .debug_info:
 0000 3f000000 04000000 00000401 00000000  ?...............
 0010 0c000000 00000000 00000000 00000000  ................
 0020 00001200 00000200 00000012 00000001  ................
 0030 55000000 0001013b 00000003 00000000  U......;........
 0040 050400                               ...             

$ objdump -r -j .debug_info -s test.gas.o
test.gas.o:     file format elf32-i386

RELOCATION RECORDS FOR [.debug_info]:
OFFSET   TYPE              VALUE 
00000006 R_386_32          .debug_abbrev
0000000c R_386_32          .debug_str
00000012 R_386_32          .debug_str
00000016 R_386_32          .debug_line
0000001a R_386_32          .debug_str
0000001e R_386_32          .text
00000027 R_386_32          .text
00000031 R_386_32          .debug_str
0000003c R_386_32          .debug_str


Contents of section .debug_info:
 0000 3f000000 04000000 00000401 00000000  ?...............
 0010 0c006500 00000000 00006c00 00000000  ..e.......l.....
 0020 00001200 00000200 00000012 00000001  ................
 0030 55710000 0001013b 00000003 76000000  Uq.....;....v...
 0040 050400                               ...             
```

Let's take one of the corresponding DWARF strings in the `objdump -W` output:
`<12>   DW_AT_name        : (indirect string, offset: 0x65): test.c`

In the mc case, we use this relocation:
`00000012 R_386_32          `
which uses this symbol:
`00000065 l       .debug_str    00000000 `
and is applied to the data at offset 0x12 in `.debug_info`, which holds
`00000000` (the implicit addend).

In the gas case, we use this relocation:
`00000012 R_386_32          .debug_str`
which uses this symbol:
`00000000 l    d  .debug_str    00000000 .debug_str`
and is applied to the data at offset 0x12 in `.debug_info`, which holds
`65000000` (the implicit addend, 0x65 in little-endian).

It's worth noting that in the gas case there aren't any symbols for these
strings at all, contrary to the mc case.

This doesn't happen on 64-bit ELF because in that case, the addend from
Elf_Rela is being used instead of the contents of `.debug_info`, so none of the
unnamed symbols exist in the first place, like in the 32-bit gas case.
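
The arithmetic in these cases can be checked by hand. The Python sketch below
assumes the usual `R_386_32` computation of symbol value plus addend, with the
addend read from the relocated location for `Elf32_Rel`. The byte strings are
the first two rows of the `.debug_info` dumps above; the function names and the
0x65 addend passed to the Elf_Rela variant are assumptions for illustration.

```python
import struct

# First 32 bytes of .debug_info, copied from the two `objdump -s` listings.
MC_INFO = bytes.fromhex(
    "3f000000040000000000040100000000"
    "0c000000000000000000000000000000"
)
GAS_INFO = bytes.fromhex(
    "3f000000040000000000040100000000"
    "0c00650000000000" "00006c0000000000"
)


def resolve_rel32(section, offset, symbol_value):
    """Elf32_Rel style: the addend is the little-endian word stored at `offset`."""
    (implicit_addend,) = struct.unpack_from("<I", section, offset)
    return (symbol_value + implicit_addend) & 0xFFFFFFFF


def resolve_rela32(symbol_value, addend):
    """Elf_Rela style: the addend comes from the relocation record itself."""
    return (symbol_value + addend) & 0xFFFFFFFF


# mc: relocation against the unnamed symbol at 0x65, zero stored in place.
print(hex(resolve_rel32(MC_INFO, 0x12, 0x65)))
# gas: relocation against the section symbol (value 0), 0x65 stored in place.
print(hex(resolve_rel32(GAS_INFO, 0x12, 0x00)))
# Elf_Rela: section symbol plus an explicit addend (assumed to be 0x65 here).
print(hex(resolve_rela32(0x00, 0x65)))
```

All three calls resolve to the same `.debug_str` offset, 0x65; only the mc
variant needs one symbol per string to get there.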

Because the unnamed symbols are used in relocations, they are not stripped the
way they are on 64-bit ELF (via this code:
https://github.com/llvm/llvm-project/blob/d3530e95f1d4c97cf24e77c6db2d32ee5344d4ee/llvm/lib/MC/ELFObjectWriter.cpp#L638).

The second thing at play is that the symbol name from the original assembly
source is lost in translation.

```
$ clang -g -o - -S test.c -m32 | sed -n '/debug_str/,/section/p'
        .section        .debug_str,"MS",@progbits,1
.Linfo_string0:
        .asciz  "clang version 11.0.0 (https://github.com/llvm/llvm-project 79af7314fbde836854315ef7213076653076f20c)" # string offset=0
.Linfo_string1:
        .asciz  "test.c"                # string offset=101
.Linfo_string2:
        .asciz  "/tmp"                  # string offset=108
.Linfo_string3:
        .asciz  "main"                  # string offset=113
.Linfo_string4:
        .asciz  "int"                   # string offset=118
        .ident  "clang version 11.0.0 (https://github.com/llvm/llvm-project 79af7314fbde836854315ef7213076653076f20c)"
        .section        ".note.GNU-stack","",@progbits
```
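
The `string offset=` comments in this listing are running byte counts over the
NUL-terminated strings, and they are the values the `.Linfo_string*` labels
take within `.debug_str`. A short Python sketch of that arithmetic follows; the
helper name `string_offsets` is hypothetical.

```python
def string_offsets(strings):
    """Return the start offset of each string when laid out as .asciz data."""
    offsets = []
    position = 0
    for s in strings:
        offsets.append(position)
        position += len(s.encode()) + 1  # +1 for the terminating NUL
    return offsets


PRODUCER = (
    "clang version 11.0.0 (https://github.com/llvm/llvm-project "
    "79af7314fbde836854315ef7213076653076f20c)"
)
STRINGS = [PRODUCER, "test.c", "/tmp", "main", "int"]

print(string_offsets(STRINGS))  # prints [0, 101, 108, 113, 118]
```

In hexadecimal these are 0x0, 0x65, 0x6c, 0x71 and 0x76, so the 0x65 seen in
the `DW_AT_name` entry for `test.c` is the offset of `.Linfo_string1`.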

The symbols that end up unnamed in the object file are those `.Linfo_string*`
symbols. They end up unnamed because of this commit:
https://github.com/llvm/llvm-project/commit/c177fec93f6676aa38686e24843f62ec0f8a7643

The code in that function has changed since then. The relevant part in master
is now:
https://github.com/llvm/llvm-project/blob/d3530e95f1d4c97cf24e77c6db2d32ee5344d4ee/llvm/lib/MC/MCContext.cpp#L194

Commenting out this early return actually fixes it:
```
$ clang-patched -g -o test test.c -m32
$ objdump -t test | grep debug_str
00000000 l    d  .debug_str     00000000              .debug_str
```

The reason is that the intermediate object now has symbol names:
```
$ clang -g -o test.o -c test.c -m32
$ objdump -t test.o | grep debug_str
00000000 l       .debug_str     00000000 .Linfo_string0
00000065 l       .debug_str     00000000 .Linfo_string1
0000006c l       .debug_str     00000000 .Linfo_string2
00000071 l       .debug_str     00000000 .Linfo_string3
00000076 l       .debug_str     00000000 .Linfo_string4
```

and the linker apparently eliminates the symbols with a name starting with
`.L`.
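
That observation would explain why only the unnamed variants survive: a filter
keyed on the `.L` prefix cannot match an empty name. The Python sketch below
models that behaviour. It illustrates the hypothesis and is not the linker's
actual code; `is_temporary_local` and `surviving_symbols` are invented names.

```python
def is_temporary_local(name):
    """Assumed rule: assembler-temporary labels on ELF start with '.L'."""
    return name.startswith(".L")


def surviving_symbols(symbols):
    """Drop local symbols whose name marks them as temporaries."""
    return [(value, name) for value, name in symbols
            if not is_temporary_local(name)]


# Symbol values and names from the patched object file listing above.
named = [(0x00, ".Linfo_string0"), (0x65, ".Linfo_string1"),
         (0x6C, ".Linfo_string2"), (0x71, ".Linfo_string3"),
         (0x76, ".Linfo_string4")]
# Same symbols with the names dropped, as the unpatched assembler emits them.
unnamed = [(value, "") for value, _ in named]

print(len(surviving_symbols(named)))    # prints 0
print(len(surviving_symbols(unnamed)))  # prints 5
```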
