[llvm-dev] LLVM trunk generates different machine code for JCC instruction w/ or w/o debug info

Mon Jan 11 15:37:09 PST 2021

On Sun, Jan 3, 2021 at 6:11 PM Vedant Kumar <vsk at apple.com> wrote:
>
>
>
> On Dec 29, 2020, at 12:09 PM, Fangrui Song via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 2020-12-29, Neil Nelson via llvm-dev wrote:
>
> Bug 37728 - [meta] Make llvm passes debug info invariant
> https://bugs.llvm.org/show_bug.cgi?id=37728
>
> Further discussion on methods.
> https://groups.google.com/g/llvm-dev/c/yvbWr4azdh0/m/gy1tQIzIDwAJ
>
> Neil Nelson
>
>
> Thanks for the links:)
>
> On 12/29/20 7:25 AM, 陈志伟 via llvm-dev wrote:
>
> Hi folks, this is my first post to the llvm-dev mailing list, and definitely not the last :-)
>
> Recently, I found that an ELF file built with debug info contains different machine code from the same file built without it. Sadly, I cannot reproduce it with a small piece of code. Here is my investigation.
>
> clang -S -emit-llvm foo.cc -O3 -ggdb3 -o dbg.ll
> clang -S -emit-llvm foo.cc -O3 -o rel.ll
>
>
> Here foo.cc is a source file at my company with 10k+ LOC that depends on many third-party libraries.
>
> The only differences between dbg.ll and rel.ll are the LLVM debug intrinsics. Hmm, looks fine.
>
> llc dbg.ll -o dbg.s
> llc rel.ll -o rel.s
>
>
> And the asm instructions are the same. Hmm, fine again.
>
> llvm-mc -filetype=obj dbg.s -o dbg.o
> llvm-mc -filetype=obj rel.s -o rel.o
>
>
> The two object files generated by the LLVM assembler have DIFFERENT machine code.
>
> 74 19                      je f20
>
>
> The object compiled with debug info uses the 2-byte form (opcode 0x74) to encode the JE instruction, while
>
> 0f 84 15 00 00 00   je f20
>
>
> the object compiled without debug info uses the 6-byte form (0x0f 0x84) instead.
>
> What? Why does debug info affect the generated machine code? As an LLVM beginner, I’m willing to dive deeper to find the root cause.
>
> Thanks in advance.
>
>
>
> llvm.dbg.* are intrinsics (subset of Instruction).
>
> DbgInfoIntrinsic
>  DbgLabelInst
>  DbgVariableIntrinsic
>    DbgValueInst: llvm.dbg.value
>    DbgAddrIntrinsic: llvm.dbg.addr
>    DbgDeclareInst: llvm.dbg.declare (similar to llvm.dbg.addr, but not control-dependent)
>
> It is very easy to forget to account for their existence in an optimization pass. Loops over instructions need to skip them explicitly:
>
> for (Instruction &I : BB) {
>  if (isa<DbgInfoIntrinsic>(I))
>    continue;
>  ...
> }
>
> for (Instruction &I : instructions(F)) {
>  if (isa<DbgInfoIntrinsic>(I))
>    continue;
>  ...
> }
>
> If an optimization pass does not skip llvm.dbg.* and lets their occurrences affect its heuristics (for example, by counting the number of instructions in a basic block), the transformation result may differ with and without llvm.dbg.*.
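
To make the heuristic problem concrete, here is a minimal self-contained sketch. It does not use LLVM at all: ToyInst, countAll, countNonDebug and isSmallEnough are hypothetical names invented for illustration, with IsDebug standing in for isa<DbgInfoIntrinsic>(I), and the threshold of 3 is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for llvm::Instruction (hypothetical, not LLVM's API).
// IsDebug plays the role of isa<DbgInfoIntrinsic>(I).
struct ToyInst {
  bool IsDebug;
};

// Debug-sensitive count: every instruction is counted, debug or not.
std::size_t countAll(const std::vector<ToyInst> &BB) { return BB.size(); }

// Debug-invariant count: debug instructions are skipped first.
std::size_t countNonDebug(const std::vector<ToyInst> &BB) {
  std::size_t N = 0;
  for (const ToyInst &I : BB) {
    if (I.IsDebug)
      continue;
    ++N;
  }
  return N;
}

// Hypothetical size heuristic, e.g. "only transform small blocks".
bool isSmallEnough(std::size_t Count, std::size_t Threshold = 3) {
  return Count <= Threshold;
}
```

With a block of three real instructions, adding two debug instructions pushes countAll from 3 to 5, so isSmallEnough flips from true to false and the "pass" behaves differently under -g. countNonDebug returns 3 in both cases, so the decision is unchanged.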
>
> GCC has -fcompare-debug, and it seems that in the past they fought diligently with debug-affecting-codegen problems as well. (I am happy to take a stab at implementing it if others think it is mildly useful.)
>
>
> It is not clear how serious the problem in LLVM is. If, for example, the llvm-project codebase can be fixed relatively easily, we could probably add a build bot to detect new problems.
>
>
> Thanks for diving into this. Fwiw, we already have some tooling for identifying and investigating debug-affecting-codegen issues [1][2][3]. I'm not familiar with gcc's -fcompare-debug: while it could be better than what we've got, imho it makes sense to focus on addressing issues we already know about or can trivially detect. (To find lots more of these issues, simply build LNT [4] with the Os and Os-g profiles and diff the object files, or run [3] on the tests for your backend of choice.)
>
> To elaborate on [3] a bit: there appears to be a long tail of codegen difference bugs lurking around in the various backends, but not many (if any? -- it's been a while since I looked) at the IR level. I believe one of the root causes for this is that IR-level use-def chains ignore llvm.dbg.* uses by default (thanks to the ValueAsMetadata abstraction), while MIR-level use-def chains _include_ debug uses by default (see MachineRegisterInfo::use_*). It appears to be way too easy to write backend code that incorrectly assumes that debug uses are not there.
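
The MIR-level pitfall described above can be sketched the same way. This is a toy model, not LLVM code: ToyUse, hasOneUse and hasOneNonDebugUse are hypothetical names, loosely modeled on the use_* versus use_nodbg_* split in MachineRegisterInfo.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for an operand that reads a virtual register
// (hypothetical type). IsDebug marks a use from a debug instruction.
struct ToyUse {
  bool IsDebug;
};

// Default-style query: debug uses are counted like any other use.
bool hasOneUse(const std::vector<ToyUse> &Uses) { return Uses.size() == 1; }

// nodbg-style query: debug uses are filtered out before counting.
bool hasOneNonDebugUse(const std::vector<ToyUse> &Uses) {
  std::size_t N = 0;
  for (const ToyUse &U : Uses)
    if (!U.IsDebug)
      ++N;
  return N == 1;
}
```

A backend transform guarded by hasOneUse fires for a register with one real use, but stops firing once a debug use of the same register appears. Guarding it with hasOneNonDebugUse gives the same answer in both builds.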
>
> I went on a bit of a spree trying to fix some of those issues in the AArch64 backend, starting with [5]. For a brief moment it was possible to add debug info to all the tests in test/CodeGen/AArch64 and still have all of them pass. Alas, that's no longer true. Adding a buildbot could help with this. It could also be valuable to change the MachineRegisterInfo default to ignore debug uses -- that's a larger change that would require a fair amount of community review and buy-in.
>
> [1] Object file level diffing: https://github.com/vedantk/scripts/blob/master/objdiff_driver.sh
> [2] IR-level debug-affecting-codegen detection: https://github.com/vedantk/scripts/blob/master/opt-check-dbg-invar.sh
> [3] MIR-level debug-affecting-codegen detection: https://llvm.org/docs/HowToUpdateDebugInfo.html#mutation-testing-for-mir-level-transformations (e.g. `llvm-lit test/CodeGen/AArch64 -Dllc="llc -debugify-and-strip-all-safe"`)
> [4] https://github.com/llvm/llvm-test-suite
> [5] https://reviews.llvm.org/rG5c04274dab4858180d756329d11499df247e9d2d
>
> vedant

Really appreciate the links:) I'll study them. A build bot will
definitely be helpful.

For Zhiwei's original problem, the (JCC + .p2align 4, 0x90) difference
reported at https://bugs.llvm.org/show_bug.cgi?id=42138#c13,
I have found the root cause: an assembler optimization implemented in
X86AsmBackend.
I have attached more information at https://reviews.llvm.org/D75203#2491618
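
For reference, the two byte sequences in the original report are the rel8 (74 cb) and rel32 (0f 84 cd) encodings of JE, and both reach the same target. The sketch below only checks that arithmetic; it does not model the X86AsmBackend optimization. encodeJE, targetOffset and the ForceNear flag are hypothetical names for illustration, not LLVM's API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode JE with a displacement measured from the end of the instruction.
// ForceNear (hypothetical knob) selects the 6-byte form even when the
// displacement would fit in 8 bits.
std::vector<std::uint8_t> encodeJE(std::int32_t Disp, bool ForceNear) {
  if (!ForceNear && Disp >= -128 && Disp <= 127)
    return {0x74, static_cast<std::uint8_t>(Disp)};
  const auto U = static_cast<std::uint32_t>(Disp);
  return {0x0f, 0x84,
          static_cast<std::uint8_t>(U & 0xff),
          static_cast<std::uint8_t>((U >> 8) & 0xff),
          static_cast<std::uint8_t>((U >> 16) & 0xff),
          static_cast<std::uint8_t>((U >> 24) & 0xff)};
}

// Branch target as an offset from the start of the instruction:
// instruction length plus the sign-extended displacement.
std::int64_t targetOffset(const std::vector<std::uint8_t> &Bytes) {
  if (Bytes.size() == 2 && Bytes[0] == 0x74)
    return 2 + static_cast<std::int8_t>(Bytes[1]);
  assert(Bytes.size() == 6 && Bytes[0] == 0x0f && Bytes[1] == 0x84);
  const std::uint32_t U = static_cast<std::uint32_t>(Bytes[2]) |
                          (static_cast<std::uint32_t>(Bytes[3]) << 8) |
                          (static_cast<std::uint32_t>(Bytes[4]) << 16) |
                          (static_cast<std::uint32_t>(Bytes[5]) << 24);
  return 6 + static_cast<std::int32_t>(U);
}
```

Here 74 19 gives 2 + 0x19 and 0f 84 15 00 00 00 gives 6 + 0x15, so both land 27 bytes past the start of the instruction. The two objects differ in instruction length, not in control flow.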

>
>
> Yes, reducing the source with a tool like creduce is important.
>
>
> With the new pass manager (-fno-legacy-pass-manager, which will hopefully become the default in the next release),
> you can dump changed IR with -print-changed, e.g.
>
>  clang -fno-legacy-pass-manager -mllvm -print-changed -S -O2 a.c 2> log
>
> This is usually more readable than -print-after-all.