[Lldb-commits] [PATCH] D140358: [lldb-vscode] Add support for disassembly view

Fri Jan 6 01:50:21 PST 2023

eloparco added inline comments.

================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+  const auto bytes_offset = -instruction_offset * bytes_per_instruction;
+  auto start_addr = base_addr - bytes_offset;
+  const auto disassemble_bytes = instruction_count * bytes_per_instruction;
----------------
clayborg wrote:
> eloparco wrote:
> > clayborg wrote:
> > > This will work if the min and max opcode byte size are the same, like for arm64, the min and max are 4. This won't work for x86 or arm32 in thumb mode. So when backing up, we need to do an address lookup on the address we think we want to go to, and then adjust the starting address accordingly. Something like:
> > > 
> > > ```
> > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target);
> > > ```
> > > now we have a section offset address that can tell us more about what it is. We can find the SBFunction or SBSymbol for this address and use those to find the right instructions. This will allow us to correctly disassemble code bytes. 
> > > 
> > > We can also look at the section that the memory comes from and see what the section contains. If the section is data, then emit something like:
> > > ```
> > > 0x00001000 .byte 0x23
> > > 0x00001001 .byte 0x34
> > > ...
> > > ```
> > > To find the section type we can do:
> > > ```
> > > SBSection section = start_sbaddr.GetSection();
> > > if (section.IsValid() && section.GetSectionType() == lldb::eSectionTypeCode) {
> > >  // Disassemble from a valid boundary
> > > } else {
> > >   // Emit a byte or long at a time with ".byte 0xXX" or other ASM directive for binary data
> > > }
> > > ```
> > > We need to ensure we start disassembling on the correct instruction boundary as well as our math for "start_addr" might be in between actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then we know we have instructions, and if we are not, then we can emit ".byte" or other binary data directives. So if we do have lldb::eSectionTypeCode as our section type, then we should have a function or symbol, and we can get instructions from those objects easily:
> > > 
> > > ```
> > > if (section.IsValid() && section.GetSectionType() == lldb::eSectionTypeCode) {
> > >  lldb::SBInstructionList instructions;
> > >  lldb::SBFunction function = start_sbaddr.GetFunction();
> > >  if (function.IsValid()) {
> > >     instructions = function.GetInstructions(g_vsc.target);
> > >  } else {
> > >     symbol = start_sbaddr.GetSymbol();
> > >     if (symbol.IsValid())
> > >       instructions = symbol.GetInstructions(g_vsc.target);
> > > }
> > > const size_t num_instrs = instructions.GetSize();
> > > if (num_instrs > 0) {
> > >   // we found instructions from a function or symbol and we need to 
> > >   // find the matching instruction that we want to start from by iterating
> > >   // over the instructions and finding the right address
> > >   size_t matching_idx = num_instrs; // Invalid index
> > >   for (size_t i=0; i<num_instrs; ++i) {
> > >     lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i);
> > >     if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) {
> > >       matching_idx = i;
> > >       break;
> > >     }
> > >   }
> > >   if (matching_idx < num_instrs) {
> > >     // Now we can print the instructions from [matching_idx, num_instrs)
> > >     // then we need to repeat the search for the next function or symbol. 
> > >     // note there may be bytes between functions or symbols which we can disassemble
> > >     // by calling _get_instructions_from_memory(...) but we must find the next
> > >     // symbol or function boundary and get back on track
> > >   }
> > >   
> > > ```
> > Sorry, I should have provided a proper explanation.
> > 
> > I use the maximum instruction size as the "worst case". Basically, I need to read a portion of memory but I do not know the start address and the size. For the start address, if I want to read N instructions before `base_addr` I need to read at least starting from `base_addr - N * max_instruction_size`: if all instructions are of size `max_instruction_size` I will read exactly N instructions; otherwise I will read more than N instructions and prune the additional ones afterwards. Same for applies for the size.
> > 
> > Since `start_addr` is based on a "worst case", it may be an address in the middle of an instruction. In that case, that first instruction will be misinterpreted, but I think that is negligible.
> > 
> > The logic is similar to what is implemented in other VS Code extensions, like https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134.
> > 
> > Does it make sense?
> The issue is, you might end up backing up by N bytes, and you might not end up on an opcode boundary. Lets say you have x86 disassembly like:
> 
> ```
>     0x100002e37 <+183>: 48 8b 4d f8              movq   -0x8(%rbp), %rcx
>     0x100002e3b <+187>: 48 39 c8                 cmpq   %rcx, %rax
>     0x100002e3e <+190>: 0f 85 09 00 00 00        jne    0x100002e4d               ; <+205> at main.cpp
>     0x100002e44 <+196>: 8b 45 94                 movl   -0x6c(%rbp), %eax
>     0x100002e47 <+199>: 48 83 c4 70              addq   $0x70, %rsp
>     0x100002e4b <+203>: 5d                       popq   %rbp
>     0x100002e4c <+204>: c3                       retq   
>     0x100002e4d <+205>: e8 7e 0f 00 00           callq  0x100003dd0               ; symbol stub for: __stack_chk_fail
>     0x100002e52 <+210>: 0f 0b                    ud2    
> ```
> 
> Let's say you started with the address 0x100002e4c, and backed up by the max opcode size of 15 for x86_64, that would be 0x100002e3d. You would start disassembling on a non opcode boundary as this is the last byte of the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly incorrectly. So we need  to make use of the functions or symbol boundaries to ensure we are disassembling correctly. If we have no function or symbol, we can do our best. But as you can see we would get things completely wrong in this case and we need to fix this as detailed.
Actually, I was mislead by the fact that so far I tried on both:
- my ARM machine, with instructions of fixed size (4)
- with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) disassembling to WASM, where, when the start address is in the middle of an instruction, only that first instruction is misinterpreted

Without going into sections and symbols as you were proposing, the solution can be as easy as done in this VS Code Extension: https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51.

I'll update the code and put some additional comments to clarify the logic. Let me know what you think.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140358/new/

https://reviews.llvm.org/D140358