[Lldb-commits] [PATCH] D140358: [lldb-vscode] Add support for disassembly view

Tue Jan 17 01:15:40 PST 2023

eloparco added inline comments.

================
Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177
+  const auto max_instruction_size = g_vsc.target.GetMaximumOpcodeByteSize();
+  const auto bytes_offset = -instruction_offset * max_instruction_size;
+  auto start_addr = base_addr - bytes_offset;
----------------
clayborg wrote:
> eloparco wrote:
> > clayborg wrote:
> > > Just checked out your changes, and you are still just subtracting a value from the start address and attempting to disassemble from memory which is the problem. We need to take that subtracted address, and look it up as suggested in previous code examples I posted. If you find a function to symbol, ask those objects for their instructions. and then try to use those. 
> > > 
> > > But basically for _any_ disassembly this is what I would recommend doing:
> > > - first resolve the "start_address" (no matter how you come up the address) that want to disassemble into a SBAddress
> > > - check its section. If the section is valid and contains instructions, call a function that will disassemble the address range for the section that starts at "start_address" and ends at the end of the section. We can call this "disassemble_code" as a function. More details on this below
> > > - If the section does not contain instructions, just read the bytes and emit a lines like:
> > > ```
> > > 0x1000 .byte 0x12
> > > 0x1000 .byte 0x34
> > > ...
> > > ```
> > > 
> > > Now for the disassemble_code function. We know the address range for this is in code. We then need to resolve the address passed to "disassemble_code" into a SBAddress and ask that address for a SBFunction or SBSymbol as I mentioned. Then we ask the SBFunction or SBSymbol for all instructions that they contain, and then use any instructions that fall into the range we have. If there is no SBFunction or SBSymbol, then disassemble an instruction at a time and then see if the new address will resolve to a function or symbol.
> > Tried my changes on a linux x86 machine and the loop `for (unsigned i = 0; i < max_instruction_size; i++) {` (L2190) takes care of the `start_address` possibly being in the middle of an instruction, so that's not a problem.  The problem I faced is that it tries to read too far from `base_addr` and the `ReadMemory()` operation returns few instructions (without reaching `base_addr`). That was not happening on my macOS M1 (arm) machine. 
> > 
> > To solve, I changed the loop at L2190 to
> > ```
> > for (unsigned i = 0; i < bytes_offset; i++) {
> >     auto sb_instructions =
> >         _get_instructions_from_memory(start_addr + i, disassemble_bytes);
> > ```
> > and if `start_addr` is in `sb_instructions` we're done and can exit the loop. That worked.
> > 
> > Another similar thing that can be done is to start from `start_sbaddr` as you were saying, increment the address until a valid section is found. Then call `_get_instructions_from_memory()` passing the section start.
> > What do you think? Delegating the disassembling to `ReadMemory()` + `GetInstructions()` looks simpler to me than to manually iterate over sections and get instructions from symbols and functions.
> > Is there any shortcoming I'm not seeing?
> so your 
> ```
> for (unsigned i = 0; i < max_instruction_size; i++) { 
> ```
> disassembles and tries to make sure you make it back to the original base address from the original disassemble packet? That can work but could a a bit time consuming?
> 
> The main issue, as you saw on x86, is you don't know what is in memory. You could have unreadable areas of memory when trying to disassemble. Also if you do have good memory that does contain instructions, there can be padding in between functions or even data between functions that the function accesses that can't be correctly disassembled and could throw things off again. 
> 
> The memory regions are the safest way to traverse memory to see what you have and would help you deal with holes in the memory. You can ask about a memory region with:
> ```
> lldb::SBError SBProcess::GetMemoryRegionInfo(lldb::addr_t load_addr, lldb::SBMemoryRegionInfo &region_info);
> ```
> If you ask about an invalid address, like zero on most platforms, it will return a valid "region_info" with the bounds of the unreadable address range filled in no read/write/execute permissions:
> ```
> (lldb) script
> Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
> >>> region = lldb.SBMemoryRegionInfo()
> >>> err = lldb.process.GetMemoryRegionInfo(0, region)
> [0x0000000000000000-0x0000000100000000 ---]
> ```
> So you could use the memory region info to your advantage here. If you have execute permissions, disassemble as instructions, and if you don't emit ".byte 0xXX" for each byte. If there are no permissions, you can emit some other string like "0x00000000: <invalid memory>". 
> 
> That being said, even when you do find an executable section of memory, there can be different stuff there even if a section _is_ executable. For instance, if we ask about the next memory region on a mac M1:
> ```
> >>> err = lldb.process.GetMemoryRegionInfo(0x0000000100000000, region)
> >>> print(region)
> [0x0000000100000000-0x0000000100004000 R-X]
> ```
> Notice that this is read + execute (you can access these via:
> ```
> region.IsReadable()
> region.IsWritable()
> region.IsExecutable()
> ```
> But at this address, this is the mach-o header which doesn't make sense to try and disassemble:
> ```
> (lldb) memory read 0x0000000100000000
> 0x100000000: cf fa ed fe 0c 00 00 01 00 00 00 00 02 00 00 00  ................
> 0x100000010: 11 00 00 00 18 03 00 00 85 00 20 00 00 00 00 00  .......... .....
> (lldb) memory read -fx -s4 -c 4 0x0000000100000000
> 0x100000000: 0xfeedfacf 0x0100000c 0x00000000 0x00000002
> ```
> 0xfeedfacf is the mach-o magic bytes for little endian 64 bit mach-o files.
> ```
> (lldb) disassemble --start-address 0x0000000100000000
> a.out`_mh_execute_header:
>     0x100000000 <+0>:  .long  0xfeedfacf                ; unknown opcode
>     0x100000004 <+4>:  .long  0x0100000c                ; unknown opcode
>     0x100000008 <+8>:  udf    #0x0
>     0x10000000c <+12>: udf    #0x2
>     0x100000010 <+16>: udf    #0x11
>     0x100000014 <+20>: udf    #0x318
>     0x100000018 <+24>: .long  0x00200085                ; unknown opcode
>     0x10000001c <+28>: udf    #0x0
> ```
> 
> So this is the main reason why I would suggest just disassembling using .byte or .long when we aren't in a function or symbol.
> 
understood, so in that case we also have to do the same for `_handle_disassemble_positive_offset`, because even there, when we use `ReadInstructions()`, we could have some unreadable areas in the middle and `ReadInstructions()` would stop prematurely when running into one of them, resulting in less instructions returned than expected

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140358/new/

https://reviews.llvm.org/D140358