[Lldb-commits] [PATCH] D71575: [LLDB] Add ObjectFileWasm plugin for WebAssembly debugging

Tue Jan 14 05:26:15 PST 2020

labath added a comment.

Sorry for the delay. I was trying to figure out whether I want to get into the whole DataExtractor discussion or not -- I eventually did... :/

Besides that bit, I think this is looking good..

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:342-343
+                        section_type,                  // Section type.
+                        sect_info.offset & 0xffffffff, // VM address.
+                        sect_info.size, // VM size in bytes of this section.
+                        sect_info.offset &
----------------
paolosev wrote:
> labath wrote:
> > paolosev wrote:
> > > labath wrote:
> > > > paolosev wrote:
> > > > > labath wrote:
> > > > > > Are the debug info sections actually loaded into memory for wasm? Should these be zero (that's what they are for elf)?
> > > > > Yes, I was thinking that the debug sections should be loaded into memory; not sure how this works for ELF, how does the debugger find the debug info in that case?
> > > > Are you referring to the load-from-memory, or load-from-file scenario? Normally, the debug info is not loaded into memory, and if lldb is loading the module from a file, then the debug info is loaded from the file (and we use the "file offset" field to locate them). Loading files from memory does not work in general (because we can't find the whole file there -- e.g., section headers are missing). The only case it does work is in case of jitted files. I'm not 100% sure how that works, but given that there is no `if(memory)` block in the section creation code, those sections must too get `vm size = 0`.
> > > > 
> > > > In practice, I don't think it does not matter that much what you put here -- I expect things will mostly work regardless. I am just trying to make this consistent in some way. If these sections are not found in the memory at the address which you set here (possibly adjusted by the load bias) then I think it makes sense to set `vm_size=vm_addr=0`. If they are indeed present there, then setting it to those values is perfectly fine.
> > > Modified, but I am not sure I completely understood the difference of file addresses and vm addresses.
> > > 
> > > In the case of Wasm, we can have two cases:
> > > a) Wasm module contains executable code and also DWARF data embedded in DWARF sections
> > > b) Wasm module contains code and has a custom section that points to a separated Wasm module that only contains DWARF sections
> > > 
> > > The file of Wasm modules that contains code should never be loaded directly by LLDB; LLDB should use GDB-remote to connect to the Wasm engine that loaded the module and retrieve the module content from the engine.
> > > But when the DWARF data has been stripped into a separate Wasm file, that file should be directly loaded by LLDB in order to load the debug sections.
> > > So, if I am not wrong, only in the first case we should assume that the debug info is loaded into memory, and have vm_size != 0?
> > You're right, this is getting pretty confusing, as a lot of the concepts we're talking about here are overloaded (and not even consistently used within lldb). Let me try to elaborate here.
> > 
> > I'll start by describing various Section fields (I'll use ELF as a reference):
> > - file offset (`m_file_offset`) - where the section is physically located in the file. It does not matter if the file is loaded from inferior memory or not (in the former case, this is an offset from location of the first byte of the file). In elf this corresponds to the `sh_offset` field.
> > - file size (`m_file_size`) - size of the section in the file. Some sections don't take up any space in the file (`.bss`), so they have this zero. This is roughly the `sh_size` field in elf (but it is adjusted for SHT_NOBITS sections like .bss)
> > - file address (`m_file_addr`) - The address where the object file says this section should be loaded to. Note that this may not be the final address due to ASLR or such. It is the job of the SetLoadAddress function to compute the actual final address (which is then called the "load address"). This corresponds to the `sh_addr` field in elf, but it is also sometimes called the "vm address", because of the `p_vaddr` program header field
> > - vm size (`m_byte_size) size of the section when it gets loaded into memory. Sections which don't get loaded into memory have this as 0. This is also "rougly" `sh_size`, but only for SHF_ALLOC sections.
> > 
> > All of these fields are really geared for the case where lldb opens a file from disk, and then uses that to understand the contents of process memory. They become pretty redundant if you have loaded the file from memory, but it should still be possible to assign meaningful values to each of them.
> > 
> > The file offset+file size combo should reflect the locations of the sections within the file (regardless of whether it's a "real" file or just a chunk of memory). And the file address+vm size combo should reflect the locations of the sections in memory (modulo ASLR, and again independently of where lldb happened to read the file from).
> > 
> > Looking at the patch, I think you've got most of these right. The only question is, what is the actual memory layout of a wasm debug info in the inferior process. I originally highlighted this because I was assuming the typical elf model where the debug info is not loaded into memory and the debugger loads it from a file. (jitted files are an afterthought in elf, and not that well supported/tested). However, now I get the impression that the wasm debug info is actually always loaded into memory (assuming it is present, that is), and so setting the vm size to non zero might actually be correct here. You'll need to make the call there, as I don't know how wasm actually works. (Note that you needn't concern yourself much with separate debug files here -- the vm size only really matters once you call SetLoadAddress to actually mark the object file as loaded into the process, and we never call SetLoadAddress on a separate debug file).
> Thanks for the clarification!
> 
> WebAssembly modules are not really loaded into the memory of the inferior process like native executables are. The Wasm/JavaScript engine loads the module and executes it, either interpreting the Wasm bytecode or jit-compiling it into native code. Furthermore, in WebAssembly the code is distinct from the addressable memory. 
> All this poses some interesting problems to enable debugging with LLDB. 
> 
> My idea is that a Wasm engine that wants to support LLDB debugging should implement a GDB remote stub and LLDB would not directly access the inferior memory but only access it by talking with the engine through the remote protocol. From what concerns debug info, by default Clang embeds it in a few custom sections in the Wasm module itself. But normally these debug sections should be stripped into a separate Wasm file (we should modify llvm-objcopy for this) because they are not useful in the engine and take a lot of space.
> 
> Usually an engine loads and keeps the whole Wasm module into memory, so when the debug info is embedded in the module, we can say that the debug info is loaded into the inferior process but it’s only accessible through the engine. Here SetLoadAddress will be used to specify at which ‘virtual address’ the engine loaded the module; for example a module with id==4 will be loaded at address 0x00000004`00000000 (logic for this will be implemented in a new class DynamicLoaderWasm). LLDB will use gdb-remote commands like ‘m’ to read chunks of the debug sections from a loaded Wasm modules, and the remote stub identifies the module from the id encoded in the address.
> 
> When the debug info is in a separate file, as you say we don’t call SetLoadAddress on that file, and instead LLDB will use m_file_offset, m_file_size to read the debug directly from the file (m_file_offset, m_file_size can be 0). The code above should be consistent with this logic.
> 
Cool, thanks for expaining. I think we've both learned a lot here.

================
Comment at: lldb/source/Utility/DataExtractor.cpp:914
+/// Extract an unsigned LEB128 number with a specified max value. If the
+/// extracted value exceeds "max_value" the offset will be left unchanged and
+/// llvm::None will be returned.
----------------
It doesn't look like this actually happens, does it? (If max_value is exceeded, the offset will still be updated, right?).

And overall, I am not very happy with backdooring an api inconsistent with the rest of the DataExtractor (I am aware it was clayborg's idea). Overall, it would probably be better to use the llvm DataExtractor class, which has the Cursor interface designed to solve some of the problems you have here (it can handle EOF, it cannot check the uleb magnitude). And it tries to minimize the number of times you need to error check everything. The usage of it could be something like:
```
llvm::DataExtractor llvm_data = lldb_data.GetAsLLVM();
llvm::DataExtractor::Cursor c(0);
unsigned id = llvm_data.GetU8(c);
unsigned payload_len = llvm_data.GetULEB128(c);
if (!c)
  return c.takeError();
// id and payload_len are valid here
if (id == 0) {
  unsigned name_len = llvm_data.GetULEB128(c);
  SmallVector<uint8_t, 32> name_storage;
  llvm_data.GetU8(c, name_storage, name_len);
  if (!c)
    return c.takeError();
  // name_len and name valid here
  StringRef name = toStringRef(makeArrayRef(name_storage));
  unsigned section_length = ...;
  m_sect_infos.push_back(...)
}
```
This won't handle the uleb magnitude check, but these checks seem irrelevant and/or subsumable by other, more useful checks: a) Checking the name length is not necessary, as the code will fail for any names longer 1024 anyway (as that's the amount of data you read); b) instead of `section_len < 2^32` it seems more useful to check that `*offset_ptr + section_len` is less than `2^32`, to make sure we don't wrap the `module_id` part of the "address".

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71575/new/

https://reviews.llvm.org/D71575