[Lldb-commits] [PATCH] D71575: [LLDB] Add ObjectFileWasm plugin for WebAssembly debugging

Wed Jan 8 11:08:52 PST 2020

paolosev added inline comments.

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:49
+  lldb::offset_t initial_offset = *offset_ptr;
+  uint64_t value = section_header_data.GetULEB128(offset_ptr);
+  if (*offset_ptr == initial_offset || value > 127)
----------------
clayborg wrote:
> Is it ok if we consume more than 1 byte here?  What is the offset points to a larger ULEB, are we ok with advancing the offset by multiple bytes or should we back it up and return llvm::None?
> 
> This might be a good candidate to add to DataExtractor directly as:
> 
> ```
> /// Extract a ULEB128 number with a specified max value. If the extracted value exceeds 
> /// "max_value" the offset will be left unchanged and llvm::None will be returned.
> llvm::Optional<uint8_t> DataExtractor::GetULEB128(uint64_t *offset_ptr, uint64_t max_value);
> ```
> 
> There are many places where we extract a uint64_t, but only need a uint16_t (like in the DWARF parser where all DW_TAG_XXXX, DW_AT_XXX and DW_FORM_XXX values must only be uint16_t values but are encoded as ULEB128 values. So this could be used elsewhere if we do put it into DataExtractor.
Modified in DataExtractor, as suggested. However we need to return a `llvm::Optional<uint64_t>`, which can be not ideal because the result could need to be casted to another integer type.
We could also male `DataExtractor::GetULEB128` templatized on the integer type, getting `max_value` as `std::numeric_limits<int>::max()`, but I don't want to overly complicate the code.

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:162-163
+  // - the actual contents.
+  llvm::Optional<uint8_t> section_id =
+      GetVaruint7(section_header_data, &offset);
+  if (!section_id)
----------------
clayborg wrote:
> Is this a one byte section ID or is it a ULEB? Not sure why it would be encoded as a ULEB if it is always one byte? IF this really is just a one byte value, then replace with:
> 
> ```
> uint8_t section_id = data.GetU8(&offset);
> ```
You are right, it is a one-bye section. I cannot use `data.GetU8(&offset);` though, because it returns 0 on failure. 

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:342-343
+                        section_type,                  // Section type.
+                        sect_info.offset & 0xffffffff, // VM address.
+                        sect_info.size, // VM size in bytes of this section.
+                        sect_info.offset &
----------------
labath wrote:
> paolosev wrote:
> > labath wrote:
> > > Are the debug info sections actually loaded into memory for wasm? Should these be zero (that's what they are for elf)?
> > Yes, I was thinking that the debug sections should be loaded into memory; not sure how this works for ELF, how does the debugger find the debug info in that case?
> Are you referring to the load-from-memory, or load-from-file scenario? Normally, the debug info is not loaded into memory, and if lldb is loading the module from a file, then the debug info is loaded from the file (and we use the "file offset" field to locate them). Loading files from memory does not work in general (because we can't find the whole file there -- e.g., section headers are missing). The only case it does work is in case of jitted files. I'm not 100% sure how that works, but given that there is no `if(memory)` block in the section creation code, those sections must too get `vm size = 0`.
> 
> In practice, I don't think it does not matter that much what you put here -- I expect things will mostly work regardless. I am just trying to make this consistent in some way. If these sections are not found in the memory at the address which you set here (possibly adjusted by the load bias) then I think it makes sense to set `vm_size=vm_addr=0`. If they are indeed present there, then setting it to those values is perfectly fine.
Modified, but I am not sure I completely understood the difference of file addresses and vm addresses.

In the case of Wasm, we can have two cases:
a) Wasm module contains executable code and also DWARF data embedded in DWARF sections
b) Wasm module contains code and has a custom section that points to a separated Wasm module that only contains DWARF sections

The file of Wasm modules that contains code should never be loaded directly by LLDB; LLDB should use GDB-remote to connect to the Wasm engine that loaded the module and retrieve the module content from the engine.
But when the DWARF data has been stripped into a separate Wasm file, that file should be directly loaded by LLDB in order to load the debug sections.
So, if I am not wrong, only in the first case we should assume that the debug info is loaded into memory, and have vm_size != 0?

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:382
+
+  DecodeSections(load_address);
+
----------------
labath wrote:
> paolosev wrote:
> > labath wrote:
> > > This is strange.. I wouldn't expect that the section decoding logic should depend on the actual address that the object is loaded in memory. Can you explain the reasoning here?
> > There is a reason for this: DecodeNextSection() calls:
> > ```ReadImageData(offset, size)```
> > and when the debug info sections are embedded in the wasm module (not loaded from a separated symbol file), ReadImageData() calls Process::ReadMemory() that, for a GDB remote connection, goes to ProcessGDBRemote::DoReadMemory(). Here I pass the offset because it also represents the specific id of the module loaded in the target debuggee process, as explained in the comment above.
> What you say about ReadMemory makes sense, but it's not clear to me why you couldn't use the value in `m_memory_addr` for this. For in-memory object file this value should contain the actual address the object file was loaded from (which, I would expect, will include the `module_id` business), and so you wouldn't need the dynamic loader address in order to locate it. I believe this is how the other object file plugins do their in-memory loading..
Good point! Changed.

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:34
+/// Checks whether the data buffer starts with a valid Wasm module header.
+static bool ValidateModuleHeader(const DataBufferSP &data_sp) {
+  if (!data_sp || data_sp->GetByteSize() < kWasmHeaderSize)
----------------
clayborg wrote:
> I typically would put "data_sp" into a DataExtractor and extract a uint32_t and then check the decoded value with something like:
> 
> ```
> static bool ValidateModuleHeader(DataExtractor &data, uint64_t *offset_ptr) {
>   auto magic = data.GetU32(offset_ptr);
>   if (magic == WASM_MAGIC)
>     return true;
>   if (magic == WASM_CIGAM) {
>     // Set byte order in DataExtractor
>     data.SetByteOrder(data.GetByteOrder() == eByteOrderBig ? eByteOrderLittle : eByteOrderBig);
>     return true;
>   }
>   return false;
> }
> ```
> 
> This function expects a DataExtractor to be passed in that has "data_sp" inside of it with the host endian set as the byte order. It will set the byte order correctly. It also expects to have two uint32_t macros defined: WASM_MAGIC and WASM_CIGAM. These contain the non byte swapped and the byte swapped magic values. Easy to replace those with real definitions from else where (llvm::file_magic::wasm_object? Not sure of the type of this though, seemed like a StringRef).
Not sure about this... WebAssembly is always little-endian and a DataExtractor is not really needed here. 
I made sure to set the byte order where the DataExtractor is created in `ReadImageData`.

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:59
+/// Reads a LEB128 variable-length unsigned integer, limited to 32 bits.
+static llvm::Optional<uint32_t> GetVaruint32(DataExtractor &section_header_data,
+                                             lldb::offset_t *offset_ptr) {
----------------
clayborg wrote:
> remove if we add:
> ```
> llvm::Optional<uint8_t> DataExtractor::GetULEB128(uint64_t *offset_ptr, uint64_t max_value);
> ```
Actually, section_id is encoded as a byte not as a varuint7, so I modified the code accordingly, with `llvm::Optional<uint8_t> GetByte(DataExtractor &, lldb::offset_t *)`.
Even this could be moved to DataExtractor, maybe?

================
Comment at: lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp:187-195
+    const uint8_t *name_bytes = section_header_data.PeekData(offset, *name_len);
+    // If a custom section has a name longer than the allocated buffer or longer
+    // than the data left in the image, ignore this section.
+    if (!name_bytes)
+      return false;
+
+    llvm::StringRef sect_name(reinterpret_cast<const char *>(name_bytes),
----------------
clayborg wrote:
> All these lines can use the GetCStr with a length:
> ```
> ConstString sect_name(data.GetCStr(&offset, *name_len));
> if (!sect_name)
>   return false;
> ```
I had tried to use `DataExtractor::GetCStr()` but it doesn't work because it wants null-terminated strings, and this is not the case in Wasm, where strings are encoded as len (varuint32) followed by an array of len bytes that represent UTF8 chars (https://webassembly.github.io/spec/core/binary/values.html#names).

================
Comment at: lldb/test/Shell/ObjectFile/wasm/basic.yaml:21-79
+Sections:
+  - Type:            TYPE
+    Signatures:
+      - Index:           0
+        ParamTypes:
+          - I32
+        ReturnTypes:
----------------
labath wrote:
> Could you remove the sections which are not relevant for this test? This makes it easier to see the correspondence between the yaml and the test expectations..
Ok. What can be confusing is that Wasm modules (almost) always have at least a few standard sections (MEMORY, GLOBAL, CODE, ...) but these sections can be ignored by LLDB, so also by these tests.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71575/new/

https://reviews.llvm.org/D71575