[Lldb-commits] [PATCH] D113098: [lldb] (Partially) enable formatting of utf strings before the program is started

Mon Nov 15 17:00:43 PST 2021

jingham requested changes to this revision.
jingham added a comment.
This revision now requires changes to proceed.

It makes sense for the string formatters to try reading from the target if there's no process.  So the general idea is fine.

But one of the other major differences between the Target before a process has been launched and attached to; and its state after attachment; is that before running there isn't a unique mapping from a scalar address (lldb::addr_t) to a specific byte in a specific Object File.  For instance, on the systems where lldb can figure it out the dependent libraries before execution, lldb will determine & load into the Target not just the binary handed to it by the "file" command, but also the closure of that binary's direct dependents.  That's handy, it means that you can inspect disassembly, set breakpoints, etc. seeing all the libraries your binary depends on before launch.  However, on many systems libraries aren't given unique load addresses.  They generally just get set to load at 0x0, and the loader is the one that gives them all unique slots in the memory space when the process runs.  So Modules in the Target might and often do overlap each other in the "before running" address space.

When you are asking a process for memory, there's only one value at any given address, so all those API's can take an lldb::addr_t, the Process will always be able to unambiguously turn that back into either a plain address (if the memory is on the heap or stack) or a section relative address if the address is in TEXT/DATA, etc. of some loaded binary.  But when you start asking the Target questions, you can't rely on this fact.

The lldb_private::Address class handles that by recording the address as section in a Module + offset, inherently picking the right binary.  So using Addresses we can always record the user's intentions correctly.  For instance, if you get the Address from a ValueObject representing a global variable which is in the DATA section of a target, that Address will record both the data section, and the variable's offset in that sections.   But for that to work, you have to be careful to preserve the Address throughout the process, and never go back & forth from lldb_private::Address to lldb::addr_t until you actually have to request bytes of something (process, file, etc).  Otherwise you might start out trying to read a string in library A, but end up reading the memory for it from library B...

Preserving the Address as long as is practicable is actually a general good practice in lldb.  Even if you have a live process, when you go from section-relative Address to addr_t and back to Address, we have to re-lookup that address in the section load list, which is inefficient.

You make this mistake a few places in this patch, which I've pointed out in the detailed comments.

================
Comment at: lldb/include/lldb/Target/Target.h:1028
+  ///
+  /// This function will read a cache page at a time until a NULL string
+  /// terminator is found. It will stop reading if an aligned sequence of NULL
----------------
ReadCStringFromMemory (both flavors) set "force_live_memory" to be true.  Note that doesn't mean "don't read from the target" it means "always read from the process first."  It does fall back to the Target if there's no process, AND if your Address is section relative.

OTOH, you make it a default argument and then set it to "false" - the opposite behavior.  Is there a reason for that?  That seems confusing.

force_live_memory is really an efficiency thing.  It will only make a difference if the data to be read is in a read-only section of a binary, in which case it will read from there.  lldb always used to prefer reading locally if it can, but if someone has been able to flip permissions and overwrite read-only DATA through some kind of attack, you wouldn't be able to see that if this is the default.  So in actual fact, we pass force_live_memory -> true most of the places we use these Target Memory Reading API's.

So it seems better to follow ReadCStringFromMemory and just force this to true.

================
Comment at: lldb/source/Plugins/Language/CPlusPlus/CxxStringTypes.cpp:58
   StringPrinter::ReadStringAndDumpToStreamOptions options(valobj);
   options.SetLocation(valobj_addr);
+  options.SetTargetSP(valobj.GetTargetSP());
----------------
This should take some form of lldb_private::Address not an lldb::addr_t or it won't work pre-run.  See the overall comment for details.

================
Comment at: lldb/source/Target/Target.cpp:1911
+  size_t total_bytes_read = 0;
+  if (dst && max_bytes && type_width && max_bytes >= type_width) {
+    // Ensure a null terminator independent of the number of bytes that is
----------------
Process::ReadStringFromMemory is a pretty old function, and hasn't been fully converted to llvm style, which prefers to check things like `dst != nullptr` up front and do an early return if anything fails.  That reduces indentation, and spaces are a precious commodity if you restrict yourself to 80 characters on a line.

Also, the error message was originally pretty useless, but there's no reason to propagate that now that you've touched it ;-)

================
Comment at: lldb/source/Target/Target.cpp:1922
+
+    addr_t curr_addr = addr.GetLoadAddress(this);
+    Address address = addr;
----------------
This is wrong.  If the incoming address was section relative, you've thrown that information away.  Just pass Target::ReadMemory the full Address (addr), it will know what to do with it.  And you can use:

addr.Slide(bytes_read);

instead of lines 1951-1952 to update the address.

================
Comment at: lldb/source/Target/Target.cpp:1924
+    Address address = addr;
+    const size_t cache_line_size = 512;
+    char *curr_dst = dst;
----------------
This is a seemingly arbitrary choice.  

>From what I can tell you are copying this from GetCStringFromMemory, so the decision to do this isn't yours...

IIUC the whole point of this paging so we don't  read a big chunk of memory and then find the terminator a couple of characters in.  The optimal size for that doesn't have to line up with the process cache size, but I don't see why you wouldn't use that if it was handy.

 Anyway, having this here with no comment is confusing.  Either repeat the original justification or get the process cache line size if there's a process.

================
Comment at: lldb/test/API/lang/cpp/char8_t/TestCxxChar8_t.py:25
+        # Make sure the variables can be printed without a running process.
+        self.expect("target variable a", substrs=["char8_t", "0x61 u8'a'"])
+        self.expect("target variable ab",
----------------
It's a shame we don't have SBTarget.GetValueForVariablePath() - the equivalent of SBFrame.GetValueForVariablePath.  If we did we could make a global variable version of lldbtest.expect_var_path, which would make this test more precise.  But we don't...

================
Comment at: lldb/test/API/lang/cpp/char8_t/TestCxxChar8_t.py:29
+        self.expect("target variable abc", substrs=["char8_t[9]", 'u8"你好"'])
+        # TODO: Make this work through expression evaluation as well
+
----------------
Rather than a TODO, it's better to put in a test here, then expect it to fail.  That way if someone comes upon this comment later on, they won't have to guess what didn't work for you.  And if somebody fixes this while doing something else, the unexpected pass will let them know.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113098/new/

https://reviews.llvm.org/D113098