[lldb-dev] [RFC] Segmented Address Space Support in LLDB

Mon Oct 26 15:45:02 PDT 2020

> On Oct 22, 2020, at 1:25 AM, Jason Molenda <jmolenda at apple.com> wrote:
> 
> Hi Greg, Pavel.
> 
> I think it's worth saying that this is very early in this project.  We know we're going to need the ability to track segments on addresses, but honestly a lot of the finer details aren't clear yet.  It's such a fundamental change that we wanted to start a discussion, even though I know it's hard to have detailed discussions still.
> 
> In the envisioned environment, there will be a default segment, and most addresses will be in the default segment.  DWARF, user input (lldb cmdline), SB API, and clang expressions are going to be the places where segments are specified --- Dump methods and ProcessGDBRemote will be the main place where the segments are displayed/used.  There will be modifications to the memory read/write gdb RSP packets to include these.
> 
> This early in the project, it's hard to tell what will be upstreamed to the llvm.org <http://llvm.org/> monorepo, or when.  My personal opinion is that we don't actually want to add segment support to llvm.org <http://llvm.org/> lldb at this point.  We'd be initializing every address object with LLDB_INVALID_SEGMENT or LLDB_DEFAULT_SEGMENT, and then testing that each object is initialized this way?  I don't see this actually being useful.
> 
> However, changing lldb's target addresses to be strictly handled in terms of objects will allow us to add a segment discriminator ivar to Address and ProcessAddress on our local branch while this is in development, and minimize the places where we're diverging from the llvm.org <http://llvm.org/> sources.  We'll need to have local modifications at the places where a segment is input (DWARF, cmdline, SB API, compiler type) or output (Dump, ProcesssGDBRemote) and, hopefully, the vast majority of lldb can be unmodified.
> 
> The proposal was written in terms of what we need to accomplish based on our current understanding for this project, but I think there will be a lot of details figured out as we get more concrete experience of how this all works.  And when it's appropriate to upstream to llvm.org <http://llvm.org/>, we'll be better prepared to discuss the tradeoffs of the approaches we took in extending Address/ProcessAddress to incorporate a segment.
> 
> My hope is that these generic OO'ification of target addresses will not change lldb beyond moving off of addr_t for now.
> 
> I included a couple of inlined comments, but I need to address more of yours & Pavel's notes later, I've been dealing with a few crazy things and am way behind on emails but didn't want to wait any longer to send something out.

No worries! I would vote to upstream as much as possible as soon as possible to avoid differences and merging issues for you guys. I would really like to see LLDB have support for segmented address spaces. Many comments I made were just my thinking out loud and trying to ease the changes in with as little disruption as possible.
> 
> 
> 
>> On Oct 19, 2020, at 4:11 PM, Greg Clayton via lldb-dev <lldb-dev at lists.llvm.org> wrote:
>> 
>> 
>> 
>>> On Oct 19, 2020, at 2:56 PM, Jonas Devlieghere via lldb-dev <lldb-dev at lists.llvm.org> wrote:
>>> 
>>> We want to support segmented address spaces in LLDB. Currently, all of LLDB’s external API, command line interface, and internals assume that an address in memory can be addressed unambiguously as an addr_t (aka uint64_t). To support a segmented address space we’d need to extend addr_t with a discriminator (an aspace_t) to uniquely identify a location in memory. This RFC outlines what would need to change and how we propose to do that.
>>> 
>>> ### Addresses in LLDB
>>> 
>>> Currently, LLDB has two ways of representing an address:
>>> 
>>> - Address object. Mostly represents addresses as Section+offset for a binary image loaded in the Target. An Address in this form can persist across executions, e.g. an address breakpoint in a binary image that loads at a different address every execution. An Address object can represent memory not mapped to a binary image. Heap, stack, jitted items, will all be represented as the uint64_t load address of the object, and cannot persist across multiple executions. You must have the Target object available to get the current load address of an Address object in the current process run. Some parts of lldb do not have a Target available to them, so they require that the Address can be devolved to an addr_t (aka uint64_t) and passed in.
>>> - The addr_t (aka uint64_t) type. Primarily used when receiving input (e.g. from a user on the command line) or when interacting with the inferior (reading/writing memory) for addresses that need not persist across runs. Also used when reading DWARF and in our symbol tables to represent file offset addresses, where the size of an Address object would be objectionable.
>>> 
>> 
>> Correction: LLDB has 3 kinds of uint64_t addresses:
>> - "file address" which are always mapped to a section + offset if put into a Address object. This value only makes sense to the lldb_private::Module that contains it. The only way to pass this around is as a lldb_private::Address. You can make queries on a file address using "image lookup --address" before you are debugging, but a single file address can result in multiple matches in multiple modules because each module might contain something at this virtual address. This object might be able to be converted to a "load address" if the section is loaded in your debug target. Since the target contains the section load list, the target is needed when converting between Address and addr_t objects.
>> - "load address" which is guaranteed to be unique in a process with no segments. It can always be put into a lldb_private::Address object, but that object won't always have a section. If there is no section, it means the memory location maps to stack, heap, or other memory that doesn't reside in a object file section. This object might be able to be converted to a section + offset address if the address matches one of the loaded sections in a target. If this can be converted to a Address object that has a section, then it can persist across debug sessions, otherwise, not.
>> - "host address" which is a pointer to memory in the LLDB process itself. Used for storing expression results and other things. You cannot convert this to/from a "file" or "load" address.
> 
> Yes, good point, host memory is a third type of address that we use.  And our symbols tables, for instance, internally represent themselves as uint64_t offsets into the file or section, I forget which, and we're not talking about changing those uint64_t style addresses.  On our project, I do not believe the symbol table will give us segment information.

You should be able to classify symbols from the symbol table to a segment though right?

We could add a special define for host addresses if needed.

> 
> 
> 
>> 
>>> ## Proposal
>>> 
>>> ### Address + ProcessAddress
>>> 
>>> - The Address object gains a segment discriminator member variable. Everything that creates an Address will need to provide this segment discriminator.
>> 
>> So an interesting thing to think about is if lldb_private::Section object should contain a segment identifier? If this is the case, then an Address object can have a Section that has a segment _and_ the Address object itself might have one that was set from the section as well. It would be good to figure out what the rules are for this case and it might lead to the need for an intelligent accessor that always prefers the section's segment if a section is available. The Address object must have one in case we have a pointer to memory in data and there is no section for this (like any heap addresses).
> 
> I don't believe a Section in this project will have a segment.  We're looking purely at individual variables, primarily from debug information.

So if you have a global variable, it will have a symbol right? And it will have debug info. Are you saying that only the debug info would have segment info? It seems important to be able to view a global variable without debug info.

> 
> 
>>> - A ProcessAddress object which is a uint64_t and a segment discriminator as a replacement for addr_t. ProcessAddress objects would not persist across multiple executions. Similar to how you can create an addr_t from an Address+Target today, you can create a ProcessAddress given an Address+Target. When we pass around addr_ts today, they would be replaced with ProcessAddress, with the exception of symbol tables where the added space would be significant, and we do not believe we need segment discriminators today.
>> 
>> Would SegmentedAddress be a more descriptive name here?
>> 
>> A few things I would like to see on ProcessAddress or SegmentedAddress:
>> - Have a segment definition that says "no segment" like LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT and allow these objects to be constructed with just a lldb::addr_t and the segment gets auto set to LLDB_NO_SEGMENT
> 
>> - Any code that uses these should test if there is no segment and continue to do what they used to do before
>> - like read/write memory in ProcessGDBRemote
> 
> 
> To be honest, testing this is going to be one of the tricky things I'm not sure how we'll do.  we will have a default segment that addresses will use unless overridden, but how we spot places that *incorrectly* failed to initialize the segment of an Address/ProcessAddress is something we're going to need to figure out.  

Tests I can think of:
- read function disassembly from a code segment that would have the same address as something from a data segment
- read a variable from a data segment that would have the same address as something from a code segment

It would be good to figure out where segments are going to come from. I would hope that some sections would be able to be mapped to certain segments so that we can live with a binary that has no debug info and still read say a global variable. I know we can do things right inside of the debug info since DWARF can have segment information.

> 
> 
>> - Anything that dumps one of these objects should dump just like they used to (just a uint64_t hex representation and no other notation)
>> - Add code that can convert a "load address" into a ProcessAddress or SegmentedAddress that invent the segment notation and have no changes for targets that don't support segmented address spaces
>> - 0x1000 should convert to ProcessAddress where the address is 0x1000 and segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT if the process doesn't support segmented addresses
> 
> 
>> - 0x1000 would return an error on conversion for processes that do support segmented addresses as the segment must be specified? Or should there be a default segment if we run into this case?
>> - Come up with some quick way to represent segmented addresses for an address of 0x1000 in segment 2: ideas:
>>   - [2]0x1000
>>   - {2}0x1000
>>   - 0x1000[2]
>>   - 0x1000{2}
>>   - {0x1000, 2}
> 
> To be honest, we haven't thought about the UI side of this very much yet.  I think there will be ABI or ArchSpec style information that maps segment numbers to human-understandable names.  It's ABI style enumerated numbers - the DWARF will include a number that is passed down to the remote gdb stub.

Should be fine to have named segments if needed, but we will need to come up with a way to specify a segment. We could always add new arguments to command line commands if needed.
> 
> 
>> 
>>> 
>>> ### Address Only
>>> 
>>> Extend the lldb_private::Address class to be the one representation of locations; including file based ones valid before running, file addresses resolved in a process, and process specific addresses (heap/stack/JIT code) that are only valid during a run. That is attractive because it would provide a uniform interface to any “where is something” question you would ask, either about symbols in files, variables in stack frames, etc.
>>> 
>>> At present, when we resolve a Section+Offset Address to a “load address” we provide a Target to the resolution API.  Providing the Target externally makes sense because a Target knows whether the Section is present or not and can unambiguously return a load address.    We could continue that approach since the Target always holds only one process, or extend it to allow passing in a Process when resolving non-file backed addresses.  But this would make the conversion from addr_t uses to Address uses more difficult, since we will have to push the Target or Process into all the API’s that make use of just an addr_t.  Using a single Address class seems less attractive when you have to provide an external entity to make sense of it at all the use sites.
>>> 
>>> We could improve this situation by including a Process (as a weak pointer) and fill that in on the boundaries where in the current code we go from an Address to a process specific addr_t.  That would make the conversion easier, but add complexity.  Since Addresses are ubiquitous, you won’t know what any given Address you’ve been handed actually contains.  It could even have been resolved for another process than the current one.  Making Address usage-dependent in this way reduces the attractiveness of the solution.
>>> 
>>> ## Approach
>>> 
>>> Replacing all the instances of addr_t by hand would be a lot of work. Therefore we propose writing a clang-based tool to automate this menial task. The tool would update function signatures and replace uses of addr_t inside those functions to get the addr_t from the ProcessAddress or Address and return the appropriate object for functions that currently return an addr_t. The goal of this tool is to generate one big NFC patch. This tool needs not be perfect, at some point it will be more work to improve the tool than fixing up the remaining code by hand. After this patch LLDB would still not really understand address spaces but it will have everything in place to support them.
>> 
>> This won't be NFC really as each location that plays with what used to be addr_t now must check if the segment is invalid before doing what it did before _and_ return an error if the segment is something valid.
>> 
>> It might be better to look at all of the APIs that could end up using a plain "addr_t" and adding new APIs that take a ProcessAddress and call the old API if the segment is LLDB_INVALID_SEGMENT or LLDB_NO_SEGMENT, and return an error if the segment is valid. For example in the Process class we have:
>> 
>> virtual size_t Process::DoReadMemory(lldb::addr_t vm_addr, void *buf, size_t size, Status &error) = 0;
>> 
>> We could add a new overload:
>> 
>> virtual size_t Process::DoReadMemory(ProcessAddress proc_addr, void *buf, size_t size, Status &error) {
>> if (proc_addr.GetSegment() == LLDB_NO_SEGMENT)
>>   return DoReadMemory(proc_addr.GetAddress(), but, size, error);
>> error.SetErrorString("segmented addresses are not supported on this process");
>> return 0
>> }
>> 
>> Then we can start modifying the locations that need to support segmented addresses as needed. For instance, if we were to add segmented address support to ProcessGDBRemote, then we would override this function in that class.
>> 
>> I am not sure if slowly adding this functionality is better than replacing this all right away, but we can't just do a global replace without adding functionality or error checking IMHO.
>> 
>> 
>>> Once all the APIs are updated, we can start working on the functional changes. This means actually interpreting the aspace_t values and making sure they don’t get dropped.
>>> 
>>> Finally, when all this work is done and we’re happy with the approach, we extend the SB API with overloads for the functions that currently take or return addr_t . I want to do this last so we have time to iterate before committing to a stable interface.
>> 
>> This might be one reason for doing the approach suggested above where we add new internal APIs that take a ProcessAddress and cut over to using them. As it would mean all of the current APIs in the lldb::SB layer would remain in place (they can't be removed) and would still make sense.
>> 
>>> 
>>> ## Testing
>>> 
>>> By splitting off the intrusive non-functional changes we are able to rely on the existing tests for coverage. Smaller functional changes can be tested in isolation, either through a unit test or a small GDB remote test. For end-to-end testing we can run the test suite with a modified debugserver that spoofs address spaces.
>> 
>> That makes sense. ProcessGDBRemote will need to dynamically respond with wether it supports segmented addresses by overloading the DoReadMemory that takes a ProcessAddress and do the right thing.
>> 
>> Thanks for taking this on. I hope some of the comments above help moving this forward.
>> 
>> Greg
>> 
>>> 
>>> Thanks,
>>> Jonas
>>> 
>>> _______________________________________________
>>> lldb-dev mailing list
>>> lldb-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
>> 
>> _______________________________________________
>> lldb-dev mailing list
>> lldb-dev at lists.llvm.org <mailto:lldb-dev at lists.llvm.org>
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/lldb-dev/attachments/20201026/2ed416fd/attachment-0001.html>