[llvm-dev] DWARF .debug_aranges data objects and address spaces

Thu Mar 12 16:22:06 PDT 2020

On Thu, Mar 12, 2020 at 1:51 PM Robinson, Paul <paul.robinson at sony.com>
wrote:

> I’ve encountered this kind of architecture before, a long time ago
> (academically).  In a flat-address-space machine such as X64, there is
> still an instruction/data distinction, but usually only down at the level
> of I-cache versus D-cache (instruction fetch versus data fetch).  A Harvard
> architecture machine exposes that to the programmer, which effectively
> doubles the available address space.  Code and data live in different
> address spaces, although the address space identifier per se is not
> explicit.  A Move instruction would implicitly use the data address space,
> while an indirect Branch would implicitly target the code address space.
> An OS running on a Harvard architecture would require the loader to be
> privileged, so it can map data from an object file into the code address
> space and implement any necessary fixups.  Self-modifying code is at least
> wicked hard if not impossible to achieve.
>
>
>
> In DWARF this would indeed be described by a segment selector.  It’s up to
> the target ABI to specify what the segment selector numbers actually are.
> For a Harvard architecture machine this is pretty trivial, you say
> something like 0 for code and 1 for data.  Boom done.
>
>
>
> LLVM basically doesn’t have targets like this, or at least it has never
> come up before that I’m aware of.  So, when we emit DWARF, we assume a flat
> address space (unconditionally setting the segment selector size to zero),
> and llvm-dwarfdump will choke (hopefully cleanly, but still) on an object
> file that uses DWARF segment selectors.
>

FWIW, Luke mentioned in the original email that the in-tree AVR backend seems
to have this problem with ambiguous debug_aranges entries.
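
To make the header field under discussion concrete, here is a minimal sketch
of reading one .debug_aranges set, including the segment selector size that
LLVM currently always writes as zero. It assumes 32-bit DWARF, little-endian
data, and padding of the first tuple measured from the start of the set; the
function names are hypothetical and this is not LLVM's or llvm-dwarfdump's
code.

```python
import struct

HEADER_SIZE = 12  # unit_length(4) + version(2) + debug_info_offset(4) + 2 size bytes


def build_aranges_set(cu_offset, addr_size, seg_size, tuples):
    """Hypothetical helper: encode one set so the parser has something to read."""
    tuple_size = seg_size + 2 * addr_size
    pad = (-HEADER_SIZE) % tuple_size
    # The .debug_aranges version field is 2 in every DWARF version so far.
    body = struct.pack("<HIBB", 2, cu_offset, addr_size, seg_size) + b"\0" * pad
    for seg, addr, length in list(tuples) + [(0, 0, 0)]:  # all-zero terminator
        body += seg.to_bytes(seg_size, "little")
        body += addr.to_bytes(addr_size, "little")
        body += length.to_bytes(addr_size, "little")
    return struct.pack("<I", len(body)) + body


def parse_aranges_set(data, offset=0):
    """Decode one set into its header fields and (segment, address, length) tuples."""
    unit_length, version, cu_offset, addr_size, seg_size = struct.unpack_from(
        "<IHIBB", data, offset
    )
    end = offset + 4 + unit_length
    tuple_size = seg_size + 2 * addr_size
    # Assumption: the first tuple is aligned to a multiple of the tuple size,
    # counted from the start of this set.
    pos = offset + HEADER_SIZE + (-HEADER_SIZE) % tuple_size
    tuples = []
    while pos + tuple_size <= end:
        seg = int.from_bytes(data[pos:pos + seg_size], "little")
        pos += seg_size
        addr = int.from_bytes(data[pos:pos + addr_size], "little")
        pos += addr_size
        length = int.from_bytes(data[pos:pos + addr_size], "little")
        pos += addr_size
        if seg == 0 and addr == 0 and length == 0:
            break  # terminator entry
        tuples.append((seg, addr, length))
    return {
        "version": version,
        "cu_offset": cu_offset,
        "address_size": addr_size,
        "segment_selector_size": seg_size,
        "tuples": tuples,
    }


# Selector values 0 (code) and 1 (data) are illustrative; the ABI would pick them.
example = build_aranges_set(0x5C, 4, 1, [(0, 0x100, 0x40), (1, 0x120, 0x10)])
parsed = parse_aranges_set(example)
```

With a selector size of zero the same parser yields tuples whose segment is
always 0, which is the flat-address-space layout emitted today.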

>  The point of .debug_aranges is to accelerate the search for the
> appropriate CU.  Yes you can spend time trolling through .debug_info and
> .debug_abbrev, decoding the CU DIEs looking for low_pc/high_pc pairs (or
> perhaps pointers to .debug_ranges) and effectively rebuild a .debug_aranges
> section yourself, if the compiler/linker isn’t kind enough to pre-build the
> table for you.  I don’t understand why .debug_aranges should be
> discouraged; I shouldn’t think they would be huge, and consumers can avoid
> loading lots of data just to figure out what’s not worth looking at.
> Forcing all consumers to do things the slow way seems unnecessarily
> inefficient.
>

If the producer has put ranges on the CU it's not a lot of work - it's
parsing one DIE & looking for a couple of attributes. With Split DWARF the
cost of aranges becomes a bit more prominent - Sema.o from clang, with split dwarf
(v4 or v5 about the same) is about 3.5% larger with debug aranges (not sure
about the overall data). It's enough at least at Google for us to not use
them & use CU ranges for the same purpose.
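
A rough sketch of what "use CU ranges for the same purpose" can look like on
the consumer side: collect each CU's low_pc/high_pc (or DW_AT_ranges) once,
sort them, and binary-search by PC. The input dictionary stands in for
whatever a real DWARF reader would return, the function names are
hypothetical, and the ranges are assumed to be half-open and non-overlapping.

```python
import bisect


def build_cu_index(cu_ranges):
    """cu_ranges maps a CU offset to a list of (low_pc, high_pc) pairs."""
    flat = sorted(
        (low, high, cu) for cu, ranges in cu_ranges.items() for low, high in ranges
    )
    starts = [low for low, _, _ in flat]
    return starts, flat


def find_cu(index, pc):
    """Return the offset of the CU whose range covers pc, or None."""
    starts, flat = index
    i = bisect.bisect_right(starts, pc) - 1  # last range starting at or before pc
    if i >= 0:
        low, high, cu = flat[i]
        if low <= pc < high:
            return cu
    return None


# Illustrative addresses only; they do not come from any real object file.
index = build_cu_index({
    0x00: [(0x1000, 0x1040), (0x2000, 0x2010)],
    0x5C: [(0x1040, 0x10A0)],
})
```

The index is built from one DIE per CU, so the extra cost over a prebuilt
.debug_aranges table is the initial pass over the CU DIEs.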

I thought I might be able to find some email history about why we turned it
off by default, but seems we never turned it /on/ by default to begin with
& it wasn't implemented until relatively late in the game (well, what I
think of as relatively late - after I started on the project at least).

>  Thinking about Harvard architecture specifically, you **need** the
> segment selector only when an address could be ambiguous about whether it’s
> a code or data address.  This basically comes up **only** in
> .debug_aranges, he said thinking about it for about 30 seconds.  Within
> .debug_info you don’t need it because when you pick up the address of an
> entity, you know whether it’s for a code or data entity.  Location lists
> and range lists always point to code.  For .debug_aranges you would need
> the segment selector, but I think that’s the only place.
>
>
>
> For an architecture with multiple code or data segments, then you’d need
> the segment selector more widely, but I should think this case wouldn’t be
> all that difficult to make work.  Even factoring in the llvm-dwarfdump
> part, it has to understand the selector only for the .debug_aranges
> section; everything else can remain as it is, pretending there’s a flat
> address space.
>
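
To illustrate the ambiguity and how a selector would resolve it, the sketch
below looks up an address in two small tables: one flat, one tagged with
segment selectors. The selector values (0 for code, 1 for data), the entry
type, and the addresses are all assumptions chosen for illustration; a real
target ABI would define the numbering.

```python
from typing import NamedTuple, Optional

SEG_CODE = 0  # hypothetical selector values; the target ABI would define them
SEG_DATA = 1


class ArangeEntry(NamedTuple):
    segment: Optional[int]  # None models a table emitted without selectors
    start: int
    length: int
    cu_offset: int


def lookup(entries, address, segment=None):
    """Return the CU offsets of every entry that could contain the address."""
    hits = []
    for e in entries:
        if not (e.start <= address < e.start + e.length):
            continue
        # Filter on the selector only when both sides actually carry one.
        if segment is not None and e.segment is not None and e.segment != segment:
            continue
        hits.append(e.cu_offset)
    return hits


# One CU has code at 0x100..0x140; another has data at 0x120..0x130.
flat = [
    ArangeEntry(None, 0x100, 0x40, 0x00),
    ArangeEntry(None, 0x120, 0x10, 0x5C),
]
segmented = [
    ArangeEntry(SEG_CODE, 0x100, 0x40, 0x00),
    ArangeEntry(SEG_DATA, 0x120, 0x10, 0x5C),
]

ambiguous = lookup(flat, 0x128)                 # both CUs match
code_only = lookup(segmented, 0x128, SEG_CODE)  # only the code CU matches
```

On a Harvard machine the debugger already knows whether it is asking about
a PC or a data pointer, so passing that knowledge as the selector is enough
to pick the right entry.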
>
>
> Now, if your target is downstream, that would make upstreaming the LLVM
> support a bit dicier, because we’d not want to have that feature in the
> upstream repo if there are no targets using it.  You’d be left maintaining
> that patch on your own.  But as I described above, I don’t think it would
> be a huge deal.
>
>
>
> HTH,
>
> --paulr
>
>
>
> *From:* David Blaikie <dblaikie at gmail.com>
> *Sent:* Thursday, March 12, 2020 2:20 PM
> *To:* Luke Drummond <luke.drummond at codeplay.com>; Adrian Prantl <
> aprantl at apple.com>; Jonas Devlieghere <jdevlieghere at apple.com>; Robinson,
> Paul <paul.robinson at sony.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] DWARF .debug_aranges data objects and address
> spaces
>
> On Thu, Mar 12, 2020 at 11:00 AM Luke Drummond <luke.drummond at codeplay.com>
> wrote:
>
> On Thu Mar 12, 2020 at 5:37 PM, David Blaikie wrote:
> > On Wed, Mar 11, 2020 at 8:09 AM Luke Drummond
> > <luke.drummond at codeplay.com>
> > wrote:
> >
> > > On Tue Mar 10, 2020 at 7:45 PM, David Blaikie wrote:
> > > > If you only want code addresses, why not use the CU's low_pc/high_pc/ranges
> > > > - those are guaranteed to be only code addresses, I think?
> > > >
> > > In the common case, for most targets LLVM supports, I think you're
> > > right, but for my case, regrettably, not. Because my target is a Harvard
> > > Architecture, any code address can have the same ordinal value as any
> > > data address: the code and data reside on different buses so the whole
> > > 4GiB space is available to both code and data. `DW_AT_low_pc` and
> > > `DW_AT_high_pc` can be used to find the range of the code segment, but
> > > given an arbitrary address, cannot be used to conclusively determine
> > > whether that address belongs to code or data when both segments contain
> > > addresses in that numeric range.
> >
> >
> > Sorry I'm not following, partly probably due to my not having worked with
> > such machines before.
> >
> > But how are the code addresses and data addresses differentiated then (eg:
> > if you had segment selectors in debug_aranges, how would they be used? The
> > addresses taken from the system at runtime have some kind of segment
> > selector associated with them, that you can then use to match with the
> > addr+segment selector in aranges?).
> Yes. This. The system mostly provides us the ability to disambiguate
> addresses because the device's simulator / debugger make this
> unambiguous, but the current .debug_aranges does not allow us to do this
> because it's missing such info.
> >
> > Actually, coming at it from a different angle: It sounds like in the
> > original email you're suggesting if debug_aranges did not contain data
> > addresses, this would be good/sufficient for you? So somehow you'd be
> > ensuring you only query debug_aranges using things you know are code
> > addresses, not data addresses? So why would the same solution/approach not
> > hold to querying low/high/ranges on a CU that's already guaranteed not to
> > contain data addresses?
> That's the root of the issue: the .debug_aranges section emitted by llvm
> *does* contain data addresses by default and therefore can be ambiguous.
> I've worked around this locally by hacking llvm to only emit aranges for
> text objects,
>
>
> Sorry, but I'm still not understanding why "aranges for only text objects"
> is more usable for your use case than "high/low/ranges on the CU"? Could
> you help me understand how those are different in your situation?
>
>
> but I was wondering if it's something that's valuable to
> fix upstream. My guess is that it's probably too niche to worry about
> for the moment, but if there's interest I can propose a design (probably
> a target hook to ask if segment selectors are required and how to get
> their number from an object).
>
>
> Added a few debug info folks in case they've got opinions. I don't really
> mind if we removed data objects from debug_aranges, though as you say, it's
> arguably correct/maybe useful as-is. Supporting it properly, probably
> using address segment selectors, would be fine too. I guess AVR uses address
> spaces for its pointers to differentiate data and code addresses? In which
> case we could encode the LLVM address space as the segment selector (&
> probably would need to query the target to decide if it has non-zero
> address spaces and use that to decide whether to use segment selectors in
> debug_aranges).
>
> But in general, I'm mostly just discouraging people from using aranges -
> the data is duplicated in the CU's ranges anyway (there are some small
> caveats there - a producer doesn't /have/ to produce ranges on the CU, but
> I'd just say lower performance on such DWARF would be acceptable) & it makes
> object files/executables larger for minimal value/mostly duplicate data.
>
> - Dave
>
>
>
> Thanks for your help
>
> Luke
>
> --
> Codeplay Software Ltd.
> Company registered in England and Wales, number: 04567874
> Registered office: Regent House, 316 Beulah Hill, London, SE19 3HF
>
>