<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 28.10.2020 01:49, David Blaikie
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAENS6EsbwUnre9DFtq-Xs=QFMyr-6BTgOGwcOCX-RPFFdFDTLQ@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Oct 27, 2020 at
12:34 PM Alexey Lapshin <<a
href="mailto:avl.lapshin@gmail.com" moz-do-not-send="true">avl.lapshin@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p><br>
</p>
<div>On 27.10.2020 20:32, David Blaikie wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Oct 27,
2020 at 1:23 AM Alexey Lapshin <<a
href="mailto:avl.lapshin@gmail.com"
target="_blank" moz-do-not-send="true">avl.lapshin@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p><br>
</p>
<div>On 26.10.2020 22:38, David Blaikie wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sun,
Oct 25, 2020 at 9:31 AM Alexey Lapshin
<<a
href="mailto:avl.lapshin@gmail.com"
target="_blank" moz-do-not-send="true">avl.lapshin@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p><br>
</p>
<div>On 23.10.2020 19:43, David
Blaikie wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote
class="gmail_quote"
style="margin:0px 0px
0px
0.8ex;border-left:1px
solid
rgb(204,204,204);padding-left:1ex">
<div><br>
<br>
</div>
</blockquote>
<div><br>
</div>
<div>Ah, yeah - that
seems like a missed
opportunity -
duplicating the whole
type DIE. LTO does
this by making
monolithic types -
merging all the
members from different
definitions of the
same type into one,
but that's maybe too
expensive for dsymutil
(might still be
interesting to know
how much more
expensive, etc). But I
think the other way to
go would be to produce
a declaration of the
type, with the
relevant members - and
let the DWARF consumer
identify this
declaration as
matching up with the
earlier definition.
That's the sort of
DWARF you get from the
non-MachO default
-fno-standalone-debug
anyway, so it's
already pretty well
tested/supported
(support in lldb's a
bit younger/more
work-in-progress,
admittedly). I wonder
how much dsym size
there is that could be
reduced by such an
implementation.</div>
</div>
</div>
</blockquote>
<p>I see. Yes, that could be done, and I think it would result in a noticeable size reduction (I do not know the exact numbers at the moment).</p>
<p>I am working on the multi-threaded DWARFLinker now, and its first version will do exactly the same type processing as the current dsymutil.</p>
</blockquote>
<div>Yeah, best to keep the
behavior the same through that</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>The above scheme could be implemented as a next step, and it would result in a better size reduction (better than the current state).</p>
<p>But I think an even better scheme could be implemented, and it would result in an even bigger size reduction and faster execution. This scheme is similar to what you've described above: "LTO does - making monolithic types - merging all the members from different definitions of the same type into one".</p>
</div>
</blockquote>
<div>I believe the reason that's
probably not been done is that
it can't be streamed - it'd
lead to buffering more of the
output </div>
</div>
</div>
</blockquote>
<p>Yes. The fact that DWARF has to be streamed into AsmPrinter complicates parallel DWARF generation. In my prototype, I generate <br>
several resulting files (one per source compilation unit) and then sequentially glue them into the final resulting file.<br>
</p>
</div>
</blockquote>
<div>How does that help? Do you use
relocations in those intermediate object
files so the DWARF in them can refer
across files? <br>
</div>
</div>
</div>
</blockquote>
<p>It does not help with referring across files. It helps to parallelize the generation of CU bodies. <br>
It is not possible to write two CUs into one AsmPrinter in parallel. To make parallel generation possible, I stream them into different AsmPrinters (this comment is for "I believe the reason that's probably not been done is that it can't be streamed", which was initially about referring across files, but it seems I added another direction).<br>
</p>
</div>
</blockquote>
<div>Oh, I see - thanks for explaining, essentially
buffering on-disk. <br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p> </p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p> </p>
<p><br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div>(if two of these expandable
types were in one CU - the
start of the second type
couldn't be known until the
end because it might keep
getting pushed later due to
expansion of the first type)
and/or having to revisit all
the type references (the
offset to the second type
wouldn't be known until the
end - so writing the offsets
to refer to the type would
have to be deferred until
then).<br>
</div>
</div>
</div>
</blockquote>
<p>That is the second problem: offsets are not known until the end of the file.<br>
dsymutil already has that situation for inter-CU references, so it has an extra pass to<br>
fix up the offsets. </p>
</div>
</blockquote>
<div>Oh, it does? I figured it was
one-pass, and that it only ever refers
back to types in previous CUs? So it
doesn't have to go back and do a second
pass. But I guess if it sees a declaration of T1 in CU1, then later on sees a
definition of T1 in CU2, does it somehow
go back to CU1 and remove the
declaration/make references refer to the
definition in CU2? I figured it'd just
leave the declaration and references to
it as-is, then add the definition and
use that from CU2 onwards? <br>
</div>
</div>
</div>
</blockquote>
<p>For the processing of the types, it does not go back. <br>
This "I figured it was one-pass, and that it only ever refers back to types in previous CUs" <br>
and this "I figured it'd just leave the declaration and references to it as-is, then add the definition and use that from CU2 onwards" are correct. <br>
</p>
</div>
</blockquote>
<div>Great - thanks for explaining/confirming! </div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p> <br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>With a multi-threaded implementation such a situation would arise more often <br>
for type references, and so more offsets would have to be fixed up during the additional pass.<br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>DWARFLinker could create an additional artificial compile unit and put all merged types there, and later patch all type references to point into this additional compilation unit. No bits would be duplicated in that case. The performance improvement would come from the smaller amount of copied DWARF and from the fact that type references could be updated while the DWARF is cloned (no need for an additional pass for that).<br>
</p>
</div>
</blockquote>
<div>"later patch all type
references to point into this
additional compilation unit" -
that's the additional pass
that people are probably
talking/concerned about.
Rewalking all the DWARF. The
current dsymutil approach, as
far as I know, is single pass
- it knows the final, absolute
offset to the type from the
moment it emits that
type/needs to refer to it. <br>
</div>
</div>
</div>
</blockquote>
<p>Right. The current dsymutil approach is single-pass. And from that point of view, the solution <br>
you've described (producing a declaration of the type, with the relevant members) <br>
allows keeping that single-pass implementation.<br>
<br>
But there is a restriction in the current dsymutil approach: to process inter-CU references <br>
it needs to load all of the DWARF into memory (while it analyzes which part of the DWARF is live, <br>
it needs to have all CUs loaded into memory).</p>
</div>
</blockquote>
<div>All DWARF for a single file (which
for dsymutil is mostly a single CU,
except with LTO I guess?), not all DWARF
for all inputs in memory at once, yeah?
<br>
</div>
</div>
</div>
</blockquote>
<p>Right. In the dsymutil case - all DWARF for a single file (not all DWARF for all inputs in memory at once).<br>
But in the llvm-dwarfutil case the single file contains DWARF for all the original input object files, and it all gets<br>
loaded into memory.<br>
</p>
</div>
</blockquote>
<div>Yeah, it would be great to try to go CU-by-CU. </div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>That leads to huge memory usage. <br>
It is less important when the source is a set of object files (as in the dsymutil case), but this <br>
becomes a real problem for the llvm-dwarfutil utility when the source is a single file (with the current <br>
implementation it needs 30G of memory when processing the clang binary).<br>
</p>
</div>
</blockquote>
<div>Yeah, that's where I think you'd need
a fixup pass one way or another -
because cross-CU references can mean
that when you figure out a new layout
for CU5 (because it has a duplicate type
definition of something in CU1) then you
might have to touch CU4 that had an
absolute/cross-CU forward reference to
CU5. Once you've got such a fixup pass
(if dsymutil already has one? Which,
like I said, I'm confused why it would
have one/that doesn't match my very
vague understanding) then I think you
could make dsymutil work on a per-CU
basis streaming things out, then fixing
up a few offsets.<br>
</div>
</div>
</div>
</blockquote>
<p>When dsymutil deduplicates types it changes a local CU reference into an inter-CU reference (so that CU2 (next) can reference a type definition from CU1 (prev)). To make this change it currently does not need to do any fixups.<br>
<br>
When dsymutil meets a pre-existing (located in the input object file) inter-CU reference pointing into a CU which has not been processed yet (so its offset is still unknown), it marks it as a "forward reference" and patches it later, during the additional "fixup forward references" pass, at a time when the offsets are known.
<br>
</p>
</div>
</blockquote>
<div>OK, so limited 2 pass system. (does it do that
second pass once at the end of the whole dsymutil
run, or at the end of each input file? (so if an
input file has two CUs and the first CU references
a type in the second CU - it could write the first
CU with a "forward reference", then write the
second CU, then fixup the forward reference - and
then go on to the next file and its CUs - this
could improve performance by touching recently
used memory/disk pages only, rather than going all
the way back to the start later on when those
pages have become cold)</div>
</div>
</div>
</blockquote>
<p>Yes, it does it at the end of each input file.</p>
<p><br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p> <br>
If the CUs were processed in parallel, their offsets would not be known at the moment when a local type reference is changed into an inter-CU reference. So we would need to do the same fix-up processing for all references to types, as we already do for other inter-CU references.<br>
</p>
</div>
</blockquote>
<div>Yeah - though the existence of this second
"fixup forward references" system - yeah, could
just use it much more generally as you say. Not an
extra pass, just the existing second pass but
having way more fixups to fixup in that pass.</div>
</div>
</div>
</blockquote>
If we were able to change the algorithm in this way:
<br>
<br>
1. analyse all CUs.<br>
2. clone all CUs.<br>
<br>
then we could create a merged type table (an artificial CU containing the types) during step 1. <br>
If that type table were written first, then all following CUs could use known offsets <br>
to the types, and we would not need additional fix-up processing for type references. <br>
It would still be necessary to fix up other inter-CU references, but it would not be necessary <br>
to fix up type references (which constitute the vast majority).<br>
</div>
</blockquote>
<div><br>
</div>
<div>To me, that sounds more expensive than the fixup forward
references pass.</div>
</div>
</div>
</blockquote>
<p>In a direct comparison, yes, loading the DWARF one more time looks more expensive than the fixup forward references pass. But looking at the general picture, it could probably be beneficial:<br>
<br>
1. Merging types would lead to a smaller resulting DWARF, which would speed up the process.<br>
For example, if we switched "ODR types deduplication" off in the current implementation, it would double the execution time; that is because more DWARF would have to be cloned and written into the result. Implementing "merging types" would probably have a similar effect <br>
- it would speed up the overall process. So on the one hand the additional step for loading DWARF would <br>
decrease performance, but on the other hand the smaller amount of resulting data would increase performance.<br>
<br>
2. If the types were put into the first CU, we would have a simple strategy for our liveness analysis algorithm: just always keep the first CU in memory. This would allow us to speed up the liveness analysis step.<br>
<br>
Anyway, all of the above is just an idea for future work. Currently, I am going to implement multi-threaded processing for CUs loaded into memory, keeping the same type processing as we currently have (which assumes that the "fixup forward references" pass starts doing more work, by also fixing type references).<br>
</p>
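<p>To illustrate that last point, here is a minimal sketch (the names are invented for illustration, this is not the real DWARFLinker code): the per-CU cloning threads record a patch for every reference whose target offset is not yet known - type references included - and the existing final pass applies all patches once the CU offsets are final.</p>
<pre>
#include &lt;cstdint&gt;
#include &lt;mutex&gt;
#include &lt;vector&gt;

struct ReferencePatch {
  uint64_t PatchLocation;  // where in the output the reference was emitted
  const void *TargetDie;   // DIE whose final offset is not known yet
};

class ForwardReferenceFixups {
  std::mutex Lock;
  std::vector&lt;ReferencePatch&gt; Patches;

public:
  // Called from the per-CU cloning threads whenever a reference (including
  // a type reference) cannot be resolved to a final offset yet.
  void record(uint64_t PatchLocation, const void *TargetDie) {
    std::lock_guard&lt;std::mutex&gt; Guard(Lock);
    Patches.push_back({PatchLocation, TargetDie});
  }

  // Called once at the end, when every DIE has a final offset. 'Patch' is a
  // callable like void(uint64_t PatchLocation, uint64_t FinalOffset) that
  // rewrites the already-emitted reference value in the output buffer.
  template &lt;typename OffsetMapT, typename PatchFnT&gt;
  void apply(const OffsetMapT &amp;FinalOffsets, PatchFnT Patch) {
    for (const ReferencePatch &amp;P : Patches)
      Patch(P.PatchLocation, FinalOffsets.at(P.TargetDie));
  }
};
</pre>
<p>The point is that this would not be an extra pass - only the existing second pass with many more fixups to apply.</p>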
<p><br>
</p>
<blockquote type="cite"
cite="mid:CAENS6EsbwUnre9DFtq-Xs=QFMyr-6BTgOGwcOCX-RPFFdFDTLQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div> <br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p> <br>
</p>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Without loading all CUs into memory it would require a two-pass solution: first to analyze <br>
which parts of the DWARF relate to live code, and then a second pass to generate the result. <br>
</p>
</div>
</blockquote>
<div>Not sure it'd require any more second
pass than a "fixup" pass, which it
sounds like you're saying it already
has? <br>
</div>
</div>
</div>
</blockquote>
<p>It looks like it would need an additional pass to process inter-CU references (existing in the incoming file) if we do not want to load all CUs into memory.<br>
</p>
</div>
</blockquote>
<div>Usually inter-CU references aren't used, except
in LTO - and in LTO all the DWARF deduplication
and function discarding is already done by the IR
linker anyway. (ThinLTO is a bit different, but
really we'd be better off teaching it the extra
tricks anyway (some can't be fixed in ThinLTO -
like emitting a "Home" definition of an inline
function, only to find out other ThinLTO
backend/shards managed to optimize away all uses
of the function... so some cleanup may be useful
there)). It might be possible to do a more
dynamic/rolling cache - keep only the CUs with
unresolved cross-CU references alive and only keep
them alive until their cross-CU references are
found/marked alive. This should make things no
worse than the traditional dsymutil case - since
cross-CU references are only effective/generally
used within a single object file (it's possible to
create relocations for them into other files - but
I know LLVM doesn't currently do this and I don't
think GCC does it) with multiple CUs anyway - so
at most you'd keep all the CUs from a single
original input file alive together.<br>
</div>
</div>
</div>
</blockquote>
But, since it is a case documented in DWARF, the tool should be ready for it (when inter-CU <br>
references are heavily used).</div>
</blockquote>
<div><br>
Sure - but by implementing a CU liveness window like that
(keeping CUs live only so long as they need to be rather
than an all-or-nothing approach) only especially quirky
inputs would hit the worst case while the more normal inputs
could perform better.<br>
</div>
</div>
</div>
</blockquote>
<p>It is not clear what should be kept in such a CU liveness window. If CU100 references CU1 - how could we know that we need to keep CU1 in the liveness window before we have processed CU100?<br>
</p>
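<p>(Purely as an illustration of what such a window would need - the pre-scan idea and the names below are my assumptions, not something that exists today - one option might be a cheap pre-scan that only decodes inter-CU reference forms such as DW_FORM_ref_addr and records, for each CU, the last CU that references into it; the window would then keep a CU loaded until that last user has been processed.)</p>
<pre>
#include &lt;map&gt;

// Hypothetical bookkeeping for a rolling CU liveness window, filled by a
// cheap pre-scan over the inter-CU reference forms of the input file.
class CULivenessWindow {
  // referenced CU index -&gt; index of the last CU that references into it.
  std::map&lt;unsigned, unsigned&gt; LastUser;

public:
  void noteReference(unsigned FromCU, unsigned ToCU) {
    unsigned &amp;Last = LastUser[ToCU];
    if (FromCU &gt; Last)
      Last = FromCU;
  }

  // After CU 'JustProcessed' is done, a CU can be dropped from the window
  // if no later CU references into it.
  bool canUnload(unsigned CU, unsigned JustProcessed) const {
    auto It = LastUser.find(CU);
    return It == LastUser.end() || It-&gt;second &lt;= JustProcessed;
  }
};
</pre>
<p>Whether such a pre-scan would pay for itself is exactly the open question above.</p>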
<p><br>
</p>
<blockquote type="cite"
cite="mid:CAENS6EsbwUnre9DFtq-Xs=QFMyr-6BTgOGwcOCX-RPFFdFDTLQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div> Moreover, llvm-dwarfutil would be a tool producing <br>
exactly such a situation. The resulting file (produced by llvm-dwarfutil) would contain a lot of <br>
inter-CU references. Probably there is no practical reason to apply llvm-dwarfutil to the same <br>
file twice, but it would be a good test for the tool.<br>
</div>
</blockquote>
<div><br>
It'd be a good stress test, but not necessarily something
that would need to perform the best because it wouldn't be a
common use case.<br>
</div>
</div>
</div>
</blockquote>
<p>I agree that we should not slow down the DWARFLinker in the common cases only because we need to support the worst cases.<br>
But we also need to implement a solution which works in some acceptable manner for the worst case. The current solution - loading everything into memory - makes it hard to use in a non-dsymutil scenario (llvm-dwarfutil).<br>
</p>
<p>There are several things which could be used to decide whether we need to go down the light or the heavy path (a rough sketch of this dispatch follows below):<br>
<br>
1. If the input contains only a single CU, we do not need to unload it from memory. Thus we would not need an extra DWARF loading pass.<br>
2. If the abbreviations of the whole input file do not contain inter-CU reference forms, then while doing liveness analysis we do not need to wait until other CUs are processed.<br>
<br>
Then this scheme would be used for the worst cases:<br>
<br>
1. for (CU : CU1...CU100) {<br>
load CU.<br>
analyse CU.<br>
unload CU.<br>
} <br>
2. for (CU : CU1...CU100) {<br>
load CU.<br>
clone CU.<br>
unload CU.<br>
} <br>
3. fixup forward references.<br>
<br>
and this scheme for the light cases:<br>
<br>
1. for (CU : CU1...CU100) {<br>
load CU.<br>
analyse CU.<br>
clone CU.<br>
unload CU.<br>
}<br>
2. fixup forward references.<br>
</p>
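<p>For illustration only, this is roughly how that dispatch could look in C++ (the FileT/WriterT interfaces here are assumptions for the sketch, not the real DWARFLinker API):</p>
<pre>
// Illustrative only: FileT/WriterT are assumed interfaces, not real
// DWARFLinker types. The shape of the control flow is the point.
template &lt;typename FileT, typename WriterT&gt;
void linkFile(FileT &amp;File, WriterT &amp;Out) {
  // The light path is safe when there is a single CU (nothing has to be
  // unloaded early) or when the abbreviations declare no inter-CU reference
  // forms, so liveness analysis of one CU never depends on another CU.
  bool LightPath = File.numCompileUnits() == 1 ||
                   !File.abbrevsContainInterCUReferences();

  if (LightPath) {
    for (auto &amp;CU : File.compileUnits()) {
      CU.load();
      File.analyseLiveness(CU);
      File.clone(CU, Out);
      CU.unload();
    }
  } else {
    for (auto &amp;CU : File.compileUnits()) { // pass 1: liveness analysis only
      CU.load();
      File.analyseLiveness(CU);
      CU.unload();
    }
    for (auto &amp;CU : File.compileUnits()) { // pass 2: cloning
      CU.load();
      File.clone(CU, Out);
      CU.unload();
    }
  }
  Out.fixupForwardReferences();
}
</pre>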
<blockquote type="cite"
cite="mid:CAENS6EsbwUnre9DFtq-Xs=QFMyr-6BTgOGwcOCX-RPFFdFDTLQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>Generally, I think we should not assume that inter-CU references will only be used in a limited way.<br>
<br>
Anyway, if this scheme: <br>
<br>
1. analyse all CUs.<br>
2. clone all CUs.<br>
<p>turns out to be slow, then we would need to continue with the one-pass solution and not support complex, closely coupled inputs.<br>
</p>
</div>
</blockquote>
<div><br>
</div>
<div>yeah, certainly seeing the data/experiments will be
interesting, if you end up implementing some different
strategies, etc.<br>
<br>
I guess one possibility for parallel generation could be
something more like Microsoft's approach with a central
debug info server that compilers communicate with - not that
exact model, I mean, but if you've got parallel threads
generating reduced DWARF into separate object files - they
could communicate with a single thread responsible for type
emission - the type emitter would be given types from the
separate threads and compute their size, queue them up to be
streamed out to the type CU (& keep the source CU alive
until that work was done) - such a central type emitter
could quickly determine the size of the type to be emitted
and compute future type offsets (eg: if 5 types were in the
queue, it could've figured out the offset of those types
already) to answer type offset queries quickly and unblock
the parallel threads to continue emitting their CUs
containing type references.<br>
</div>
</div>
</div>
</blockquote>
<p>Yes, thank you. I will think about it.<br>
</p>
<p>Alexey.<br>
</p>
<blockquote type="cite"
cite="mid:CAENS6EsbwUnre9DFtq-Xs=QFMyr-6BTgOGwcOCX-RPFFdFDTLQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
- Dave </div>
</div>
</div>
</blockquote>
</body>
</html>