[PATCH] D32061: [wip] Bitcode: Write the irsymtab to disk.

Thu Apr 20 17:14:22 PDT 2017

pcc added inline comments.

================
Comment at: llvm/include/llvm/Object/IRSymtab.h:122
+  /// in symbol table format, symbol enumeration order and so on.
+  Str Producer;
+
----------------
mehdi_amini wrote:
> tejohnson wrote:
> > pcc wrote:
> > > mehdi_amini wrote:
> > > > pcc wrote:
> > > > > mehdi_amini wrote:
> > > > > > pcc wrote:
> > > > > > > mehdi_amini wrote:
> > > > > > > > tejohnson wrote:
> > > > > > > > > pcc wrote:
> > > > > > > > > > tejohnson wrote:
> > > > > > > > > > > Does this mean that the symbol table will be thrown away every time the llvm revision changes? Can we avoid that by having a symbol table version number?
> > > > > > > > > > That would be tricky to get right. There are many things that could change the symbol table, even without changing the format.
> > > > > > > > > > 
> > > > > > > > > > For example, a change such as D31443 would re-order the symbol table, despite not touching the bitcode reader at all.
> > > > > > > > > > 
> > > > > > > > > > As another example, suppose that we decide to remove an intrinsic from LLVM, and we upgrade existing IR by removing the intrinsic declaration (and changing calls to do something else). That by itself would change the symbol table.
> > > > > > > > > > 
> > > > > > > > > > So I think the safest option would be to accept only symbol tables from the same revision of LLVM.
> > > > > > > > > > For example, a change such as D31443 would re-order the symbol table, despite not touching the bitcode reader at all.
> > > > > > > > > 
> > > > > > > > > I guess this makes a different for regularLTO where we need to correlate the symbol resolutions built from the symbol table with the symbols in the module. Presumably it shouldn't matter for something like ThinLTO though, where in the thin link we eventually shouldn't have to look at the module ir at all. Not that this helps much since we don't know the use case. Unfortunate, but I can't think of a way to avoid it offhand, without adding some other way to correlate the module symbols to the symbol table symbols.
> > > > > > > > >So I think the safest option would be to accept only symbol tables from the same revision of LLVM.
> > > > > > > > 
> > > > > > > > But that's a design decision. The symbol table could be stored as bitcode with the same exact restriction as all the usual compatibility, and avoid any duplication with the rest of the module (i.e. not being able to regenerate the symbol table from the rest of the bitcode is *not* a bad thing to me, quite the opposite).
> > > > > > > > I'm not sure what makes a symbol table "special" here.
> > > > > > > > 
> > > > > > > Even with a bitcode design that somehow avoided all duplication between the module and the symbol table, I don't see why we wouldn't still have the same set of issues when upgrading old bitcode.
> > > > > > I don't follow.
> > > > > > 
> > > > > The ModuleSymbolTable is the "source of truth" for a module's symbol table, and the irsymtab is effectively a "cache" of the ModuleSymbolTable (an irsymtab is derived from a ModuleSymbolTable). As previously mentioned, a ModuleSymbolTable can change as a result of arbitrary changes to the compiler. The situation we want to avoid is where irsymtab::read gives us a different set of symbols than the ModuleSymbolTable, which could happen if the irsymtab was created using an old version of LLVM. We are robust against such situations by checking the revision number embedded in the bitcode file, and rebuilding the irsymtab if it does not match.
> > > > > 
> > > > > One possible way of being resistant to changes in the symbol table suggests itself: we could discard resolutions for symbols that do not appear in the ModuleSymbolTable, and make up resolutions for symbols that did not appear in the irsymtab. But that doesn't seem particularly sound, because the linker still needs to see the correct set of symbols at symbol resolution time for correctness. To go back to the example where we remove an intrinsic, if the intrinsic is upgraded by replacing it with a call to a runtime library function, this would result in a new undefined symbol in the symbol table. Although we could invent a resolution for it (non-prevailing + non-local + external would do), this new undefined symbol may require the linker to have different behaviour at symbol resolution time. For example, the new undefined symbol may require additional archive members to be loaded, which the linker may not be prepared to do once it has completed LTO. The simplest way to handle that situation is by rebuilding the irsymtab, which would ensure that the undefined symbol is available at symbol resolution time.
> > > > > The ModuleSymbolTable is the "source of truth" for a module's symbol table, and the irsymtab is effectively a "cache" of the ModuleSymbolTable (an irsymtab is derived from a ModuleSymbolTable)
> > > > 
> > > > Well that is exactly what I qualified right above:
> > > > 
> > > > > But that's a design decision. 
> > > > 
> > > > It does not *have to be* this way. 
> > > I mean sure, it may be possible to redesign how the bitcode reader works so that the symbol table is implicitly "upgraded" as it is read, and the module bitcode reader uses IR entities derived from the "upgraded" symbol table. That may be something to consider exploring if we ever redesign the bitcode format. My gut feeling, though, is that it would be too complicated to be worth it.
> > Talking through this with pcc this morning, I now tend to agree that it doesn't make sense to optimize for the case where the llvm revision has changed. If the format of the symbol table changes so that fields like linkage, visibility etc are shared between the irsymtab and the IR, then of course it must be kept in sync with the IR at all times. But at least with the format we have in D31364, where the irsymtab is essentially a cache of the symbol information in the module, this seems fine.
> > 
> > Mehdi, do you still have concerns?
> > Mehdi, do you still have concerns?
> 
> Basically the logic I'm reading here is "the design choice implemented here does not optimize for use-case A so use-case A does not make sense".
> 
> This seems completely flawed to me, so I clearly don't understand the reasoning.
The choice being made here is not set in stone. We may decide on a different approach later. But it would be best to consider that orthogonally.

Indeed, if we start with the design that you appear to be advocating, it would not only make the code more complicated, but make it that much more difficult to back out of that decision if it turns out to be wrong, due to the need to continuously support upgrading from the old design. It is not clear to me that a design that avoids duplication like you appear to be proposing would be better because the amount of information being duplicated is, frankly, not large (at least compared to names, which are already being shared) and it is unclear how such a design would address module inline asm symbols.

On the other hand, if the design you are advocating turns out to be better, we can move from this design to that one without the need for upgrade support.

Normally I would expect the consumer's revision to be the same as the producer's, so the need to optimize upgrades does not seem strong. That said, I would be prepared to acknowledge such a need. But there are other ways of optimizing upgrades that do not involve creating compatibility constraints. For example, you can imagine writing upgraded bitcode files (or a marker indicating that the file does not require an upgrade) to the ThinLTO cache directory. So I do not want to make any decisions in this space prematurely.

https://reviews.llvm.org/D32061