[llvm-dev] Metadata in LLVM back-end

Fri Jul 31 13:47:38 PDT 2020

Thanks for keeping this going, Lorenzo.

Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> The first questions need to be “what does it mean?”, “how does it
>> work?”, and “what is it useful for?”.  It is hard to evaluate a
>> proposal without that.
>
> Hi everyone,
>
> - "What does it mean?": it means to preserve specific information,
>   represented as metadata assigned to instructions, from the IR level
>   down to the codegen phases.

An important part of the definition is "how late?"  For my particular
uses it would be right up until lowering of asm pseudo-instructions,
even after regalloc and scheduling.  I don't know whether someone might
need metadata even later than that (at asm/obj emission time?) but if
metadata is supported on Machine IR then it shouldn't be an issue.

As with IR-level metadata, there should be no guarantee that metadata is
preserved; it's a best-effort thing.  In other words, relying on
metadata for correctness is probably not the thing to do.
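
A rough sketch of what "best-effort" means for a consumer, assuming a
made-up per-instruction hint table (Hints, "prefetch.distance" and
pickPrefetchDistance are hypothetical names, not LLVM APIs): the hint
only tunes an optimization, and a dropped hint falls back to a safe
default.

```cpp
#include <map>
#include <string>

// Hypothetical per-instruction hints. An absent key means "unknown",
// which is always a legal state because metadata may be dropped.
using Hints = std::map<std::string, int>;

// Pick a prefetch distance for a load. The hint can only make the code
// faster; without it we simply do not prefetch (distance 0), so the
// generated code is correct either way.
int pickPrefetchDistance(const Hints &H) {
  const int NoPrefetch = 0;
  Hints::const_iterator It = H.find("prefetch.distance");
  if (It == H.end() || It->second < 0)
    return NoPrefetch; // hint lost or nonsensical: conservative default
  return It->second;
}
```

If the code would be wrong without the hint, it isn't really a hint any
more, and something stronger than metadata is needed.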

> - "How does it work?": metadata should be preserved during the several
>   back-end transformations; for instance, during the lowering phase,
>   DAGCombine performs several optimizations on the IR, potentially
>   combining several instructions. The new instruction should then be
>   assigned metadata obtained as a proper combination of the original
>   ones (e.g., a union of metadata information).

I want to make it clear that this is expensive to do, in that the
changes to the codegen pipeline are quite extensive and widespread.  I
know because I've done it*.  :)  It will help if there are utilities
people can use to merge metadata during DAG transformation, and the more
we make such transfers and combinations "automatic" the easier it will
be to preserve metadata.

Once the mechanisms are there it also takes effort to keep them going.
For example if a new DAG transformation is done people need to think
about metadata.  This is where "automatic" help makes a real difference.

* By "it" I mean communicate information down to late phases of codegen.
I don't have a "metadata in codegen" patch as such.  I simply cobbled
something together in our downstream fork that works for some very
specific use-cases.
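
To make the "utilities" idea a bit more concrete, here is a rough sketch
of a union-style merge plus an "automatic" transfer helper.  This is
purely illustrative: it is not code from our fork or from LLVM, and
ToyNode, NodeMetadata, mergeMetadata and combineNodes are all made-up
names standing in for whatever would actually hang off SDNode or
MachineInstr.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for per-node codegen metadata:
// metadata kind -> set of values attached under that kind.
using NodeMetadata = std::map<std::string, std::set<std::string>>;

// Toy replacement for a DAG node; a real one would be an SDNode.
struct ToyNode {
  std::string Opcode;
  NodeMetadata MD;
};

// Union merge: keep every kind from both inputs and union their values.
NodeMetadata mergeMetadata(const NodeMetadata &A, const NodeMetadata &B) {
  NodeMetadata Result = A;
  for (const auto &KV : B)
    Result[KV.first].insert(KV.second.begin(), KV.second.end());
  return Result;
}

// "Automatic" transfer: if combined nodes are always built through a
// helper like this, a new combine cannot forget to carry metadata over.
ToyNode combineNodes(const std::string &NewOpcode, const ToyNode &A,
                     const ToyNode &B) {
  return ToyNode{NewOpcode, mergeMetadata(A.MD, B.MD)};
}
```

A union is only one possible policy.  Some kinds of metadata would want
an intersection, or would have to be dropped when the inputs disagree,
so a real utility would probably need a per-kind merge rule.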

>   It might be possible to have a dedicated data structure for such
>   metadata info, and an instance of that structure assigned to each
>   instruction.

I'm not entirely sure what you mean by this.

> - "What is it useful for?": I think it is quite context-specific; but,
>   in general, it is useful when some "higher-level" information
>   (e.g., information that can be discovered only before the back-end
>   stage of the compiler) is required in the back-end to perform
>   "semantic"-related optimizations.

That's my use-case.  There's semantic information codegen would like to
know that is much more practical to discover at the LLVM IR level, or
even to pass down from the frontend.  Much information is lost by the
time codegen is hit, and it's often impractical or impossible for
codegen to derive it from first principles.

> To give a (quite generic) example where such codegen metadata may be
> useful: in the field of "secure compilation", preservation of security
> properties during the compilation phases is essential; such properties
> are specified in the high-level specifications of the program, and may
> be expressed with IR metadata. The possibility to keep such IR
> metadata in the codegen phases may allow preservation of properties
> that may be invalidated by codegen phases.

That's a great use-case.  I do wonder about your use of "essential"
though.  Is it needed for correctness?  If so, an intrinsics-based
solution may be better.

My use-cases mostly revolve around communication with a proprietary
frontend and thus aren't useful to the community, which is why I haven't
pursued this with any great vigor before this.

I do have uses that convey information from LLVM analyses but
unfortunately I can't share them for now.

All of my use-cases are related to optimization.  No "metadata" is
needed for correctness.

I have pondered whether intrinsics might work for my use-cases.  My fear
with intrinsics is that they will interfere with other codegen analyses
and transformations.  For example they could be a scheduling barrier.

I also have wondered about how intrinsics work within SelectionDAG.  Do
they impact dagcombine and other transformations?  The reason I call out
SelectionDAG specifically is that most of our downstream changes related
to conveying information are in DAG-related files (dagcombine, legalize,
etc.).  Perhaps intrinsics could suffice for the purposes of getting
metadata through SelectionDAG with conversion to "first-class" metadata
at the Machine IR level.  Maybe this is even an intermediate step toward
"full metadata" throughout the compilation.

                -David