[llvm-dev] Metadata in LLVM back-end

Thu Aug 6 07:47:20 PDT 2020

On 31/07/20 at 22:47, David Greene wrote:

@David
> Thanks for keeping this going, Lorenzo.
>
> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> The first questions need to be “what does it mean?”, “how does it
>>> work?”, and “what is it useful for?”.  It is hard to evaluate a
>>> proposal without that.
>> Hi everyone,
>>
>> - "What does it mean?": it means to preserve specific information,
>> represented as   metadata assigned to instructions, from the IR level,
>> down to the codegen phases.
> An important part of the definition is "how late?"  For my particular
> uses it would be right up until lowering of asm pseudo-instructions,
> even after regalloc and scheduling.  I don't know whether someone might
> need metadata even later than that (at asm/obj emission time?) but if
> metadata is supported on Machine IR then it shouldn't be an issue.
"How late" it is context-specific: even in my case, I required such
information
to be preserved until pseudo instruction expansion. Conservatively, they
could be
preserved until the last pass of codegen pipeline.

As for their use in later steps, I would not claim that they are not
needed: I worked on one specific topic in secure compilation, and I do not
have the whole picture in mind. Nonetheless, we could first test how things
work out within codegen and reason about future developments later.

> As with IR-level metadata, there should be no guarantee that metadata is
> preserved and that it's a best-effort thing.  In other words, relying on
> metadata for correctness is probably not the thing to do.
OK, I made a mistake in stating that metadata should be *preserved*; what
I really meant is to preserve the *information* that such metadata
represent.
>> - "How does it work?": metadata should be preserved during the several
>>    back-end transformations; for instance, during the lowering phase,
>> DAGCombine    performs several optimization to the IR, potentially
>> combining several    instructions. The new instruction should, then,
>> assigned with metadata obtained    as a proper combination of the
>> original ones (e.g., a union of metadata    information).
> I want to make it clear that this is expensive to do, in that the number
> of changes to the codegen pipeline is quite extensive and widespread.  I
> know because I've done it*.  :)  It will help if there are utilities
> people can use to merge metadata during DAG transformation and the more
> we make such transfers and combinations "automatic" the easier it will
> be to preserve metadata.
>
> Once the mechanisms are there it also takes effort to keep them going.
> For example if a new DAG transformation is done people need to think
> about metadata.  This is where "automatic" help makes a real difference.
>
> * By "it" I mean communicate information down to late phases of codegen.
> I don't have a "metadata in codegen" patch as such.  I simply cobbled
> something together in our downstream fork that works for some very
> specific use-cases.
I know what you have been through, and I can only agree with you: for the
project I mentioned above, I had to make several changes across the whole
IR lowering phase in order to propagate high-level information correctly.
It wasn't cheap and required a lot of effort.
>>   It might be possible to have a dedicated data structure for such
>>   metadata info, and an instance of such a structure assigned to each
>>   instruction.
> I'm not entirely sure what you mean by this.

I was imagining a per-instruction data structure that collects all the
metadata related to that specific instruction, instead of having several
pieces of metadata embedded directly in each instruction.
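
To make this a bit more concrete, here is a minimal sketch in plain C++ of
what I have in mind. It is only an illustration: MetadataBundle, ToyInstr
and combine are hypothetical names, nothing here is an existing LLVM API,
and the "union" merge policy is just the example mentioned earlier in this
thread.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical sketch, not an LLVM API.
// One bundle per instruction: metadata kind -> set of values.
struct MetadataBundle {
  std::map<std::string, std::set<std::string>> Entries;

  void add(const std::string &Kind, const std::string &Value) {
    assert(!Kind.empty() && "a metadata kind must have a name");
    Entries[Kind].insert(Value);
  }

  bool has(const std::string &Kind, const std::string &Value) const {
    auto It = Entries.find(Kind);
    return It != Entries.end() && It->second.count(Value) != 0;
  }

  // Union merge: keep every value seen on either side.
  void mergeFrom(const MetadataBundle &Other) {
    for (const auto &KV : Other.Entries)
      Entries[KV.first].insert(KV.second.begin(), KV.second.end());
  }
};

// Toy stand-in for an instruction that carries exactly one bundle.
struct ToyInstr {
  std::string Opcode;
  MetadataBundle MD;
};

// Build the instruction that replaces A and B; its bundle is the union
// of the two original bundles.
inline ToyInstr combine(const std::string &NewOpcode, const ToyInstr &A,
                        const ToyInstr &B) {
  ToyInstr Result{NewOpcode, A.MD};
  Result.MD.mergeFrom(B.MD);
  return Result;
}
```

With a single bundle per instruction, a transformation only has to forward
or merge one object, whatever kinds of metadata it happens to contain.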

>> - "What is it useful for?": I think it is quite context-specific; but,
>>   in general, it is useful when some "higher-level"   information
>> (e.g., that canbe discovered only before the back-end   stage of the
>> compiler) are required in the back-end to perform "semantic"-related  
>> optimizations.
> That's my use-case.  There's semantic information codegen would like to
> know but is really much more practical to discover at the LLVM IR level
> or even passed from the frontend.  Much information is lost by the time
> codegen is hit and it's often impractical or impossible for codegen to
> derive it from first principles.
>
>> To give an (quite generic) example where such codegen metadata may be
>> useful: in the field of "secure compilation", preservation of security
>> properties during the compilation phases is essential; such properties
>> are specified in the high-level specifications of the program, and may
>> be expressed with IR metadata. The possibility to keep such IR
>> metadata in the codegen phases may allow preservation of properties
>> that may be invalidated by codegen phases.
> That's a great use-case.  I do wonder about your use of "essential"
> though.
By *essential* I mean fundamental for satisfying a specific target
security property.
>   Is it needed for correctness?  If so an intrinsics-based
> solution may be better.
Uhm... it might sound like a naive question, but what do you mean by
*correctness*?
> My use-cases mostly revolve around communication with a proprietary
> frontend and thus aren't useful to the community, which is why I haven't
> pursued this with any great vigor before this.
>
> I do have uses that convey information from LLVM analyses but
> unfortunately I can't share them for now.
>
> All of my use-cases are related to optimization.  No "metadata" is
> needed for correctness.

> I have pondered whether intrinsics might work for my use-cases.  My fear
> with intrinsics is that they will interfere with other codegen analyses
> and transformations.  For example they could be a scheduling barrier.
>
> I also have wondered about how intrinsics work within SelectionDAG.  Do
> they impact dagcombine and other transformations?  The reason I call out
> SelectionDAG specifically is that most of our downstream changes related
> to conveying information are in DAG-related files (dagcombine, legalize,
> etc.).  Perhaps intrinsics could suffice for the purposes of getting
> metadata through SelectionDAG with conversion to "first-class" metadata
> at the Machine IR level.  Maybe this is even an intermediate step toward
> "full metadata" throughout the compilation.

I used intrinsics as a means of carrying metadata but, in my experience,
I am not sure they are a valid alternative:

 - For each LLVM IR instruction used in my project (e.g., store), a
   semantically equivalent intrinsic is declared, with specific parameters
   representing metadata (i.e., first-class metadata are represented by
   specific intrinsic parameters).

 - During lowering, each ad-hoc intrinsic must be handled properly,
   manually adding the proper legalization operations, DAG combinations
   and so on.

 - During the conversion of the LLVM IR to MIR (i.e., mapping intrinsics
   to pseudo-instructions), metadata are passed to the MIR representation
   of the program.

In particular, the second point raises a critical problem in terms of
optimizations (e.g., intrinsic store + intrinsic trunc are not
automatically combined into an intrinsic truncated store). The backend
must then be taught to perform such optimizations, which are already
performed on non-intrinsic instructions (e.g., store + trunc is already
combined into a truncated store).

Instead of reinventing the wheel, and since the backend would have to be
modified anyway to support optimizations on intrinsics, I would rather
add some mechanism that supports metadata attachment as first-class
elements of the IR/MIR, together with, for instance, automatic merging
of metadata.
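
As a rough illustration of what "automatic merging" could look like, here
is a toy model in plain C++. Everything in it is an assumption made for the
example: ToyDAG, createReplacement and combineTruncStore are hypothetical
names, the tags are invented, and this is not the SelectionDAG API. The
point is only that, if node replacement goes through a single helper that
merges metadata, the combine itself needs no metadata-specific code.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model, not the SelectionDAG API.
struct ToyNode {
  std::string Opcode;
  std::vector<int> Operands; // ids of the operand nodes
};

class ToyDAG {
  std::vector<ToyNode> Nodes;
  std::map<int, std::set<std::string>> Tags; // side table: node id -> tags

public:
  int addNode(const std::string &Opcode, std::vector<int> Operands = {}) {
    Nodes.push_back({Opcode, std::move(Operands)});
    return static_cast<int>(Nodes.size()) - 1;
  }

  void tag(int Id, const std::string &T) { Tags[Id].insert(T); }

  const ToyNode &node(int Id) const { return Nodes.at(Id); }

  std::set<std::string> tagsOf(int Id) const {
    auto It = Tags.find(Id);
    return It == Tags.end() ? std::set<std::string>{} : It->second;
  }

  // Single entry point for creating a node that replaces other nodes:
  // the union of the replaced nodes' tags is attached automatically.
  int createReplacement(const std::string &Opcode, std::vector<int> Operands,
                        const std::vector<int> &Replaced) {
    assert(!Replaced.empty() && "a replacement must replace something");
    int NewId = addNode(Opcode, std::move(Operands));
    for (int Old : Replaced) {
      auto It = Tags.find(Old);
      if (It != Tags.end())
        Tags[NewId].insert(It->second.begin(), It->second.end());
    }
    return NewId;
  }
};

// A combine written without any metadata code:
//   store(trunc(x), ptr) -> truncstore(x, ptr)
// Returns the id of the new node, or -1 if the pattern does not match.
inline int combineTruncStore(ToyDAG &DAG, int StoreId) {
  const ToyNode Store = DAG.node(StoreId); // copy: addNode may reallocate
  if (Store.Opcode != "store" || Store.Operands.size() != 2)
    return -1;
  int ValueId = Store.Operands[0];
  const ToyNode Value = DAG.node(ValueId);
  if (Value.Opcode != "trunc" || Value.Operands.size() != 1)
    return -1;
  return DAG.createReplacement("truncstore",
                               {Value.Operands[0], Store.Operands[1]},
                               {StoreId, ValueId});
}
```

Here the tags live in a side table rather than in the nodes, so existing
combines would only have to route node creation through the helper. A real
design would also need a merge policy per metadata kind, since a plain
union is not always the right combination.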

----

@Chris

I may be wrong (in which case, please correct me), but if I understood
correctly, source-level debugging metadata are "external" (i.e., not a
first-class element of the LLVM IR), and managing them involves a great
deal of effort.

As described above, in my project I used metadata as first-class elements
of the IR/MIR; I found this approach more direct and simpler to handle,
although some passes and transformations must be modified.

So I agree with you that metadata info should be first-class elements of
the IR/MIR (or, at least, "packed" into a structure that is a first-class
part of the IR/MIR).

----

In any case, I wonder whether metadata at the codegen level is actually
something the community would benefit from (thus justifying a potentially
huge and/or long series of patches), or something that only a small group
would be interested in.

Cheers
-- Lorenzo