[llvm-dev] Metadata in LLVM back-end

Fri Aug 7 13:54:32 PDT 2020

Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> As with IR-level metadata, there should be no guarantee that metadata is
>> preserved; it's a best-effort thing.  In other words, relying on
>> metadata for correctness is probably not the thing to do.

> Ok, I made a mistake stating that metadata should be *preserved*; what
> I really meant is to preserve the *information* that such metadata
> represent.

We do have one way of doing that now that's nearly foolproof in terms of
accidental loss: intrinsics.  Intrinsics AFAIK are never just deleted
and have to be explicitly handled at some point.  Intrinsics may not
work well for your use-case for a variety of reasons but they are an
option.
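
To make the contrast concrete, here is a toy Python model.  It is not
LLVM code: Inst, rewrite_add_zero and select are made-up names, and the
"selector" is only a stand-in for the point where an intrinsic has to be
handled explicitly.  The sketch assumes metadata is a side table on the
instruction that a pass can forget to copy.

```python
# Toy model only: none of these names are LLVM APIs.
from dataclasses import dataclass, field


@dataclass
class Inst:
    opcode: str
    operands: tuple = ()
    metadata: dict = field(default_factory=dict)


def rewrite_add_zero(inst):
    """A pass that rebuilds `x + 0` as a copy and forgets the metadata."""
    if inst.opcode == "add" and inst.operands[1:] == (0,):
        return Inst("copy", inst.operands[:1])  # metadata silently dropped
    return inst


KNOWN_OPCODES = {"add", "copy", "store"}


def select(inst):
    """A lowering step that must recognize every opcode it is handed."""
    if inst.opcode not in KNOWN_OPCODES:
        raise NotImplementedError(f"cannot select {inst.opcode}")
    return inst.opcode.upper()


annotated = Inst("add", ("%x", 0), {"secret": True})
rewritten = rewrite_add_zero(annotated)
```

In this model the annotation on `annotated` vanishes without any
diagnostic, whereas an instruction with an unknown opcode (standing in
for an unhandled intrinsic) makes `select` fail loudly.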

I'm mostly just writing this to get thoughts in my head organized.  :)

>> * By "it" I mean communicate information down to late phases of codegen.
>> I don't have a "metadata in codegen" patch as such.  I simply cobbled
>> something together in our downstream fork that works for some very
>> specific use-cases.

> I know what you have been through, and I can only agree with you: for
> the project I mentioned above, I had to perform several changes to the
> whole IR lowering phase in order to correctly propagate high-level
> information; it wasn't cheap and required a lot of effort.

I know your pain.  :)

>>> It might be possible to have a dedicated data-structure for such
>>> metadata info, and an instance of such structure assigned to each
>>> instruction.
>> I'm not entirely sure what you mean by this.
>
> I was imagining a per-instruction data-structure collecting metadata info
> related to that specific instruction, instead of having several metadata info
> directly embedded in each instruction.

Interesting.  At the IR level metadata isn't necessarily unique, though
it can be made so.  If multiple pieces of information were amalgamated
into one structure that might reduce the ability to share the in-memory
representation, which has a cost.  I like the ability of IR metadata to
be very flexible while at the same time being relatively cheap in terms
of resource utilization.
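
The sharing argument can be sketched with a toy interning pool in
Python.  This is an illustration under assumptions, not how LLVM stores
metadata: MetadataPool and count_nodes are hypothetical, and the two
annotation kinds are arbitrary placeholders.

```python
# Toy model only: these names are hypothetical, not LLVM APIs.

class MetadataPool:
    """Interns immutable nodes so that equal nodes share one object."""

    def __init__(self):
        self._nodes = {}

    def get(self, *fields):
        # Return the node already stored for these fields, or store it once.
        return self._nodes.setdefault(fields, fields)

    def __len__(self):
        return len(self._nodes)


def count_nodes(instructions, amalgamate):
    """Count the distinct nodes needed to annotate `instructions`.

    Each instruction is described by a (kind_a, kind_b) pair of values.
    """
    pool = MetadataPool()
    for a, b in instructions:
        if amalgamate:
            # One combined structure per instruction.
            pool.get("combined", a, b)
        else:
            # Two separate attachments, each uniqued on its own.
            pool.get("kind_a", a)
            pool.get("kind_b", b)
    return len(pool)


# 100 instructions covering every combination of 10 x 10 values.
insts = [(a, b) for a in range(10) for b in range(10)]
separate = count_nodes(insts, amalgamate=False)
combined = count_nodes(insts, amalgamate=True)
```

With these inputs the separate attachments need 20 uniqued nodes while
the combined form needs 100, because any difference in one field forces
a new combined node.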

I don't always like that IR metadata is not scoped.  It makes it more
difficult to process the IR for a Function in isolation.  But that's a
relatively minor quibble for me.  It's a tradeoff between convenience
and resource utilization.

>> That's a great use-case.  I do wonder about your use of "essential"
>> though.

> By *essential* I mean fundamental for satisfying a specific target
> security property.

>> Is it needed for correctness?  If so an intrinsics-based solution
>> may be better.

> Uhm...it might sound like a naive question, but what do you mean by
> *correctness*?

I mean will the compiler generate incorrect code or otherwise violate
some contract.  In your secure compilation example, if the compiler
*promises* that the generated code will be "secure" then that's a
contract that would be violated if the metadata were lost.

> I employed intrinsics as a means of carrying metadata, but, in my
> experience, I am not sure they are a valid alternative:
>
>  - For each llvm-ir instruction employed in my project (e.g., store),
>    a semantically equivalent intrinsic is declared, with particular
>    parameters representing metadata (i.e., first-class metadata are
>    represented by specific intrinsic parameters).
>
>  - During the lowering, each ad-hoc intrinsic must be properly
>    handled, manually adding the proper legalization operations, DAG
>    combinations and so on.
>
>  - During MIR conversion of the llvm-ir (i.e., mapping intrinsics to
>    pseudo-instructions), metadata are passed to the MIR representation
>    of the program.
>
> In particular, the second point raises a critical problem in terms of
> optimizations (e.g., intrinsic store + intrinsic trunc are not
> automatically converted into an intrinsic truncated store). The
> backend must then be instructed to perform such optimizations, which
> are already performed on non-intrinsic instructions (e.g., store
> + trunc is already converted into a truncated store).

Gotcha.  That certainly is a lot of burden.  Do the intrinsics *have to*
mirror the existing instructions exactly or could a more generic
intrinsic be defined that took some data as an argument, for example a
pointer to a static string?  Then each intrinsic instance could
reference a static string unique to its context.

I have not really thought this through, just throwing out ideas in a
devil's advocate sort of way.

In my case, an intrinsic-based approach would have to tie each intrinsic
to the instruction it is annotating.  This seems similar to your use-case.
This is straightforward to do if everything is SSA but once we've gone
beyond that things get a lot more complicated.  The mapping of
information to specific instructions really does seem like the most
difficult bit.
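
A toy Python sketch of why the mapping gets harder once SSA is gone.
Everything here is hypothetical (find_annotated, the instruction
tuples, the register assignment); it only assumes that an annotation
names its target by the value the target defines.

```python
# Toy model only: not LLVM code, all names are made up.

def find_annotated(insts, annotations):
    """Map each annotation note to the instructions defining its target.

    insts:        list of (defined_name, opcode) pairs in program order
    annotations:  list of (target_name, note) pairs
    """
    result = {}
    for target, note in annotations:
        result[note] = [
            index for index, (name, _) in enumerate(insts) if name == target
        ]
    return result


# SSA form: every name is defined exactly once, so the lookup is exact.
ssa = [("%1", "load"), ("%2", "trunc"), ("%3", "add")]
ssa_hits = find_annotated(ssa, [("%2", "secret")])

# After a made-up register assignment, %1 and %2 share r0; %3 gets r1.
regs = {"%1": "r0", "%2": "r0", "%3": "r1"}
lowered = [(regs[name], opcode) for name, opcode in ssa]
lowered_hits = find_annotated(lowered, [(regs["%2"], "secret")])
```

In SSA form the note resolves to the single trunc at index 1.  Once
names are reused, the same lookup returns both the load and the trunc,
so the annotation needs some other way to identify its instruction.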

> Instead of re-inventing the wheel, and since the backend should be
> nonetheless modified in order to support optimizations on intrinsics,
> I would rather prefer to insert some sort of mechanism to support
> metadata attachment as first-class elements of the IR/MIR, and
> automatic merging of metadata, for instance.

Can you explain a bit more what you mean by "first-class?"

> In any case, I wonder if metadata at codegen level is actually
> something the community would benefit from (thus justifying a
> potentially huge and/or long series of patches), or something in
> which only a small group would be interested.

I would also like to know this.  Have others found the need to convey
information down to codegen and if so, what approaches were considered
and tried?

Maybe this is a niche requirement but I really don't think it is.  I
think it more likely that various hacks/modifications have been made
over the years to sufficiently approximate a desired outcome and that
this has led to not insignificant technical debt.

Or maybe I just think that because I've worked on a 40-year-old compiler
for my entire career.  :)

                 -David