[llvm-dev] Metadata in LLVM back-end
Lorenzo Casalino via llvm-dev
llvm-dev at lists.llvm.org
Mon Aug 17 23:27:59 PDT 2020
On 07/08/20 at 22:54, David Greene wrote:
> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>>> As with IR-level metadata, there should be no guarantee that metadata is
>>> preserved and that it's a best-effort thing. In other words, relying on
>>> metadata for correctness is probably not the thing to do.
>> Ok, I made a mistake stating that metadata should be *preserved*; what
>> I really meant is to preserve the *information* that such metadata carries.
> We do have one way of doing that now that's nearly foolproof in terms of
> accidental loss: intrinsics. Intrinsics AFAIK are never just deleted
> and have to be explicitly handled at some point. Intrinsics may not
> work well for your use-case for a variety of reasons, but they are an option.
> I'm mostly just writing this to get thoughts in my head organized. :)
The only problem with intrinsics, for me, was the need to mirror the
already existing instructions. As you pointed out, if there's a way to map
intrinsics and instructions, there would be no reason to mirror the latter:
the former could just be used to carry metadata.
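To make that concrete, here is a minimal sketch of what I have in mind. The intrinsic name (@myproj.annotate.ptr) and the metadata payload are invented for illustration; the point is that a single generic intrinsic marks the value, while the ordinary store is left untouched instead of being mirrored:

```llvm
; Hypothetical generic annotation intrinsic (name invented): it returns
; its first operand, so the annotated pointer can be threaded through it.
declare ptr @myproj.annotate.ptr(ptr, ptr)

@info = private constant [7 x i8] c"secret\00"

define void @f(ptr %p, i32 %v) {
  ; Attach the info to %p via the intrinsic, then use the plain store.
  %ap = call ptr @myproj.annotate.ptr(ptr %p, ptr @info)
  store i32 %v, ptr %ap
  ret void
}
```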
>>>> It might be possible to have a dedicated data-structure for such
>>>> metadata info, and an instance of such structure assigned to each
>>>> instruction.
>>> I'm not entirely sure what you mean by this.
>> I was imagining a per-instruction data-structure collecting metadata info
>> related to that specific instruction, instead of having several metadata info
>> directly embedded in each instruction.
> Interesting. At the IR level metadata isn't necessarily unique, though
> it can be made so. If multiple pieces of information were amalgamated
> into one structure that might reduce the ability to share the in-memory
> representation, which has a cost. I like the ability of IR metadata to
> be very flexible while at the same time being relatively cheap in terms
> of resource utilization.
> I don't always like that IR metadata is not scoped. It makes it more
> difficult to process the IR for a Function in isolation. But that's a
> relatively minor quibble for me. It's a tradeoff between convenience
> and resource utilization.
Uhm... could I ask you to elaborate a bit more on the "limitation on
representation sharing"? It is not clear to me how this would incur a cost.
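If I understand the sharing you describe, it is the fact that a uniqued MDNode exists once in memory and any number of instructions point at it, as in this sketch (the metadata kind name !myproj.sec is invented):

```llvm
; Both loads reference the same uniqued node !0: one MDNode in memory,
; two cheap pointers to it. A per-instruction aggregate structure would
; presumably duplicate this information per instruction instead.
define i32 @g(ptr %a, ptr %b) {
  %x = load i32, ptr %a, !myproj.sec !0
  %y = load i32, ptr %b, !myproj.sec !0
  %s = add i32 %x, %y
  ret i32 %s
}

!0 = !{!"confidential"}
```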
>>> That's a great use-case. I do wonder about your use of "essential"
>> With *essential* I mean fundamental for satisfying a specific target
>> security property.
>>> Is it needed for correctness? If so an intrinsics-based solution
>>> may be better.
>> Uhm... it might sound like a naive question, but what do you mean by
>> "correctness"?
> I mean will the compiler generate incorrect code or otherwise violate
> some contract. In your secure compilation example, if the compiler
> *promises* that the generated code will be "secure" then that's a
> contract that would be violated if the metadata were lost.
You got the point: if metadata are not provided, or are lost, the codegen
phase is not able to fulfill the contract (in my use case, to generate code
that is secure).
>> I employed intrinsics as a means of carrying metadata but, in my
>> experience, I am not sure they can be regarded as a valid alternative:
>> - For each llvm-ir instruction employed in my project (e.g., store),
>> a semantically equivalent intrinsic is declared, with particular
>> parameters representing metadata (i.e., first-class metadata are
>> represented by specific intrinsic's parameters).
>> - During the lowering, each ad-hoc intrinsic must be properly
>> handled, manually adding the proper legalization operations, DAG
>> combinations and so on.
>> - During MIR conversion of the llvm-ir (i.e., mapping intrinsics to
>> pseudo-instructions), metadata are passed to the MIR representation
>> of the program.
>> In particular, the second point raises a critical problem in terms of
>> optimizations (e.g., intrinsic store + intrinsic trunc are not
>> automatically converted into an intrinsic truncated store). Then, the
>> backend must be instructed to perform such optimizations, which are
>> actually already performed on non-intrinsic instructions (e.g., store
>> + trunc is already converted into a truncated store).
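To make the missed combine concrete, a sketch (the intrinsic name is invented; the first function illustrates the fold I believe the DAG combiner normally performs):

```llvm
; Normal case: the truncate feeding the store is typically folded into a
; single truncating store during SelectionDAG combining.
define void @narrow(ptr %p, i32 %v) {
  %t = trunc i32 %v to i8
  store i8 %t, ptr %p
  ret void
}

; Intrinsic-mirrored case (name invented): the generic combine does not
; see through the call, so the same fold must be reimplemented by hand.
declare void @myproj.store.i8(ptr, i8, ptr)

define void @narrow_intrin(ptr %p, i32 %v, ptr %md) {
  %t = trunc i32 %v to i8
  call void @myproj.store.i8(ptr %p, i8 %t, ptr %md)
  ret void
}
```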
> Gotcha. That certainly is a lot of burden. Do the intrinsics *have to*
> mirror the existing instructions exactly or could a more generic
> intrinsic be defined that took some data as an argument, for example a
> pointer to a static string? Then each intrinsic instance could
> reference a static string unique to its context.
> I have not really thought this through, just throwing out ideas in a
> devil's advocate sort of way.
I like brainstorming ;)
> In my case using intrinsics would have to tie the intrinsic to the
> instruction it is annotating. This seems similar to your use-case.
> This is straightforward to do if everything is SSA but once we've gone
> beyond that things get a lot more complicated. The mapping of
> information to specific instructions really does seem like the most
> difficult bit.
No, intrinsics do not have to mirror existing instructions; yes, they
can be used just to carry around specific data as arguments. But then
we have our (implementation) problem: how to map info (e.g., intrinsics) to
instructions, and vice versa?
I am really curious how you would perform it in the pre-RA phase :)
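One existing mechanism in this spirit is the llvm.ptr.annotation intrinsic family (used by Clang's __attribute__((annotate))): it returns its first operand, so while the code is in SSA form the info-to-value mapping is just an ordinary use-def edge. A sketch (the exact signature varies across LLVM versions; the string contents are placeholders):

```llvm
; The annotated pointer is threaded through the call, tying the
; annotation string to the value via a use-def edge.
declare ptr @llvm.ptr.annotation.p0(ptr, ptr, ptr, i32, ptr)

@.str = private constant [7 x i8] c"secure\00"
@.file = private constant [4 x i8] c"f.c\00"

define i32 @h(ptr %p) {
  %ann = call ptr @llvm.ptr.annotation.p0(ptr %p, ptr @.str,
                                          ptr @.file, i32 1, ptr null)
  %v = load i32, ptr %ann
  ret i32 %v
}
```

Of course, as you note, once we leave SSA (post-RA, after the use-def chain is gone) this edge disappears, and that is exactly where the mapping problem starts.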
>> Instead of re-inventing the wheel, and since the backend should be
>> nonetheless modified in order to support optimizations on intrinsics,
>> I would rather prefer to insert some sort of mechanism to support
>> metadata attachment as first-class elements of the IR/MIR, and
>> automatic merging of metadata, for instance.
> Can you explain a bit more what you mean by "first-class?"
Never mind, I used the wrong terminology: I just meant to directly
embed metadata in the IR/MIR.
>> In any case, I wonder if metadata at codegen level is actually something
>> from which the community would benefit (thus justifying a potentially huge
>> and/or long series of patches), or something in which only a small
>> group would be interested.
> I would also like to know this. Have others found the need to convey
> information down to codegen and if so, what approaches were considered
> and tried?
> Maybe this is a niche requirement but I really don't think it is. I
> think it more likely that various hacks/modifications have been made
> over the years to sufficiently approximate a desired outcome and that
> this has led to not insignificant technical debt.
> Or maybe I just think that because I've worked on a 40-year-old compiler
> for my entire career. :)