[llvm-dev] Metadata in LLVM back-end

Wed Jun 16 16:49:49 PDT 2021

Hi all,

Thanks for resuscitating this discussion.

@Lorenzo please pardon me for dropping this for quite a while. It was
indeed a tense period for me.

@Matt yes, it'd be awesome if you could sketch an RFC; we can definitely
iterate on it to come up with more polished versions. I'd be more than happy
to help in any way I can.

Son Tuan Vu

On Wed, 16 Jun 2021 at 22:42, Matt Morehouse <mascasa at google.com> wrote:

> Thanks for the update, Lorenzo.
>
> I have some free time to work on an RFC, but I'm unfamiliar with how the
> implementation details would work.
>
> If I dig through this thread and try to draft something, would you and/or
> Son be willing to contribute?
>
> Thanks,
> Matt
>
> On Wed, Jun 16, 2021 at 12:02 PM Lorenzo Casalino <
> lorenzo.casalino93 at gmail.com> wrote:
>
>> Hello Matt,
>>
>> I think that the RFC drafting went stale some months ago due to the heavy
>> workload that all the participants were under.
>>
>> As of now, I do not know when the RFC will be actually drafted and sent.
>>
>> Cheers,
>> Lorenzo
>>
>> On 16 June 2021 at 1:32 AM, Matt Morehouse <mascasa at google.com> wrote:
>>
>> 
>> Did anyone send an RFC for this?
>>
>> First-class metadata would be exceptionally useful for sanitizers and
>> other dynamic tools.  For example, we want to construct PC-keyed metadata
>> tables in the binary (without affecting the generated code), to inform
>> program behavior at runtime or to allow offline analysis.  A prerequisite
>> is to actually propagate the metadata we need from the Clang frontend or
>> LLVM middle-end down to the assembly printer.
>>
>> Our team has brainstormed many use cases:
>>
>> - *GWP-TSan* <https://youtu.be/2KvaKEyMVEU>:  storing PCs of accesses
>>   lowered from C++ atomics, to filter them out from race detection.
>>   *  List<atomic access PC>
>>
>> - *Stack trace compression*:  storing a conservative call graph
>>   <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151044.html>, for
>>   use in decompressing stack traces offline.
>>   *  Map[callsite PC] -> List<callee PC>
>>
>> - *no_sanitize attributes*:  storing a map of functions that have the
>>   no_sanitize("...") attribute to the associated sanitizer, for filtering
>>   out from GWP-*San.  Ideally we do not introduce new no_sanitize string
>>   literals, but simply rely on existing ones (e.g. a no_sanitize("thread")
>>   works for both TSan and GWP-TSan).
>>   *  Map[Func] -> SanitizerKind
>>
>> - *Fuzzing aid/CFG reconstruction*:  marking coverage PCs as function
>>   entry/exit or # of outgoing edges from BB (allows finding gaps in the
>>   coverage frontier).
>>
>> - *Type-aware malloc and heap profiling*:  enable the allocator to get
>>   the type for a given new call, to optimize for expected usage of the
>>   allocation.
>>   *  Map[new callsite PC] -> object type
>>
>> - *Other*:  potential use cases for future bug-finding tools
>>   (GWP-assert, GWP-MSan, GWP-DFSan, GWP-UBSan).
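
To make the table shapes listed above concrete, here is a minimal sketch
of how a runtime or offline tool might query PC-keyed tables once they
exist. Everything in it is an assumption for illustration: the PCTable
class, the helper names and the addresses are hypothetical, and nothing
here reflects an actual LLVM section layout or API.

```python
from bisect import bisect_left


class PCTable:
    """Toy PC-keyed table: keys sorted once, lookups by binary search."""

    def __init__(self, entries):
        # entries: iterable of (pc, payload) pairs.
        pairs = sorted(entries, key=lambda entry: entry[0])
        self._pcs = [pc for pc, _ in pairs]
        self._payloads = [payload for _, payload in pairs]

    def lookup(self, pc):
        # Return the payload stored for exactly this PC, or None.
        i = bisect_left(self._pcs, pc)
        if i < len(self._pcs) and self._pcs[i] == pc:
            return self._payloads[i]
        return None


# Hypothetical addresses, for illustration only.
# List<atomic access PC>
atomic_access_pcs = PCTable((pc, True) for pc in [0x401020, 0x401088])
# Map[callsite PC] -> List<callee PC>
call_graph = PCTable([
    (0x401100, [0x402000, 0x402400]),
    (0x401180, [0x402000]),
])


def is_atomic_access(pc):
    """True if the access at this PC was lowered from a C++ atomic."""
    return atomic_access_pcs.lookup(pc) is not None


def possible_callees(callsite_pc):
    """Conservative set of callee entry PCs for a call site."""
    return call_graph.lookup(callsite_pc) or []
```

A race detector could call is_atomic_access(pc) before reporting, and a
stack-trace decompressor could walk possible_callees() from a known
frame. A sorted array with binary search is only one plausible encoding.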
>>
>> First-class metadata would open the door to some really cool things.
>>
>> Thanks,
>> Matt Morehouse
>>
>>
>> On Wed, Jan 6, 2021 at 5:56 AM Lorenzo Casalino via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Dear Tuan,
>>>
>>> How are you doing? Did you manage to start the draft for the RFC?
>>>
>>>
>>> I take this opportunity to wish you all the best for this new year :)
>>>
>>> Best regards,
>>> Lorenzo Casalino
>>> On 10 Nov 2020 at 09:27, Lorenzo Casalino wrote:
>>>
>>>
>>> On 9 Nov 2020 at 00:30, Son Tuan VU wrote:
>>>
>>> Hi,
>>>
>>> Thank you all for keeping this going. Indeed I was not aware that the
>>> discussion was going on, I am really sorry for this late reply.
>>>
>>> Nice to hear you again! Thank you for starting this thread ;)
>>>
>>> I understand Chris' point about metadata design. Either the metadata
>>> becomes stale or removed (if we do not teach transformations to preserve
>>> it), or we end up modifying many (if not all) transformations to keep the
>>> data intact.
>>> Currently in the IR, I feel like the default behavior is to
>>> ignore/remove the metadata, and only a limited number of transformations
>>> know how to maintain and update it, which is a best-effort approach.
>>> That being said, my initial thought was to adopt this approach to the
>>> MIR, so that we can at least have a minimal mechanism to communicate
>>> additional information to various transformations, or even dump it to the
>>> asm/object file.
>>> In other words, it is the responsibility of the users who introduce/use
>>> the metadata in the MIR to teach the transformations they selected how to
>>> preserve their metadata. A common API to abstract this would definitely
>>> help, just as combineMetadata() from lib/Transforms/Utils/Local.cpp does.
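
The drop-by-default, opt-in-to-preserve policy described above can be
modelled with a small toy sketch. It is not LLVM code: Instr,
replace_instr, combine_metadata and the "sec.tag" kind are hypothetical
names, and the merge rule (keep a kind only when it is known and both
instructions agree) is a simplifying assumption rather than the behavior
of the real combineMetadata().

```python
class Instr:
    """Toy machine instruction carrying a dict of metadata kinds."""

    def __init__(self, opcode, metadata=None):
        self.opcode = opcode
        self.metadata = dict(metadata or {})


def replace_instr(old, new_opcode, preserve=()):
    """Model a rewriting pass: metadata is dropped unless the pass
    explicitly lists the kind in `preserve`."""
    kept = {k: v for k, v in old.metadata.items() if k in preserve}
    return Instr(new_opcode, kept)


def combine_metadata(kept, dropped, known_kinds):
    """Model merging two instructions into `kept`: a kind survives only
    if it is in `known_kinds` and both instructions carry the same value
    (assumed rule, for illustration)."""
    merged = {}
    for kind, value in kept.metadata.items():
        if kind in known_kinds and dropped.metadata.get(kind) == value:
            merged[kind] = value
    kept.metadata = merged
    return kept


# A pass that knows nothing about "sec.tag" silently loses it...
load = Instr("load", {"sec.tag": "secret", "dbg": 1})
unaware = replace_instr(load, "mov")
# ...while a pass taught about the kind carries it over.
aware = replace_instr(load, "mov", preserve={"sec.tag"})
```

Here unaware.metadata ends up empty while aware.metadata keeps
"sec.tag", which is the best-effort behavior described above: users who
care about a kind teach the passes they selected to preserve it.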
>>>
>>> Unfortunately, I have never worked with LLVM IR metadata (I have
>>> focused almost entirely on the back-end and have only scratched the
>>> surface of LLVM's middle-end), but I see your point.
>>>
>>> Clearly, applying the needed modifications to all the back-end
>>> transformations/optimizations
>>> is unfeasible and, probably, not worth it -- different users may have
>>> different requirements/needs
>>> regarding a specific pass.
>>>
>>> I like the idea of a common API to handle the MIR metadata, and let the
>>> end user handle
>>> such data. Of course, if the community encounters common cases while
>>> handling the metadata, such
>>> cases may be integrated with the upstream project.
>>>
>>> Nonetheless, the main point of this thread is to preserve middle-end
>>> metadata down to the back-end, right after the Instruction Selection
>>> phase. Hence, regardless of the end user's needs, a "preserve-all"
>>> policy during the lowering stage is required, which will involve some
>>> changes, in particular in the DAGCombine pass.
>>>
>>>
>>> As for my use case, it is also security-related. However, I do not
>>> consider the metadata to be a compilation "correctness" criterion:
>>> metadata, by definition (from the LLVM IR), can be safely removed
>>> without affecting the program's correctness.
>>> If possible, I would like to have more details on Lorenzo's use case in
>>> order to see how metadata would interfere with the program's correctness.
>>>
>>> I would really like to discuss here the details, but, unfortunately, I
>>> am working on a publication
>>> and, thus, I cannot disclose any detail here :(
>>>
>>> However, by "correctness" I do not mean "I/O correctness", but the
>>> preservation of a security property expressed in the front-end (e.g.,
>>> specified in the source code) or in the middle-end (e.g., specified in
>>> the LLVM IR, for instance by a transformation pass).
>>>
>>> From a security point of view, removing or altering metadata does not
>>> interfere with the I/O functionality of the code (although it may
>>> affect performance), but it may introduce vulnerabilities.
>>>
>>> As for the RFC, I can definitely try to write one, but this would be my
>>> first time doing so. But maybe it is better to start with Lorenzo's
>>> proposal, as you have already been working on this? Please tell me if you
>>> prefer me to start the RFC though.
>>>
>>> It is the first time for me too, do not worry!
>>>
>>> We could just use any other RFC as a template to get started :D
>>>
>>> I think that a structure like the following would be fine:
>>>
>>>   1. Background
>>>      1.1 Motivation
>>>      1.2 Use-cases
>>>      1.3 Other approaches
>>>   2. Goal(s)
>>>   3. Requirements
>>>   4. Drawbacks and main bottlenecks
>>>   5. Design sketch
>>>   6. Roadmap sketch
>>>   7. Potential future development
>>>
>>> It may be a bit overkill; you are warmly invited to cut/refine these
>>> points!
>>>
>>> And...no, I still have no sketch of the RFC; sorry, I have had quite a
>>> workload these days.
>>>
>>> Yes, you can start the write up of the RFC.
>>>
>>> Quoting David:
>>>
>>>   "Since you first raised the topic [...] I want to give you right of
>>> first refusal."
>>>
>>>
>>> Have a nice day!
>>>
>>> -- Lorenzo
>>>
>>> Thank you again for keeping this going.
>>>
>>> Sincerely,
>>>
>>> - Son
>>>
>>> On Wed, Nov 4, 2020 at 6:30 PM Lorenzo Casalino <
>>> lorenzo.casalino93 at gmail.com> wrote:
>>>
>>>>
>>>> On 4 Nov 2020 at 17:40, David Greene wrote:
>>>> > Sorry about the late reply.
>>>> >
>>>> > Lorenzo Casalino <lorenzo.casalino93 at gmail.com> writes:
>>>> >
>>>> >>>>> - Should not impact compile time excessively (what is "excessive?")
>>>> >>>> Probably, such estimation should be performed on
>>>> >>> Did something get cut off here?
>>>> >> Oops. Yep, I removed a paragraph but, apparently, I forgot to remove
>>>> >> its first sentence. In any case, we should discuss how to
>>>> >> quantitatively determine an acceptable upper bound on the
>>>> >> compilation-time overhead and give a motivation for it. For
>>>> >> instance, a max n% overhead on the compilation time must be
>>>> >> guaranteed, because ** list of reasons **.
>>>> > I am not sure how we'd arrive at such a number or motivate/defend it.
>>>> > Do we have any sense of the impact of the existing metadata
>>>> > infrastructure?  If not I'm not sure we can do it for something
>>>> > completely new.  I think we can set a goal but we'd have to revise it
>>>> > as we gain experience.
>>>> I think it is the best approach to employ :)
>>>> >>> Since you initially raised the topic, do you want to take the lead
>>>> >>> in writing up a RFC?  I can certainly do it too but I want to give
>>>> >>> you right of first refusal.  :)
>>>> >>>                     -David
>>>> >> Uhm...actually, it wasn't me but Son Tuan, so the right of refusal
>>>> >> should be granted to him :) And I noticed now that he wasn't
>>>> >> included in CC of all our mails; I hope he was able to follow our
>>>> >> discussion anyways. I am adding him in this mail and let us wait if
>>>> >> he has any critical feature or point to discuss.
>>>> > Fair enough!  I have recently taken on a lot more work so
>>>> > unfortunately I can't devote a lot of time to this at the moment.
>>>> > I've got to clear out my pipeline first.  I'd be very happy to help
>>>> > review text, etc.
>>>> Do not worry, it is ok ;) While we wait for any feedback/input from
>>>> Son, I'll try to prepare a draft of the RFC and publish it here.
>>>>
>>>> Thank you David, and have a nice day :)
>>>>
>>>> -- Lorenzo
>>>>
>>>> >                  -David
>>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>