[llvm-dev] Metadata in LLVM back-end

Wed Jun 16 13:42:32 PDT 2021

Thanks for the update, Lorenzo.

I have some free time to work on an RFC, but I'm unfamiliar with how the
implementation details would work.

If I dig through this thread and try to draft something, would you and/or
Son be willing to contribute?

Thanks,
Matt

On Wed, Jun 16, 2021 at 12:02 PM Lorenzo Casalino <
lorenzo.casalino93 at gmail.com> wrote:

> Hello Matt,
>
> I think the RFC drafting went stale some months ago due to the heavy
> workload all the participants were subject to.
>
> As of now, I do not know when the RFC will be actually drafted and sent.
>
> Cheers,
> Lorenzo
>
> On 16 June 2021 at 1:32 AM, Matt Morehouse <mascasa at google.com> wrote:
>
> 
> Did anyone send an RFC for this?
>
> First-class metadata would be exceptionally useful for sanitizers and
> other dynamic tools.  For example, we want to construct PC-keyed metadata
> tables in the binary (without affecting the generated code), to inform
> program behavior at runtime or to allow offline analysis.  A prerequisite
> is to actually propagate the metadata we need from the Clang frontend or
> LLVM middle-end down to the assembly printer.
>
> Our team has brainstormed many use cases:
>
> - *GWP-TSan* <https://youtu.be/2KvaKEyMVEU>:  storing PCs of accesses
>   lowered from C++ atomics, to filter them out from race detection.
>   *  List<atomic access PC>
>
> - *Stack trace compression*:  storing a conservative call graph
>   <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151044.html>, for
>   use in decompressing stack traces offline.
>   *  Map[callsite PC] -> List<callee PC>
>
> - *no_sanitize attributes*:  storing a map of functions that have the
>   no_sanitize("...") attribute to the associated sanitizer, for filtering
>   out from GWP-*San.  Ideally we do not introduce new no_sanitize string
>   literals, but simply rely on existing ones (e.g. a no_sanitize("thread")
>   works for both TSan and GWP-TSan).
>   *  Map[Func] -> SanitizerKind
>
> - *Fuzzing aid/CFG reconstruction*:  marking coverage PCs as function
>   entry/exit or # of outgoing edges from BB (allows finding gaps in the
>   coverage frontier).
>
> - *Type-aware malloc and heap profiling*:  enable the allocator to get
>   the type for a given new call, to optimize for expected usage of the
>   allocation.
>   *  Map[new callsite PC] -> object type
>
> - *Other*:  potential use cases for future bug-finding tools (GWP-assert,
>   GWP-MSan, GWP-DFSan, GWP-UBSan).
>
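A minimal sketch of how an offline tool might query the PC-keyed tables
listed above, assuming they have already been extracted from the binary
into (PC, payload) pairs. The PCTable class, the helper names, and all
addresses are hypothetical; the on-disk layout and the way the compiler
would emit the tables are left open in this thread.

```python
from bisect import bisect_left


class PCTable:
    """Sorted, read-only table mapping a PC to a payload (hypothetical layout)."""

    def __init__(self, entries):
        # entries: iterable of (pc, payload); sort once so lookups can bisect.
        pairs = sorted(entries, key=lambda entry: entry[0])
        self._pcs = [pc for pc, _ in pairs]
        self._payloads = [payload for _, payload in pairs]

    def lookup(self, pc):
        """Return the payload recorded for exactly this PC, or None."""
        i = bisect_left(self._pcs, pc)
        if i < len(self._pcs) and self._pcs[i] == pc:
            return self._payloads[i]
        return None


# List<atomic access PC>: the payload is just a presence flag.
atomic_pcs = PCTable((pc, True) for pc in [0x401A30, 0x401120, 0x4015F8])

# Map[callsite PC] -> List<callee PC>: a conservative call graph.
call_graph = PCTable([
    (0x401200, [0x402000, 0x402080]),
    (0x401050, [0x403000]),
])


def is_atomic_access(pc):
    """True if the access at this PC was lowered from a C++ atomic."""
    return atomic_pcs.lookup(pc) is True


def possible_callees(callsite_pc):
    """Callee PCs recorded for a call site; empty list if unknown."""
    return call_graph.lookup(callsite_pc) or []
```

Keeping each table sorted by PC lets a runtime or an offline analyzer
answer a query with a binary search, without touching the generated code.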
> First-class metadata would open the door to some really cool things.
>
> Thanks,
> Matt Morehouse
>
>
> On Wed, Jan 6, 2021 at 5:56 AM Lorenzo Casalino via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Dear Tuan,
>>
>> How are you doing? Did you manage to start the draft for the RFC?
>>
>>
>> I take this opportunity to wish you all the best for this new year :)
>>
>> Best regards,
>> Lorenzo Casalino
>> On 10/11/20 at 09:27, Lorenzo Casalino wrote:
>>
>>
>> On 09/11/20 at 00:30, Son Tuan VU wrote:
>>
>> Hi,
>>
>> Thank you all for keeping this going. I was not aware that the
>> discussion was still going on; I am really sorry for this late reply.
>>
>> Nice to hear from you again! Thank you for starting this thread ;)
>>
>> I understand Chris' point about metadata design. Either the metadata
>> becomes stale or is removed (if we do not teach transformations to preserve
>> it), or we end up modifying many (if not all) transformations to keep the
>> data intact.
>> Currently in the IR, I feel like the default behavior is to ignore/remove
>> the metadata, and only a limited number of transformations know how to
>> maintain and update it, which is a best-effort approach.
>> That being said, my initial thought was to adopt this approach to the
>> MIR, so that we can at least have a minimal mechanism to communicate
>> additional information to various transformations, or even dump it to the
>> asm/object file.
>> In other words, it is the responsibility of the users who introduce/use
>> the metadata in the MIR to teach the transformations they selected how to
>> preserve their metadata. A common API to abstract this would definitely
>> help, just as combineMetadata() from lib/Transforms/Utils/Local.cpp does.
>>
>> Unfortunately, I have never worked with LLVM IR metadata (I have focused
>> almost exclusively on the back-end and only scratched the surface of
>> LLVM's middle-end), but I see your point.
>>
>> Clearly, applying the needed modifications to all the back-end
>> transformations/optimizations is infeasible and, probably, not worth it:
>> different users may have different requirements/needs regarding a
>> specific pass.
>>
>> I like the idea of a common API to handle the MIR metadata, and of
>> letting the end user handle such data. Of course, if the community
>> encounters common cases while handling the metadata, such cases may be
>> integrated into the upstream project.
>>
>> Nonetheless, the main point of this thread is to preserve middle-end
>> metadata down to the back-end, right after the Instruction Selection
>> phase. Hence, regardless of the end user's needs, a "preserve-all"
>> policy is required during the lowering stage, which will involve some
>> changes, in particular in the DAGCombine pass.
>>
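To make the two policies discussed above concrete, here is a toy model
(plain Python, not LLVM's API) contrasting a best-effort merge, where a
pass keeps only the metadata kinds it understands, with a "preserve-all"
merge that never drops anything. Metadata is modeled as a dict from kind
to value; the function names and the kinds "sec.secret" and "src.line" are
made up for illustration.

```python
def merge_best_effort(kept, replaced, known_kinds):
    """Keep a kind only if the pass knows it and both instructions agree."""
    return {
        kind: value
        for kind, value in kept.items()
        if kind in known_kinds and replaced.get(kind) == value
    }


def merge_preserve_all(kept, replaced):
    """Keep every kind from both instructions; record both values on conflict."""
    merged = {kind: [value] for kind, value in kept.items()}
    for kind, value in replaced.items():
        values = merged.setdefault(kind, [])
        if value not in values:
            values.append(value)
    return merged


# Two instructions being combined into one, each carrying its own metadata.
load_a = {"sec.secret": "key", "src.line": "a.c:3"}
load_b = {"sec.secret": "key", "src.line": "a.c:7"}

# A pass that was never taught about any kind silently drops everything.
untaught = merge_best_effort(load_a, load_b, known_kinds=set())

# A pass taught about "sec.secret" keeps it, because both sides agree.
taught = merge_best_effort(load_a, load_b, known_kinds={"sec.secret"})

# Preserve-all keeps both kinds and both conflicting "src.line" values.
preserved = merge_preserve_all(load_a, load_b)
```

In this model the security property survives the best-effort merge only
when the pass has been taught about its kind, which is the cost Son
describes; preserve-all avoids the loss but leaves conflict resolution to
whoever consumes the metadata.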
>>
>> As for my use case, it is also security-related. However, I do not
>> consider the metadata to be a compilation "correctness" criterion: metadata,
>> by definition (from the LLVM IR), can be safely removed without affecting
>> the program's correctness.
>> If possible, I would like to have more details on Lorenzo's use case in
>> order to see how metadata would interfere with the program's correctness.
>>
>> I would really like to discuss the details here but, unfortunately, I am
>> working on a publication and thus cannot disclose any details :(
>>
>> However, by "correctness" I do not mean "I/O correctness", but the
>> preservation of a security property expressed in the front-end (e.g.,
>> specified in the source code) or in the middle-end (e.g., specified in
>> the LLVM IR, for instance by a transformation pass).
>>
>> From a security point of view, removing or altering metadata does not
>> interfere with the I/O functionality of the code (although it may affect
>> performance), but it may introduce vulnerabilities.
>>
>> As for the RFC, I can definitely try to write one, but this would be my
>> first time doing so. But maybe it is better to start with Lorenzo's
>> proposal, as you have already been working on this? Please tell me if you
>> prefer me to start the RFC though.
>>
>> It is the first time for me too, do not worry!
>>
>> We could just use any other RFC as a template to get started :D
>>
>> I think that a structure like the following would be fine:
>>
>>   1. Background
>>      1.1 Motivation
>>      1.2 Use-cases
>>      1.3 Other approaches
>>   2. Goal(s)
>>   3. Requirements
>>   4. Drawbacks and main bottlenecks
>>   5. Design sketch
>>   6. Roadmap sketch
>>   7. Potential future development
>>
>> It may be a bit overkill; you are warmly invited to cut/refine these
>> points!
>>
>> And... no, I still have no sketch of the RFC; sorry, I have had quite a
>> workload these days.
>>
>> Yes, you can start the write up of the RFC.
>>
>> Quoting David:
>>
>>   "Since you first raised the topic [...] I want to give you right of
>> first refusal."
>>
>>
>> Have a nice day!
>>
>> -- Lorenzo
>>
>> Thank you again for keeping this going.
>>
>> Sincerely,
>>
>> - Son
>>
>> On Wed, Nov 4, 2020 at 6:30 PM Lorenzo Casalino <
>> lorenzo.casalino93 at gmail.com> wrote:
>>
>>>
>>> On 04/11/20 at 17:40, David Greene wrote:
>>> > Sorry about the late reply.
>>> >
>>> > Lorenzo Casalino <lorenzo.casalino93 at gmail.com> writes:
>>> >
>>> >>>>> - Should not impact compile time excessively (what is "excessive?")
>>> >>>> Probably, such estimation should be performed on
>>> >>> Did something get cut off here?
>>> >> Oops. Yep, I removed a paragraph but apparently forgot to remove its
>>> >> first sentence. In any case, we should discuss how to quantitatively
>>> >> determine an acceptable upper bound on the compile-time overhead and
>>> >> give a motivation for it. For instance, a max n% overhead on
>>> >> compilation time must be guaranteed, because ** list of reasons **.
>>> > I am not sure how we'd arrive at such a number or motivate/defend it.
>>> > Do we have any sense of the impact of the existing metadata
>>> > infrastructure?  If not I'm not sure we can do it for something
>>> > completely new.  I think we can set a goal but we'd have to revise it
>>> > as we gain experience.
>>> I think it is the best approach to employ :)
>>> >>> Since you initially raised the topic, do you want to take the lead in
>>> >>> writing up a RFC?  I can certainly do it too but I want to give you
>>> >>> right of first refusal.  :)
>>> >>>                     -David
>>> >> Uhm...actually, it wasn't me but Son Tuan, so the right of refusal
>>> >> should be granted to him :) And I just noticed that he wasn't CC'd
>>> >> on all our mails; I hope he was able to follow our discussion
>>> >> anyway. I am adding him to this mail; let us wait and see whether he
>>> >> has any critical feature or point to discuss.
>>> > Fair enough!  I have recently taken on a lot more work so unfortunately
>>> > I can't devote a lot of time to this at the moment.  I've got to clear
>>> > out my pipeline first.  I'd be very happy to help review text, etc.
>>> Do not worry, it is ok ;) While we wait for any feedback/input from
>>> Son, I'll try to prepare a draft of the RFC and publish it here.
>>>
>>> Thank you David, and have a nice day :)
>>>
>>> -- Lorenzo
>>>
>>> >                  -David
>>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>