[llvm-dev] [RFC] Embedded bitcode and related upstream (Part II)

Tue Jun 14 18:13:24 PDT 2016

On Tue, Jun 14, 2016 at 6:09 PM Adrian Prantl <aprantl at apple.com> wrote:

> On Jun 13, 2016, at 9:37 AM, Steven Wu <stevenwu at apple.com> wrote:
>
> Thanks for the feedback! Replies inline.
>
> On Jun 12, 2016, at 11:44 PM, Eric Christopher <echristo at gmail.com> wrote:
>
> Hi Steven,
>
> Great to see the commentary and updates here. I've got a few questions
> about some of this work. It might be nice to see some separate RFCs for a
> couple of things, but we'll figure that out after you send out patches
> probably :)
>
> What needs to be improved:
>> 1. Whitelist for command line options that can be used with bitcode:
>> Current trunk implementation embeds all the cc1 command line options
>> (that includes header include paths, warning flags and other front-end
>> options) in the command line section. That is lot of redundant information.
>> To re-create the object file from the embedded optimized bitcode, most of
>> these options are useless. On the other hand, they can leak information of
>> the source code. One solution will be keeping a list of all the options
>> that can affect code generation but not encoded in the bitcode. I have
>> internally prototyped with disallowing these options explicitly and allowed
>> only the reminder of the  options to be embedded (
>> http://reviews.llvm.org/D17394). A better solution might be encoding
>> that information in "Options.td" as specific group.
>>
>
> This is really interesting. I'm not a particularly security minded person
> so I don't have a lot of commentary there. An explicit whitelist sounds a
> bit painful to keep maintained, explicitly having a group in Options.td
> sounds pretty nice. You'll need to add them to multiple groups, but it
> seems pretty nice.
>
>
> I have already implemented the new approach in
> http://reviews.llvm.org/D21230. It creates a new group for all the cc1
> options that can affect codegen but not having a corresponding attribute in
> the bitcode. When I wrote up this patch, I think it is also a good idea to
> extend the group to driver flags so clang driver can issue warnings when
> using these flags with LTO because they are likely to be dropped in the
> process. That is my next thing to do if someone reviews my patch and agrees
> that is right thing to do.
>
>
>
>> 2. Assembly input handling:
>> This is a workaround to allow source code written in assembly to work
>> with "-fembed-bitcode" options. When compiling assembly source code with
>> "-fembed-bitcode", clang-as creates an empty section "__LLVM, __asm" in the
>> object file. That is just a way to distinguish object files compiled from
>> assembly source from those compiled from higher level source code but
>> forgot to use "-fembed-bitcode" options. Linker can use this section to
>> diagnose if "-fembed-bitcode" is consistently used on all the object files
>> participated in the linking.
>>
>
> I'm surprised you want a separate and empty section and not a header flag
> as those are easier to keep around and won't take up a precious mach-o
> section. There are probably other options here as well. There are probably
> other options or concerns that someone shipping bitcode might have here as
> well, but I'm sure those are being talked about - doesn't have too much
> affect on the community though.
>
>
> I suppose you mean the alternative is to burn a macho command for that.
> Well, that is a limited resource and we don't have much left. Plus, using
> empty section will make this accessible to other binary format, not only
> macho files. I also have an interesting thought about handle the assembly,
> that is to wrap it in module assembly in a bitcode file. I am not sure it
> would preserve the all semantics of the original assembly and that would
> mean I need to somehow teach the assembler about bitcode (which might make
> this not very attractive). Yes, you might be right this doesn't affect the
> community, if no one else is interesting in a solution for the problem we
> have, then this might not be suitable for contributing. I am happy to keep
> it downstream.
>
>
> 3. Bitcode symbol hiding:
>> There was some concerns for leaking source code information when using
>> bitcode feature. One approach to avoid the leak is to add a pass which
>> renames all the globals and metadata strings. The also keeps a reverse map
>> in case the original name needs to be recovered. The final bitcode should
>> contain no more symbols or debug info than a stripped binary. To make sure
>> modified bitcode can still be linked correctly, the renaming need to be
>> consistent across all bitcode participated in the linking and everything
>> that is external of the linkage unit need to be preserved. This means the
>> pass can only be run during the linking and requires some LTO api.
>>
>
> How are you planning to ensure the safety of the reverse map? Seems that
> requiring linking is a bit icky, but might work. Are you mostly worried
> about function names that could be stripped out? What LTO api are you
> envisioning here?
>
>
> The reverse map is emitted as a separate file from the output
> binary/bitcode. It should not be shipped together with the binary output,
> just like dSYM bundle.
> The reason it needs to be done after linking is a limitation of the symbol
> hiding technique. It requires that the symbols must be resolved. Think
> about the following case:
> a.o:
> T export_symbol
> T global_symbol
> t local_symbol
> b.o:
> U global_symbol
>
> To make sure the bitcode after symbol hiding pass can still link and
> produce the same output, the pass need to rename them:
> a.o:
> T export_symbol    --> export_symbol (preserve)
> T global_symbol    --> hidden_symbol_1 (rename, but need to have the same
> name as the one in b.o)
> t local_symbol       --> hidden_symbol_2 (rename, but don't care what it
> becomes)
> b.o:
> U global_symbol   --> hidden_symbol_1
> The pass need to know what symbols to keep and a global renaming table so
> the names after renaming are consistent across all the modules.
>
>
>
>
>> 4. Debug info strip to line-tables pass:
>> As the name suggested, this pass strip down the full debug info to
>> line-tables only. This is also one of the steps we took to prevent the leak
>> of source code information in bitcode.
>>
>
> I'm very curious about what's going on here. Could you elaborate? :)
>
>
> Cc Adrian
> He would know more about it. I would only know that it can reconstruct
> -gline-tables-only debug info from full debug info. We use it as a part of
> the bitcode pipeline because we don't want the bitcode file to be
> exceedingly large but I can see this pass to be useful in other
> circumstances.
>
>
> It’s a pass that aims at downgrading full debug info to -gline-tables-only
> debug info. It performs a deep copy of the debug metadata graph, removing
> all types and variables while leaving all locations, subprograms, and
> inline information intact. It then makes the IR point to the newly created
> stripped subprograms, CU, and locations.
>
>
Sounds like what I'd expect.

How are you planning on hooking all of these up?

-eric

> -- adrian
>
>
> Steven
>
>
> Thanks a ton for the update - glad to see this being worked on!
>
> -eric
>
>
>>
>> Please let me know what do you think about the pieces above or if you
>> have any concerns about the methodology. I will put up patches for review
>> soon.
>>
>> Thanks
>>
>> Steven
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160615/ac550396/attachment-0001.html>