[cfe-dev] [RFC] Embedding compilation database info in object files.

Mon Jul 22 19:26:54 PDT 2013

On Mon, Jul 22, 2013 at 4:39 PM, Joshua Cranmer <pidgeot18 at gmail.com> wrote:

>  On 7/22/2013 5:12 PM, Sean Silva wrote:
>
> On Mon, Jul 22, 2013 at 2:27 PM, Joshua Cranmer <pidgeot18 at gmail.com> wrote:
>
>> On 7/22/2013 3:26 PM, Sean Silva wrote:
>>
>>> In dealing with game teams, each one may use a different (possibly
>>> custom/private) build system/mashup of build systems, many of which are
>>> closed source/proprietary (e.g. MSBuild.exe from Visual Studio). I'm trying
>>> to come up with a solution that will work independently of the build
>>> system, or at least with as few assumptions as possible (things like "they
>>> have access to their final build products, since otherwise how would they
>>> run them" and "they can modify the compiler flags"). Like I said in the OP,
>>> I was able to rapidly extract a compilation database from a completely
>>> unfamiliar (closed-source, proprietary) build system (that I still don't
>>> understand!).
>>>
>>
>>  The implicit assumptions for your approach amount to the following:
>> 1. The user can make their build system use clang.
>> 2. The user can make their build system add compiler flags to clang.
>> 3. The user can find all of the final build products.
>> 4. The build system does not mutilate binaries for the final build
>> products in a way that would render this unnecessary, or if this is false,
>> the build system retains an intermediate copy of the products that has not
>> yet been mutilated, and these intermediate copies can be checked.
>> 5. The binary targets are capable of having this information, and capable
>> of having this information extracted from this easily.
>> 6. Adding this extra information would not cause the build system to fail.
>> 7. The user is willing to add all of this extra information to their
>> final build products, or to apply a post-processing step to extract all of
>> this extra information.
>> 8. The set of all build steps may be found in the union of all final
>> build products.
>>
>> Number 3 can be less trivial than it seems, particularly if you don't
>> think to add de-duplication steps.
>
>
>  I think I have already adequately addressed this issue. See my responses
> to Manuel and David Blaikie.
>
>
>> Number 4 is definitely not universally true (I've used some build systems
>> which mutilate the final product into a custom binary format)--and may be
>> generally false in the embedded world.
>
>
>  I have already presented one scheme where the data is embedded as a
> single, otherwise-unremarkable string literal. Consider the absurdity of a
> scenario where the build system can fail to carry a string literal through
> into the final build product (for all it knows, it is being used as the
> argument to printf). The only case I can think of that would make this
> difficult is one where the final build product is being e.g.
> compressed/encrypted, in which case an uncompressed/unencrypted version is
> highly likely to be around in a well-defined location (do you know of any
> real setup where this is not the case?).
>
>
> It depends on what you mean and what you consider "valid" as a build
> system. In one scenario, what I'd have access to is this:
> <http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-central-android-x86/1374500809/>.
> The main binaries are compressed via some format that `file` is not able to
> identify.
>

Interesting. My `file` says the .apk is just a zip file (which it is).
Anyway, googling for "android apk" rapidly identifies the format; it took me
less than 5 minutes to get everything unpacked, although it did require a
recursive unzip. Auto-extracting common archives (recursively) is probably
something that we would want to do in the "worst case, find everything" tool
(again, this is about a dozen lines of Python). If the build you
linked to had been built embedding compilation database info in the build
products, I would probably be able to attach a compile_commands.json to
this very message! Compare this to the alternative, which would be to find
the configuration that controls which files end up on this website (may not
even be a build script checked into the repo), and ensure that the
compile_commands.json file (or whatever) gets correctly added to it. Again,
I'm not saying that this is a turnkey solution for every case; I'm saying
that it's a consistent (huge) simplification of the problem space that
applies basically everywhere, and automating the last mile to get the
compile_commands.json is a matter of simple scripts.
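
As a rough illustration of the recursive unpacking step mentioned above,
here is a minimal sketch using only Python's standard library. The function
name `unpack_recursively` and the `.unpacked` directory suffix are
assumptions made for this example; only zip-based containers (.zip, .apk,
.jar) are handled, and a real tool would likely need other archive formats.

```python
import os
import zipfile


def unpack_recursively(archive, dest):
    """Extract the zip archive at `archive` into `dest`, then unpack any
    zip archives found among the extracted files (hypothetical helper)."""
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)

    # Collect nested archives first so the directory walk is finished
    # before any new directories are created underneath `dest`.
    nested = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(dest)
        for name in files
        if zipfile.is_zipfile(os.path.join(root, name))
    ]
    for inner in nested:
        # Assumed naming scheme: unpack next to the nested archive.
        unpack_recursively(inner, inner + ".unpacked")
```

Calling `unpack_recursively("build.apk", "out")` (file names hypothetical)
would leave the fully expanded tree under `out/`, ready to be scanned.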

>
>> Number 5 I think may be iffy, and I can think of situations to make
>> number 6 not true.
>
> Again, I'm not aware of any build system that will not respect a string
> literal that the compiler embeds as being necessary. The number of
> scenarios where the scheme will work is therefore a superset of the cases
> where printf is available, for example.
>
>
> I think there are tools that strip unused symbols from binaries, and I
> suspect these would be fairly widely used in embedded toolchains. My
> limited experience with such toolchains leads me to the conclusion that
> mucking with binaries in any fashion isn't going to reliably solve the
> build-chain problem.
>

The fact of the matter is that clang can emit a string in a call to printf
(or whatever) that definitely won't get removed. Since clang would be
adding this information itself, it can embed the string in the same way to
ensure it isn't removed. A tool that recursively walks an entire directory
tree would still work in the embedded case: almost by definition, software
with size constraints tight enough to need this kind of stripping step is
going to be a simple local build (it isn't big enough to require a
distributed one), and so you can just walk the build directory.
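
The directory-walking tool described above might look roughly like the
following sketch. The thread only says the data is embedded as a string
literal; the `@@CDB_BEGIN@@` / `@@CDB_END@@` delimiters, the JSON payload
between them, and the name `collect_entries` are all assumptions made for
illustration, not a format clang emits.

```python
import json
import os
import re

# Hypothetical delimiters: the thread does not specify how an embedded
# entry is framed inside the binary.
BEGIN = b"@@CDB_BEGIN@@"
END = b"@@CDB_END@@"
ENTRY_RE = re.compile(re.escape(BEGIN) + b"(.*?)" + re.escape(END), re.DOTALL)


def collect_entries(build_dir):
    """Walk `build_dir`, scan every file for embedded entries, and return
    the de-duplicated list of decoded entries."""
    seen = set()
    entries = []
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            try:
                with open(os.path.join(root, name), "rb") as f:
                    data = f.read()  # whole-file read; fine for a sketch
            except OSError:
                continue  # unreadable file: skip it
            for match in ENTRY_RE.finditer(data):
                payload = match.group(1)
                if payload in seen:
                    continue  # same entry already found in another file
                seen.add(payload)
                try:
                    entries.append(json.loads(payload.decode("utf-8")))
                except ValueError:
                    continue  # bytes between the markers were not valid JSON
    return entries
```

Writing the result out with `json.dump(collect_entries("build"), fh)` would
give a compile_commands.json-shaped file, assuming each embedded payload is
one compilation database entry.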

>
>  I know of at least one real use case where 3 and 4 are not met (I can
> elaborate somewhat, but it is internal so the description will have to be made
> in appropriately broad strokes). If you want to deliberately exclude that
> use case from consideration, please state that explicitly. It may be that
> these different ideas cover different subsets of the possible build
> configurations (although I have yet to be presented with a real scenario
> where embedding the info in build products will simply not work, but the
> "write to a file on the side" one will).
>
>
> In my experience, it is relatively easy to get even a hostile build system
> you have little control over to give you the contents of an extra file at
> the end (this has included such steps as "cat a tarball to stdout" and
> using a tool to extract the data from uploaded log files). Indeed, if you
> *can* build locally, then log-to-file *will* work.
>

You seem to tout no-postprocessing as an important advantage of this
approach, so do you intend to atomically update a .json file containing a
JSON array? I suppose it would work, but it would be almost farcical (seek
to the end, read backwards to find the closing `]`, etc.) and possibly
require cross-platform maintenance for the file locking parts.
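
For concreteness, the in-place append described above (seek to the end,
scan backwards for the closing `]`, rewrite the tail under a lock) could be
sketched as follows. This is an illustration under assumptions, not anyone's
proposed implementation: the names `append_entry` and `_last_non_space` are
hypothetical, the locking uses `fcntl.flock` and is therefore POSIX-only
(which is exactly the cross-platform concern), and the update is serialized
between writers but not crash-safe.

```python
import fcntl
import json
import os


def _last_non_space(f, end):
    """Return (offset, byte) of the last non-whitespace byte before `end`,
    or (None, b"") if there is none."""
    pos = end
    while pos > 0:
        pos -= 1
        f.seek(pos)
        ch = f.read(1)
        if not ch.isspace():
            return pos, ch
    return None, b""


def append_entry(path, entry):
    """Append `entry` to the JSON array stored in `path`, creating it if needed."""
    payload = json.dumps(entry).encode("utf-8")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # released when the file is closed
        f.seek(0, os.SEEK_END)
        close_pos, ch = _last_non_space(f, f.tell())
        if close_pos is None:
            # Empty (or whitespace-only) file: start a new array.
            f.seek(0)
            f.truncate()
            f.write(b"[\n" + payload + b"\n]\n")
            return
        prev_pos, prev = _last_non_space(f, close_pos)
        if ch != b"]" or prev_pos is None:
            raise ValueError("file does not end with a JSON array")
        # No comma is needed when the array is still empty ("[]").
        sep = b"" if prev == b"[" else b","
        f.seek(prev_pos + 1)
        f.truncate()
        f.write(sep + b"\n" + payload + b"\n]\n")
```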

> On the other hand, emitting stuff into binaries assumes that build systems
> won't munge binaries or will at least leave unmunged binaries lying around
> (I have very little faith in build systems), that this data can be reliably
> extracted from binaries, all in the hopes that it will work on some limited
> cases where the build system is completely unable to build locally.
>

You have yet to concretely describe any such "munging" procedure which
cannot be worked around with simple local scripting.

Regardless, I think the issue of forming a consistent view of the project
across configurations is a much harder problem that neither of our
suggested approaches really deals with. For example, surely there is code
that the compiler only sees for that android build; how do we join that
information with all the other configurations (mac build, windows build,
stock linux build, etc.), each of which has its own #ifdef'd code, so that,
say, renaming a variable actually renames all uses/declarations?

It's really not clear to me how to do that reliably; do you have any
suggestions? Would simply unioning all the compilation databases be enough
(and hoping you cover all cases)? How would we deal with "user-configurable"
options (compared to "platform" options, which would probably at least have
buildbot coverage)?
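
The "simply union everything" option raised above is at least easy to state
in code. A minimal sketch, assuming each database has already been loaded
as a list of compile_commands.json-style entries; the name
`union_databases` is made up for this example. Entries that differ in any
field (for instance the same file compiled with different flags per
platform) are all kept, so this says nothing about whether the union
actually covers every #ifdef branch.

```python
import json


def union_databases(databases):
    """Merge several compilation databases (lists of dict entries),
    dropping exact duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for db in databases:
        for entry in db:
            # Canonical string form so dict key order does not matter.
            key = json.dumps(entry, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(entry)
    return merged
```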

-- Sean Silva