[llvm-dev] Adding a new External Suite to test-suite

Mon Apr 6 20:49:53 PDT 2020

On 4/6/20 7:24 PM, Fernando Magno Quintao Pereira wrote:
 > Hi Johannes,
 >
 >> All the use cases sound reasonable but why do we need these kind of 
"weird files" to do this?
 >>
 >> I mean, why would you train or measure something on single definition
 >> translation units and not on the original ones, potentially one
 >> function at a time?
 >
 > I think that's the fundamental question :) The short answer is that it
 > is hard to compile the files from open-source repositories
 > automatically. The weird files that you mentioned appear due to the
 > type inference that we run on them. Let me give you some data and tell
 > you the whole story.
 >
 > One of the benchmark collections distributed in our website consists
 > of 529,498 C functions and their respective LLVM bytecodes. Out of
 > these files, we extracted 698,449 functions, sizes varying from one
 > line to 45,263 lines of code (Radare2's assembler). Thus, we produced
 > an initial code base of 698,449 C files, each file containing a single
 > function.

So there are a lot of functions, most of them small. This didn't really
answer my questions and instead explained how you got here. Why do we
want functions in isolation to begin with? How many unique functions are
there after normalization? How do we measure/account for the selection
bias of this entire scheme (= 30 second timeout, systematic problems
with the interference, shortcomings of the extractor or other tools,
...)?

 > We run Psyche-C (http://cuda.dcc.ufmg.br/psyche-c/) with a timeout of
 > 30 seconds on this code base. Psyche-C has been able to reconstruct
 > dependencies of 529,498 functions; thus, ensuring their compilation.
 > Compilation consists in the generation of an object file out of the
 > function.

TBH, I'm not even sure what this means. What dependences are you talking
about here?

 > Out of the 698,449 functions, 31,935 were directly compilable as-is,
 > that is, without type inference. To perform automatic compilation, we
 > invoke clang onto a whole C file. In case of success, we count as
 > compilable every function with a body within that file. Hence, without
 > type inference, we could ensure compilation of 4.6% of the programs.
 > With type inference, we could ensure compilation of 75,8% of all the
 > programs. Failures to reconstruct types were mostly due to macros that
 > were not syntactically valid in C without preprocessing. Only 3,666
 > functions could not be reconstructed within the allotted 30-second
 > time slot.
 >
 > So, we compile automatically less about 5% of the functions that we
 > download, even considering all the dependencies in the C files where
 > these functions exist. Nevertheless, given that we can download
 > millions of functions, 5% is already enough to give us a
 > non-negligible number of benchmarks. However, these compilable
 > functions tend to be very small. The median number of LLVM bytecodes
 > is seven (in contrast with >60 once we use type inference). Said
 > functions are unlikely to contain features such as arrays of structs,
 > type casts, recursive types, double pointer dereferences, etc.
 >
 >
 >> To me this looks like a really good way to skew the input data set,
 >> e.g., you don't ever see a call that can be inlined or for which
 >> inter-procedural reasoning is performed. As a consequence each
 >> function is way smaller than it would be in a real run, with all the
 >> consequences on the results obtained from such benchmarks. Again, why
 >> can't we take the original programs instead?
 >
 > Well, in the end, just using the compilable functions leads to poor
 > predictions. For instance, using these compilable functions, YaCoS
 > (it's the framework that we have been using) reduces the size of
 > MiBench's Bitcount by 10%, whereas using AnghaBench, it achieves
 > 16.9%. In Susan, the naturally compilable functions lead to an
 > increase of code size (5.4%), whereas AnghaBench reduces size by 1.7%.
 > Although we can find benchmarks in MiBench where the naturally
 > compilable  functions lead to better code reduction, these gains tend
 > to be very close to those obtained by AnghaBench, and seldom occur.

You are making my point here. You say, using even "simpler" functions that
"compile" by default might not be the best way to learn from code and
that using "more complex" functions can really help. Now make the same
argument but with different cutoffs for "simple" and "complex", e.g.,
these single functions with type inference applied are simple and real
code is complex. Sure, you need more "machinery" to compile real code,
but you already use (your research) machinery (Psyche-C) in the first
scenario.

One could also argue that the code came initially with build receipts we
should not ignore. With cmake we can generate compile commands, you can
use a wrapper to catch the invocations of the compiler, ...  why
shouldn't we then be able to compile the code (potentially after
preprocessing with the interceped options)?

 > About inlining, you are right: there will be no inlining. To get
 > around this problem, we also have a database of 15K whole files, which
 > contains files with multiple functions. The programs are available
 > here: 
http://cuda.dcc.ufmg.br/angha/files/suites/angha_wholefiles_all_15k.tar.gz

What is the benefit of the 700k single functions collection over this
one? Are these original source files or modified?

Cheers,
   Johannes

 > Regards,
 >
 > Fernando