[llvm-dev] Adding a new External Suite to test-suite

Fernando Magno Quintao Pereira via llvm-dev llvm-dev at lists.llvm.org
Mon Apr 6 17:24:32 PDT 2020


Hi Johannes,

> All the use cases sound reasonable but why do we need these kind of "weird files" to do this?
>
> I mean, why would you train or measure something on single definition translation units and not on the original ones, potentially one function at a time?

I think that's the fundamental question :) The short answer is that it
is hard to compile the files from open-source repositories
automatically. The weird files that you mentioned appear due to the
type inference that we run on them. Let me give you some data and tell
you the whole story.

One of the benchmark collections distributed on our website consists
of 529,498 C functions and their respective LLVM bitcodes. To build
it, we extracted 698,449 functions from the C files that we
downloaded, with sizes varying from one line to 45,263 lines of code
(the largest being a function from Radare2's assembler). Thus, we
produced an initial code base of 698,449 C files, each containing a
single function.
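
To make this concrete, here is a hypothetical example of what one of
these single-function files looks like right after extraction (the
names and types are invented for illustration). The function was
pulled out of its translation unit, so the struct, the enum constant
and the global it depends on are missing, and the file does not
compile on its own:

    /* Hypothetical extracted function. The struct, the enum constant
     * and the global were defined elsewhere in the original project,
     * so this file alone does not compile. */
    int count_ready(struct task_list *l) {
      int n = 0;
      for (struct task *t = l->head; t; t = t->next)
        if (t->state == READY && t->prio > min_prio)
          n++;
      return n;
    }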

We ran Psyche-C (http://cuda.dcc.ufmg.br/psyche-c/) with a timeout of
30 seconds per function on this code base. Psyche-C was able to
reconstruct the dependencies of 529,498 functions, thus ensuring their
compilation. By compilation we mean the generation of an object file
out of the function.
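
For the hypothetical function above, the reconstructed file would look
roughly like the sketch below (this is not actual Psyche-C output,
just an illustration of the idea). The synthesized declarations are
only detailed enough to produce an object file; they are not the
original project's definitions, and this is where the "weird" stub
types you noticed come from:

    /* Declarations synthesized by type inference (sketch only). */
    enum { READY };
    int min_prio;
    struct task { int state; int prio; struct task *next; };
    struct task_list { struct task *head; };

    int count_ready(struct task_list *l) {
      int n = 0;
      for (struct task *t = l->head; t; t = t->next)
        if (t->state == READY && t->prio > min_prio)
          n++;
      return n;
    }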

Out of the 698,449 functions, 31,935 were directly compilable as-is,
that is, without type inference. To test automatic compilation, we
invoke clang on the whole C file from which a function came. In case
of success, we count as compilable every function with a body within
that file. Hence, without type inference, we could ensure compilation
of 4.6% of the programs. With type inference, we could ensure
compilation of 75.8% of all the programs. Failures to reconstruct
types were mostly due to macro uses that are not syntactically valid C
before preprocessing. Only 3,666 functions could not be reconstructed
within the allotted 30-second time slot.

So, we can automatically compile only about 5% of the functions that
we download, even considering all the dependencies in the C files
where these functions exist. Nevertheless, given that we can download
millions of functions, 5% is already enough to give us a
non-negligible number of benchmarks. However, these compilable
functions tend to be very small: the median function size is seven
LLVM instructions (in contrast with more than 60 once we use type
inference). Such functions are unlikely to contain features such as
arrays of structs, type casts, recursive types, double-pointer
dereferences, etc.

> To me this looks like a really good way to skew the input data set, e.g., you don't ever see a call that can be inlined or for which inter-procedural reasoning is performed. As a consequence each function is way smaller than it would be in a real run, with all the consequences on the results obtained from such benchmarks. Again, why can't we take the original programs instead?

Well, in the end, just using the naturally compilable functions leads
to poor predictions. For instance, using these compilable functions,
YaCoS (the framework that we have been using) reduces the size of
MiBench's Bitcount by 10%, whereas using AnghaBench it achieves 16.9%.
In Susan, the naturally compilable functions lead to a code-size
increase of 5.4%, whereas AnghaBench reduces size by 1.7%. Although we
can find benchmarks in MiBench where the naturally compilable
functions lead to better code reduction, those gains tend to be very
close to the ones obtained with AnghaBench, and such cases are rare.

About inlining, you are right: there will be no inlining. To get
around this problem, we also have a database of 15K whole files, each
of which may contain multiple functions. The programs are available
here: http://cuda.dcc.ufmg.br/angha/files/suites/angha_wholefiles_all_15k.tar.gz

Regards,

Fernando

