[llvm-dev] The AnghaBench collection of compilable programs

Sat Feb 22 06:55:03 PST 2020

Dear LLVMers,

    We, at UFMG, have been building a large collection of compilable
benchmarks. Today, we have one million C files, mined from open-source
repositories, that compile to LLVM bitcode (and from there to
object files). To ensure that every file compiles, we perform type
inference on the C programs. Type inference lets us replace the
dependencies that are missing from the mined files.

The benchmarks are available at: http://cuda.dcc.ufmg.br/angha/

We have a technical report describing the construction of this
collection: http://lac.dcc.ufmg.br/pubs/TechReports/LaC_TechReport012020.pdf

Much can be done with this many LLVM bitcode files. A few examples
follow:

* We can autotune compilers. We have trained YaCoS, a tool that
searches for good optimization sequences, with code size as the
objective function. We find the best optimization sequence for each
program in the database. To compile an unknown program, we look up
the closest program in the database and apply the same optimization
sequence. Results are good: we improve on clang -Oz by almost 10%
on MiBench, for instance.

* We can perform many types of explorations on real-world code. For
instance, we have found that 95.4% of all the interference graphs of
these programs, even in machine code (no phi-functions and lots of
pre-colored registers), are chordal.

* We can check how well different tools are doing on real-world code.
For instance, we can use these benchmarks to check how many programs
can be analyzed by Ultimate Buchi Automizer
(https://ultimate.informatik.uni-freiburg.de/downloads/BuchiAutomizer/).
This tool tries to prove that a program either terminates or has an
infinite execution.

* We can check how many programs can be compiled by different
high-level synthesis tools into FPGAs. We have tried LegUp and Vivado,
for instance.

* Our webpage contains a search box, so that you can retrieve the
programs closest to a given input program. Currently, we measure
program distance as the Euclidean distance on Namolaru feature vectors.
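
To make the "closest program" lookup concrete, here is a minimal
sketch in Python. It is not the code behind YaCoS or the webpage: the
database layout, the three-element feature vectors, and the pass lists
are made-up placeholders, and real Namolaru vectors have many more
features.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_program(query, database):
    """Return the (name, entry) pair whose features are nearest to `query`.

    `database` maps a program name to a dict with the (hypothetical)
    keys "features" and "passes".
    """
    return min(database.items(),
               key=lambda item: euclidean(query, item[1]["features"]))


# Hypothetical database: names, feature values, and pass lists are
# placeholders for illustration only.
DATABASE = {
    "prog_a.c": {"features": [3.0, 1.0, 0.0],
                 "passes": ["mem2reg", "simplifycfg"]},
    "prog_b.c": {"features": [10.0, 4.0, 2.0],
                 "passes": ["mem2reg", "instcombine", "dce"]},
}

# An "unknown" program reuses the pass sequence of its nearest neighbor.
name, entry = closest_program([9.0, 5.0, 2.0], DATABASE)
print(name, entry["passes"])
```

A linear scan like this costs one distance computation per database
entry; with a million entries, a spatial index would likely be
preferable, but that choice is outside what this message describes.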

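On the chordality result mentioned above: a graph is chordal when
every cycle of four or more vertices has a chord. One standard way to
test this is maximum cardinality search followed by a check that the
visit order, reversed, is a perfect elimination ordering. The Python
sketch below illustrates that textbook test; it is not necessarily how
we ran the experiment, and the adjacency-dict representation is an
assumption made for the example.

```python
def is_chordal(adj):
    """Return True if the undirected graph `adj` is chordal.

    `adj` maps each vertex to the set of its neighbors (no self-loops,
    edges listed in both directions).
    """
    # Maximum cardinality search: always visit the unvisited vertex
    # that has the most already-visited neighbors.
    weight = {v: 0 for v in adj}
    unvisited = set(adj)
    order = []
    while unvisited:
        v = max(unvisited, key=lambda u: weight[u])
        unvisited.remove(v)
        order.append(v)
        for w in adj[v]:
            if w in unvisited:
                weight[w] += 1

    # The graph is chordal iff, for every vertex, its earlier-visited
    # neighbors form a clique. It suffices to check them against the
    # most recently visited one among them.
    position = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = [w for w in adj[v] if position[w] < position[v]]
        if not earlier:
            continue
        parent = max(earlier, key=lambda w: position[w])
        for w in earlier:
            if w != parent and w not in adj[parent]:
                return False
    return True


# A 4-cycle has no chord; adding the edge 0-2 makes it chordal.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
square_with_chord = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(is_chordal(square), is_chordal(square_with_chord))
```

This version is quadratic in the number of vertices because of the
linear `max` scan; bucket-based implementations run in linear time.
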
We do not currently provide inputs for those programs. It is possible
to execute the so-called "leaf functions", i.e., functions that do not
call other routines. We have thousands of them. However, we do not
guarantee the absence of undefined behavior during their execution.
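
As a rough illustration of what identifying leaf functions can look
like, the Python sketch below scans textual LLVM IR (as produced by
clang -S -emit-llvm) and reports the defined functions whose bodies
contain no call, invoke, or callbr instruction. It is a line-based
heuristic written for this message, not the tool we use: the regular
expressions are simplifications, quoted symbol names are not handled,
and calls to intrinsics count as calls.

```python
import re

# Start of a function definition, e.g. "define i32 @square(i32 %x) {".
DEFINE_RE = re.compile(r'^define\b.*?@([\w.$]+)\s*\(')

# A call-like instruction, with an optional "%result = " prefix and an
# optional tail-call marker.
CALL_RE = re.compile(
    r'^(?:%[\w.$-]+\s*=\s*)?'
    r'(?:(?:tail|musttail|notail)\s+)?'
    r'(?:call|invoke|callbr)\b'
)


def leaf_functions(ir_text):
    """Return names of defined functions that contain no call instruction."""
    leaves = []
    current = None      # name of the function being scanned, if any
    is_leaf = False
    for line in ir_text.splitlines():
        stripped = line.strip()
        match = DEFINE_RE.match(stripped)
        if match:
            current, is_leaf = match.group(1), True
        elif current is not None and stripped == "}":
            if is_leaf:
                leaves.append(current)
            current = None
        elif current is not None and CALL_RE.match(stripped):
            is_leaf = False
    return leaves


# Hand-written sample IR, for illustration only.
SAMPLE_IR = """
define i32 @square(i32 %x) {
entry:
  %mul = mul nsw i32 %x, %x
  ret i32 %mul
}

define i32 @twice_square(i32 %x) {
entry:
  %call = call i32 @square(i32 %x)
  %add = add nsw i32 %call, %call
  ret i32 %add
}
"""

print(leaf_functions(SAMPLE_IR))
```

Being a leaf only means the function can be run in isolation; as noted
above, it says nothing about undefined behavior for a given input.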

Regards,

Fernando