[LLVMdev] Adding ClamAV to the llvm testsuite (long)

Fri Dec 14 12:30:09 PST 2007

Hi,

I see that you are looking for new programs for the testsuite, as
described in 'Compile programs with the LLVM compiler' and 'Adding
programs to the llvm testsuite' on llvm.org/OpenProjects.

My favourite "C source code" is ClamAV (www.clamav.net), and I would
like to get it included in the testsuite.

This mail is kind of long, but please bear with me, as I want to clarify
how best to integrate ClamAV into the LLVM testsuite's build system.

Why include it?

It can be useful to find regressions, or new bugs; it already uncovered
a few bugs in llvm's cbe, llc, and optimizers that I've reported through
bugzilla (and they've been mostly fixed very fast! Thanks!). ClamAV was
also the "victim" of a bug in gcc 4.1.0's optimizer [see 9)]

It can be useful to test new/existing optimizations. There aren't any
significant differences in its performance when compiled by different
compilers (gcc, icc, llvm-gcc), so I hope LLVM's optimizers can (in the
future) make it faster ;)

I had a quick look at the build infrastructure, and there are some
issues (listed below) with getting it to work with programs that use
autoconf (such as ClamAV), since AFAICT programs in the testsuite aren't
allowed to run configure.

Building issues aside, there are some more questions:
* ClamAV is GPL (but it includes BSD, LGPL parts), ok for testsuite?
* what version to use? Latest stable, or latest svn?
[In any case I'll wait till the next stable is published, it should be
happening *very soon*]
* what happens if you find bugs that also cause it to fail under gcc
(unlikely)? [I would prefer to get an entry in ClamAV's bugzilla then,
with something in its subject saying it is llvm-testsuite related]
* what happens if it only fails under llvm-gcc/llc/clang, ... and it is
not due to a bug in llvm, but because of portability issues in the
source code (unlikely)?
I would prefer a ClamAV bugzilla entry here too; ClamAV is meant to be
"portable" :)

Also after I have set it up in the llvm testsuite, is there an easy way
to run clang on it? Currently I have to hack autoconf generated
makefiles if I want to test clang on it.

1. I've run configure manually and generated a clamav-config.h.
This usually just contains HAVE_* macros for headers, which should all
be available on a POSIX system, so it shouldn't be a problem from this
perspective for llvm's build farm.
However there are some target specific macros:
#define C_LINUX 1
#define FPU_WORDS_BIGENDIAN 0
#define WORDS_BIGENDIAN 0
Also SIZEOF_INT, SIZEOF_LONG,... but they are only used if the system
doesn't have a proper <stdint.h>
I'm also not sure about this one:
/* ctime_r takes 2 arguments */
#define HAVE_CTIME_R_2 1

What OS and CPU do the machines on llvm's buildfarm have? We could try a
config.h that works on Linux (or MacOSX), and try to apply
that to all, though there might be (non-obvious) failures.

Any solutions to having these macros defined in the LLVM testsuite
build? (especially for the bigendian macro)
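One possible approach is to generate just the target-specific macros at build time instead of shipping them in a fixed header. The following is a minimal sketch, not something the testsuite provides: the function names are hypothetical, it probes the machine that runs the script (so it would be wrong when cross-compiling), and it only emits C_LINUX because the macro names for other systems are not listed above.

```python
import platform
import struct
import sys


def target_macros(byteorder, system, double_one):
    """Return the target-specific macros as a dict of name -> int.

    byteorder:  'little' or 'big', as in sys.byteorder
    system:     OS name, as returned by platform.system()
    double_one: the 8 bytes of the double 1.0 in native layout
    """
    macros = {"WORDS_BIGENDIAN": 1 if byteorder == "big" else 0}
    # 1.0 is 0x3FF0000000000000: only the high 32-bit word is non-zero,
    # so non-zero bytes in the first half mean the high word comes first.
    macros["FPU_WORDS_BIGENDIAN"] = 1 if any(double_one[:4]) else 0
    if system == "Linux":
        macros["C_LINUX"] = 1
    return macros


def render(macros):
    """Format the macros as #define lines, sorted by name."""
    return "".join(
        f"#define {name} {value}\n" for name, value in sorted(macros.items())
    )


if __name__ == "__main__":
    host = target_macros(sys.byteorder, platform.system(),
                         struct.pack("=d", 1.0))
    sys.stdout.write(render(host))
```

The output could be written to a small header that is included ahead of the hand-made clamav-config.h, leaving the portable HAVE_* macros as they are.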

2. AFAICT the llvm-testsuite build doesn't support a program that is
built from multiple subdirectories.
libclamav has its source split into multiple subdirectories; gathering
those into one also requires changing #include directives that use
relative paths. I also get files with the same name but from different
subdirs, so I have to rename them to subdir_filename, and do that in the
#include directives too.

I have done this manually, and it works (native, llc, cbe work).
I could hack together some perl script to do this automatically, or is
there a better solution?
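The flattening step described above could be automated along these lines. This is a sketch in Python rather than perl, with hypothetical function names. It simplifies the manual procedure by renaming every file in a subdirectory to subdir_filename (not only the colliding ones), it only rewrites quoted #include paths that resolve to a file inside the tree, and it does not detect two different paths that happen to flatten to the same name.

```python
import os
import posixpath
import re

# Matches only quoted includes; <system> includes are left untouched.
INCLUDE_RE = re.compile(r'^(\s*#\s*include\s*")([^"]+)(")', re.MULTILINE)


def flat_name(rel):
    """'sub/util.h' -> 'sub_util.h'; top-level names are unchanged."""
    return rel.replace("/", "_")


def flatten_tree(src_root, dst_dir):
    """Copy every .c/.h file under src_root into dst_dir under a flat name,
    rewriting quoted #include paths that point at files in the tree."""
    rels = []
    for dirpath, _dirs, names in os.walk(src_root):
        for name in names:
            if name.endswith((".c", ".h")):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, src_root)
                rels.append(rel.replace(os.sep, "/"))
    known = set(rels)
    os.makedirs(dst_dir, exist_ok=True)

    for rel in rels:
        here = posixpath.dirname(rel)

        def rewrite(match, here=here):
            # Resolve relative to the including file first, then to the root.
            for base in (here, ""):
                target = posixpath.normpath(
                    posixpath.join(base, match.group(2)))
                if target in known:
                    return match.group(1) + flat_name(target) + match.group(3)
            return match.group(0)  # not part of the tree: leave as is

        src_path = os.path.join(src_root, rel)
        with open(src_path, encoding="latin-1", newline="") as src:
            text = INCLUDE_RE.sub(rewrite, src.read())
        dst_path = os.path.join(dst_dir, flat_name(rel))
        with open(dst_path, "w", encoding="latin-1", newline="") as dst:
            dst.write(text)
    return sorted(flat_name(rel) for rel in rels)
```

Calling flatten_tree("libclamav", "flat") would leave the original tree untouched and write the renamed copies into a separate directory, so the step can be rerun whenever the upstream sources change.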

3. Comparing output: I've written a small script that compares the
--debug output. The output needs some adjustment first, since it
contains memory addresses that obviously don't match up between runs.
There isn't anything else to compare besides the --debug output (apart
from ClamAV saying no virus was found), and it can be a fairly good test.
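The adjustment can be as small as masking the addresses before diffing. A minimal sketch, assuming the addresses are printed as 0x-prefixed hex (the actual --debug format is not shown here, and the sample lines in the usage note are made up); note that it also masks any other 0x-prefixed value, such as file offsets, which may hide real differences:

```python
import re

# Assumption: addresses appear as 0x-prefixed hex, e.g. from printf("%p").
ADDRESS_RE = re.compile(r"0x[0-9a-fA-F]+")


def normalize(debug_text):
    """Replace every 0x... value with a fixed placeholder."""
    return ADDRESS_RE.sub("0xADDR", debug_text)


def same_debug_output(expected, actual):
    """Compare two debug logs line by line, ignoring address values."""
    return normalize(expected).splitlines() == normalize(actual).splitlines()
```

With this, "buffer at 0x7ffd1234" and "buffer at 0x55aa0010" compare equal, while a line that differs in anything other than the hex value still causes a mismatch.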

4. What is the input data?
ClamAV is fast :)
It needs a lot of input data if you want to get reasonable timings out
of it (tens or hundreds of MB).
Scanning multiple small files will be I/O bound, and it'd be mostly
useless as a benchmark (though still useful for testing
compiler/optimization correctness).

So I was thinking of using some large files already available in the
testsuite (oggenc has one), and then maybe pointing it at the last
*stable* build of LLVM. Or finding some files that are scanned slowly,
but that don't require lots of disk I/O (an archive, with ratio/size
limits disabled, with highly compressible data).
You won't be able to benchmark ClamAV in a "real world" scenario though,
since that'd involve making it scan malware, and I'm sure you don't
want that on your build farm.

You could give it random data to scan, but it needs to be
reproducible, so scanning /dev/random, or /bin of the current LLVM tree,
is not a good choice ;)
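Random-looking but reproducible input can be produced from a fixed seed. The sketch below is one way to do that, with hypothetical names and an arbitrary default seed; it hashes a seed plus a counter with SHA-256 so that the bytes depend only on the seed and not on the platform or Python version. Such data is essentially incompressible, so it exercises the scanner differently from the archive idea above.

```python
import hashlib


def reproducible_blocks(seed, size):
    """Yield byte blocks totalling exactly `size` bytes, fixed by `seed`."""
    counter = 0
    remaining = size
    while remaining > 0:
        block = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        yield block[:remaining]  # the last block may be cut short
        remaining -= len(block)
        counter += 1


def write_input(path, seed=b"llvm-testsuite", size=1 << 20):
    """Write `size` reproducible bytes to `path`."""
    with open(path, "wb") as out:
        for block in reproducible_blocks(seed, size):
            out.write(block)
```

Generating the file at test time keeps a large binary out of the repository, at the cost of some setup time before the measured run.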

There's also the problem of eliminating the initial disk I/O time from
the benchmark; perhaps rerun it 3 times automatically, or something like
that?
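The rerun idea could look like the following sketch: run the same command several times, treat the first run as the cold-cache one, and report only the later runs. The helper names are hypothetical, the timings are wall-clock rather than whatever the testsuite itself measures, and the exit status is deliberately not checked.

```python
import statistics
import subprocess
import time


def time_runs(cmd, runs=3):
    """Run `cmd` (a list of arguments) `runs` times; return wall-clock times."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


def warm_summary(times):
    """Separate the first (cold-cache) run from the later, warm ones."""
    warm = times[1:] if len(times) > 1 else times
    return {
        "cold": times[0],
        "warm_min": min(warm),
        "warm_median": statistics.median(warm),
    }
```

For example, warm_summary(time_runs(cmd, runs=3)) reports the first run separately and summarizes the remaining two, where cmd is the scanner command line as a list of arguments.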

5. Library dependencies
It needs zlib; all the rest is optional (bzip2, gmp, ...). I think I
can reasonably assume zlib is available on all systems where the
testsuite is run.

6. Sample output when using 126 MB of data as input:

$ make TEST=nightly report
....
Program  | GCCAS  Bytecode LLC compile LLC-BETA compile JIT codegen | GCC   CBE   LLC   LLC-BETA JIT | GCC/CBE GCC/LLC GCC/LLC-BETA LLC/LLC-BETA
clamscan | 7.0729 2074308  *           *                *           | 17.48 17.55 18.81 *        *   | 1.00    0.93    n/a          n/a
7. ClamAV is multithreaded
If you're interested in testing whether llvm-generated code works when
multithreaded (I don't see why it wouldn't, but we're talking about a
testsuite), you'd need to start the daemon (running it as an
unprivileged user is just fine), and then connect to it.
Is it possible to tell the testsuite build system to do this?

8. Code coverage
Testing all of ClamAV's code with llvm is ... problematic. Unless you
create files with every packer/archiver known to ClamAV, it is likely
there will be files that are compiled in but never used during the
testsuite run. You can still test that these files compile, but that's
it.

9. Configure tests
Configure has 3 tests that check for gcc bugs known to break ClamAV (2
of which you already have, since those are in gcc's testsuite too).
Should I add them as separate "programs" to run in the llvm testsuite?

Thoughts?

Best regards,
Edwin