PATCH: Make TSVC benchmarks static data layout more predictable (PR14076)

Sat Apr 6 14:01:35 PDT 2013

----- Original Message -----
> From: "Daniel Dunbar" <daniel at zuster.org>
> To: "Hal Finkel" <hfinkel at anl.gov>, "Jakob Stoklund Olesen" <stoklund at 2pi.dk>, "Commit Messages and Patches for LLVM"
> <llvm-commits at cs.uiuc.edu>
> Sent: Saturday, April 6, 2013 2:40:41 PM
> Subject: PATCH: Make TSVC benchmarks static data layout more predictable (PR14076)
> 
> 
> Hi Hal,
> 
> 
> As currently written, the performance of the TSVC benchmarks can
> depend very heavily on the exact address assignment of the global
> data arrays. Since that is something we do not model in the compiler
> (and are not likely to anytime soon), this makes them suboptimal as
> benchmarks: they are not testing what they purport to test.
> 
> 
> The attached patch fixes this problem by moving all of the data
> arrays into a single structure. This makes the static data layout
> more predictable, because the relative addresses of the arrays are
> then stable across all platforms.
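
To illustrate the idea described above, here is a minimal sketch of the two layouts. It is not taken from the patch: the names `tsvc_data`, `data`, `offset_a_to_b`, `add_arrays`, and the value of `LEN` are all assumptions chosen for illustration.

```c
#include <stddef.h>

#define LEN 32000            /* hypothetical array length, for illustration */

/* Before: independent globals. The linker picks each base address, so the
 * distance between a, b, and c can differ per platform and per build. */
/* float a[LEN], b[LEN], c[LEN]; */

/* After: one aggregate. The distance between members is fixed by the
 * struct layout rather than by the linker. */
struct tsvc_data {           /* hypothetical name */
    float a[LEN];
    float b[LEN];
    float c[LEN];
};

static struct tsvc_data data;

/* Distance in bytes from the start of a to the start of b. */
size_t offset_a_to_b(void) {
    return offsetof(struct tsvc_data, b) - offsetof(struct tsvc_data, a);
}

/* Example kernel that accesses the struct members directly. */
void add_arrays(void) {
    for (int i = 0; i < LEN; ++i)
        data.a[i] = data.b[i] + data.c[i];
}
```

With this layout the relative offsets are a property of the source, so padding members could be added between the arrays if a particular spacing needs to be avoided.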
> 
> 
> In particular, this resolves:
> http://llvm.org/bugs/show_bug.cgi?id=14076
> 
> because, at least on OS X but presumably on other platforms too, two
> of the arrays in some of the benchmarks are very likely to end up at
> exact 4K offsets from each other. This causes severe performance
> degradation on some Intel platforms; in the worst case, on
> StatementReordering-flt, this can double the runtime of the
> benchmark. See:
> http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/ug_docs/GUID-C801145A-A066-4C1A-B744-2B51AD89EFF6.htm
> for more information.
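
As a rough way to check whether two arrays land in the problematic configuration, one can compare the low 12 bits of their addresses. The sketch below assumes that "exact 4K offsets" means the distance between the two addresses is a multiple of 4096 bytes; the helper name `same_4k_offset` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALIAS_STRIDE 4096u   /* 4 KiB; must be a power of two for the mask */

/* Returns true when p and q agree in their low 12 address bits, i.e. when
 * the distance between them is a multiple of 4 KiB. */
bool same_4k_offset(const void *p, const void *q) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    return ((a ^ b) & (ALIAS_STRIDE - 1u)) == 0;
}
```

Calling this on the base addresses of two benchmark arrays at startup would show whether a given build is exposed to the effect.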
> 
> 
> Even worse, on some Intel architectures (Sandy Bridge, at least),
> when this problem is hit the runtime of the benchmark is no longer
> predictable and can vary by up to 100% run-to-run (!!!).
> 
> 
> I wrote the patch in such a way that I don't think it should impair
> the compiler's ability to perform any vectorization optimizations,
> but wanted to run it past you before committing it.
> 
> 
> What do you think?

First, thanks for working on this! I think using your patch will necessitate runtime overlap checks on any vectorization (because, without IPO, there is no other way to determine that the pointers point to disjoint memory regions). This will also cut out a lot of BB-vectorization opportunities. Clang does not respect restrict on anything other than function parameters, right? In that case, we might need to pass the arrays to each function through restrict parameters.
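
A minimal sketch of the restrict-parameter approach, under the assumption that the arrays live in a single struct: the kernel receives each array through a `restrict`-qualified parameter, which asserts to the compiler that the arrays do not overlap. The names `kernel_data`, `add_kernel`, `run_add`, and `KLEN` are hypothetical and not from the patch.

```c
#include <stddef.h>

#define KLEN 1024            /* hypothetical array length, for illustration */

struct kernel_data {         /* hypothetical aggregate holding the arrays */
    float a[KLEN];
    float b[KLEN];
    float c[KLEN];
};

/* The restrict qualifiers promise that a, b, and c do not alias, so the
 * loop can be vectorized without a runtime overlap check. */
void add_kernel(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Caller passes the struct members as separate pointers. */
void run_add(struct kernel_data *d) {
    add_kernel(d->a, d->b, d->c, KLEN);
}
```

The promise is only valid because the three members are distinct arrays; passing overlapping pointers to `add_kernel` would be undefined behavior.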

 -Hal

> 
> 
> I attached a snapshot from a before and after run with the patch
> applied, showing the graphs from some of the benchmarks in the TSVC
> suite that trigger this problem. Each run was done with 5 samples,
> and as you can see, in the old version the runtime of the
> benchmarks is highly variable.
> 
> 
> Thanks,
> - Daniel