PATCH: Make TSVC benchmarks static data layout more predictable (PR14076)

Sat Apr 6 18:51:41 PDT 2013

----- Original Message -----
> From: "Daniel Dunbar" <daniel at zuster.org>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Jakob Stoklund Olesen" <stoklund at 2pi.dk>, "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Saturday, April 6, 2013 4:17:49 PM
> Subject: Re: PATCH: Make TSVC benchmarks static data layout more predictable (PR14076)
> 
> 
> On Sat, Apr 6, 2013 at 2:01 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> > From: "Daniel Dunbar" < daniel at zuster.org >
> > To: "Hal Finkel" < hfinkel at anl.gov >, "Jakob Stoklund Olesen" <
> > stoklund at 2pi.dk >, "Commit Messages and Patches for LLVM"
> > < llvm-commits at cs.uiuc.edu >
> > Sent: Saturday, April 6, 2013 2:40:41 PM
> > Subject: PATCH: Make TSVC benchmarks static data layout more
> > predictable (PR14076)
> > 
> > 
> > Hi Hal,
> > 
> > 
> > As currently written, the performance of the TSVC benchmarks can
> > depend very heavily on the exact address assignment of the global
> > data arrays. Since that is something we do not model in the
> > compiler (and are not likely to anytime soon), this makes them
> > suboptimal: they are not testing what they purport to test.
> > 
> > 
> > The attached patch fixes this problem by moving all of the data
> > arrays into a single structure. This makes the static data layout
> > more predictable, so that the relative addresses are stable across
> > all platforms.
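
For illustration, a minimal sketch of the single-structure layout described above. The names `global_data`, `LEN`, and `offset_a_to_b` are hypothetical and are not taken from the patch; the point is only that the distance between two members is fixed by the struct layout rather than by where the linker happens to place separate globals.

```c
#include <stddef.h>

/* Hypothetical array length; the real benchmark sources define their own. */
#define LEN 32000

/* All data arrays live in one object instead of in separate globals. */
struct global_data {
    float a[LEN];
    float b[LEN];
    float c[LEN];
};

/* A single definition: the whole struct is placed as one unit. */
struct global_data global_data;

/* Byte distance from the start of a[] to the start of b[].
   It depends only on the struct layout, not on link-time placement. */
size_t offset_a_to_b(void) {
    return offsetof(struct global_data, b) - offsetof(struct global_data, a);
}
```

With separate global arrays the equivalent distance is whatever the toolchain picks; here it is a compile-time constant.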
> > 
> > 
> > In particular, this resolves:
> > http://llvm.org/bugs/show_bug.cgi?id=14076
> > 
> > because, at least on OS X but presumably on other platforms too,
> > two of the arrays in some of the benchmarks are very likely to end
> > up at exact 4K offsets from each other. This causes severe
> > performance degradation on some Intel platforms; in the worst
> > case, on StatementReordering-flt, this can double the runtime of
> > the benchmark. See:
> > http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/ug_docs/GUID-C801145A-A066-4C1A-B744-2B51AD89EFF6.htm
> > for more information.
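
A small sketch of how one might check for the condition described above, namely two addresses that sit at an exact multiple of 4K from each other. The helper name `same_4k_offset` is hypothetical; it simply compares the low 12 bits of the two addresses.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when p and q differ by an exact multiple of 4096 bytes,
   i.e. when their low 12 address bits are identical. */
bool same_4k_offset(const void *p, const void *q) {
    return (((uintptr_t)p ^ (uintptr_t)q) & 0xFFFu) == 0;
}
```

Calling this on the base addresses of two global arrays in a built benchmark would show whether a given link happened to produce the problematic placement.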
> > 
> > 
> > Even worse, on some Intel architectures (Sandy Bridge, at least),
> > when this problem is hit, the runtime of the benchmark is no
> > longer predictable and can vary by up to 100% run-to-run (!!!).
> > 
> > 
> > I wrote the patch in such a way that I don't think it should impair
> > the compiler's ability to perform any vectorization optimizations,
> > but wanted to run it past you before committing it.
> > 
> > 
> > What do you think?
> 
> First, thanks for working on this! I think using your patch will
> necessitate runtime overlap checks on any vectorization (because
> without IPO there is otherwise no way to determine that the pointers
> point to disjoint memory regions). This will also cut out a lot of
> BB-vectorization opportunities. Clang does not respect restrict on
> non-function parameters, right? In that case, we might need to pass
> the arrays to each function through restrict parameters.
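
As a sketch of the restrict-parameter idea mentioned above: the function name `add_arrays` and its signature are hypothetical, not part of TSVC. Qualifying the parameters with C99 `restrict` tells the compiler that, inside the function, the three arrays are not accessed through each other's pointers, so no runtime overlap check is needed to vectorize the loop.

```c
#include <stddef.h>

/* Hypothetical kernel: each array is passed in through a restrict
   parameter, so the compiler may assume the regions do not overlap. */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

The caller is responsible for honoring that promise; passing overlapping arrays where one of them is written is undefined behavior.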
> 
> 
> 
> Hmm, you are probably right. The patch was silly though; there is no
> need to initialize the global array pointers at runtime. Attached is a
> revised version which just statically initializes them to point into
> the global structure; AA should be able to look through them now and
> see they don't alias. I verified that AA did this on a trivial
> function, and I generated the IR with -O3 -fvectorize before and
> after and spot checked that structurally the same things are going
> on. Look ok?
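
A minimal sketch of the statically initialized variant described above. All names here (`packed_arrays`, `N_ELEMS`, `xs`, `ys`, `scale`) are hypothetical, and making the pointers `const` is an assumption of this sketch rather than something the message states. Because each pointer has a constant initializer that names a distinct member of one global object, its target is visible at compile time.

```c
#include <stddef.h>

/* Hypothetical array length. */
#define N_ELEMS 1024

struct packed_arrays {
    float x[N_ELEMS];
    float y[N_ELEMS];
};

struct packed_arrays packed_arrays;

/* Pointers initialized statically, with no runtime setup code.
   'const' (an assumption here) means they can never be re-pointed. */
float *const xs = packed_arrays.x;
float *const ys = packed_arrays.y;

/* Hypothetical kernel that uses the arrays through the global pointers. */
void scale(void) {
    for (size_t i = 0; i < N_ELEMS; ++i)
        xs[i] = 2.0f * ys[i];
}
```

Whether a given compiler version actually proves the two regions disjoint is best confirmed by inspecting the generated IR, as the message describes.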

Sounds good. Please commit.

Thanks again,
Hal

> 
> 
> - Daniel
> 
> 
> 
> -Hal
> 
> 
> 
> > 
> > 
> > I attached a snapshot from a before and after run with the patch
> > applied, showing the graphs from some of the benchmarks in the TSVC
> > suite that trigger this problem. Each run was done with 5 samples,
> > and as you can see, in the old version the runtime of the
> > benchmarks is highly variable.
> > 
> > 
> > Thanks,
> > - Daniel