[llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer

Tue May 3 10:25:16 PDT 2016

----- Original Message -----
> From: "Adam Nemet via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Jonas Hahnfeld" <Hahnfeld at itc.rwth-aachen.de>
> Cc: "llvm-dev (llvm-dev at lists.llvm.org)" <llvm-dev at lists.llvm.org>
> Sent: Tuesday, May 3, 2016 12:21:07 PM
> Subject: Re: [llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer
> 
> 
> > On May 3, 2016, at 3:40 AM, Hahnfeld, Jonas via llvm-dev
> > <llvm-dev at lists.llvm.org> wrote:
> > 
> > Hello all,
> > 
> > I've been wondering why Clang doesn't generate non-temporal stores when
> > compiling the STREAM benchmark [1] and therefore doesn't yield optimal
> > results.
> > 
> > It turned out that the Loop Vectorizer correctly vectorizes the
> > arithmetic operations and also merges the loads and stores into vector
> > operations. However, it doesn't add the '!nontemporal' metadata, which
> > would be needed for maximal bandwidth on X86.
> > I briefly looked into this: for non-temporal memory instructions to
> > work, the memory address would have to be aligned to the vector length,
> > which currently isn't the case either.
> > 
> > To summarize, the following things would be needed to give non-temporal
> > hints:
> > 1) Ensure correct alignment of merged vector memory instructions
> > This could be implemented by executing the first (scalar) loop
> > iterations until the addresses for loads and stores are aligned, similar
> > to what already happens for the remainder of the loop. The larger
> > alignment would also allow aligned vector instructions instead of the
> > currently unaligned ones.
> > 
> > 2) Give non-temporal hints when different array elements are only used
> > once per loop iteration
> > We probably need to analyze the different loads and stores per loop
> > iteration for this…
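
For concreteness, here is a hand-written sketch of what point 1) amounts to for a STREAM-style scale kernel. It is not what the Loop Vectorizer emits today; the function name scale_stream is hypothetical, the SSE2 intrinsics stand in for the vector stores the vectorizer would generate, and the code assumes the destination is at least 8-byte aligned (otherwise it simply stays in the scalar loop). Only the store address is peeled to alignment here; the loads remain unaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* STREAM-style scale kernel: a[i] = s * b[i].
 * Hypothetical helper for illustration only. */
void scale_stream(double *a, const double *b, double s, size_t n)
{
    size_t i = 0;

    /* Prologue: scalar iterations until a + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(a + i) & 15u) != 0) {
        a[i] = s * b[i];
        ++i;
    }
    /* Either the array is exhausted or the store address is aligned. */
    assert(i == n || ((uintptr_t)(a + i) & 15u) == 0);

#if defined(__SSE2__)
    /* Main loop: unaligned loads, aligned non-temporal stores. */
    __m128d vs = _mm_set1_pd(s);
    for (; i + 2 <= n; i += 2) {
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_stream_pd(a + i, _mm_mul_pd(vs, vb));
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#endif

    /* Epilogue: remaining scalar iterations (the whole loop without SSE2). */
    for (; i < n; ++i)
        a[i] = s * b[i];
}
```

Calling scale_stream(buf + 1, b, 3.0, n) on a 16-byte-aligned buf runs one scalar iteration before entering the streaming loop. Whether streaming stores actually pay off depends on how much data the loop writes, which is what the trip-count discussion below is about.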
> 
> You probably also want to ensure that you stay in the loop long enough,
> i.e. have some sort of a dynamic trip-count check or PGO data indicating
> this.

This sounds right. Also, I'll point out that LLVM essentially does not have a memory-hierarchy model on which such decisions could be based. Work in this area would be welcome.

 -Hal

> You essentially want to ensure that reads after the loop would not have
> hit in the cache even with regular stores. (If you are writing a large
> area in the loop, a large percentage of those lines have already been
> evicted by the time you exit the loop.)
> 
> Adam
> 
> > 
> > Any thoughts or any ongoing work that I'm missing?
> > 
> > Thanks,
> > Jonas
> > 
> > 
> > [1] https://www.cs.virginia.edu/stream/
> > 
> > --
> > Jonas Hahnfeld, MATSE-Auszubildender
> > 
> > IT Center
> > Group: High Performance Computing
> > Division: Computational Science and Engineering
> > RWTH Aachen University
> > Seffenter Weg 23
> > D 52074  Aachen (Germany)
> > Hahnfeld at itc.rwth-aachen.de
> > www.itc.rwth-aachen.de
> > 
> > 
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory