[llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer

Adam Nemet via llvm-dev llvm-dev at lists.llvm.org
Tue May 3 10:29:19 PDT 2016


> On May 3, 2016, at 10:21 AM, Adam Nemet <anemet at apple.com> wrote:
> 
> 
>> On May 3, 2016, at 3:40 AM, Hahnfeld, Jonas via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 
>> Hello all,
>> 
>> I've been wondering why Clang doesn't generate non-temporal stores when
>> compiling the STREAM benchmark [1] and therefore doesn't yield optimal
>> results.
>> 
>> It turned out that the Loop Vectorizer correctly vectorizes the arithmetic
>> operations and also merges the loads and stores into vector operations.
>> However, it doesn't add the '!nontemporal' metadata, which would be needed for
>> maximal bandwidth on X86.

Also, MichaelZ introduced builtins last year to manually force the generation of non-temporal loads and stores: __builtin_nontemporal_load and __builtin_nontemporal_store.  I believe these are documented.
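
As a rough illustration, here is a minimal sketch of how the store builtin could be applied by hand to a STREAM-style triad kernel. The names stream_triad and NT_STORE are only illustrative, and the fallback to a plain store for compilers that lack the builtin is my own assumption, not something the builtin provides:

```c
#include <stddef.h>

/* Use Clang's builtin when it is available; otherwise fall back to a
   regular store so the sketch still compiles elsewhere (the fallback is
   an assumption made for this example). */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_store)
#    define NT_STORE(val, ptr) __builtin_nontemporal_store((val), (ptr))
#  endif
#endif
#ifndef NT_STORE
#  define NT_STORE(val, ptr) (*(ptr) = (val))
#endif

/* STREAM-style triad: a[i] = b[i] + q * c[i]. */
void stream_triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double q, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        NT_STORE(b[i] + q * c[i], &a[i]);
}
```

With Clang, the builtin marks the store with '!nontemporal' in the IR; which instruction the backend finally picks depends on the target and on the alignment it can prove.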

>> I briefly looked into this, and for non-temporal memory instructions to work,
>> the memory address would have to be aligned to the vector length, which
>> currently isn't the case either.
>> 
>> To summarize, the following things would be needed to give non-temporal
>> hints:
>> 1) Ensure correct alignment of merged vector memory instructions
>> This could be implemented by executing the first (scalar) loop iterations
>> until the addresses for loads and stores are aligned, similar to what already
>> happens for the remainder of the loop. The larger alignment would also allow
>> aligned vector instructions instead of the currently unaligned ones.
>> 
>> 2) Give non-temporal hints when different array elements are only used once
>> per loop iteration
>> We probably need to analyze the different loads and stores per loop iteration
>> for this…
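
To make point 1 concrete, here is a source-level sketch of the loop shape it describes: a scalar prologue that runs until the store address is aligned, an aligned main body, and the usual scalar remainder. This is not what the vectorizer emits today. VEC_BYTES, peel_count and scale_peeled are hypothetical names, and the 32-byte width is just an assumed vector size:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vector width in bytes (32 would match AVX); an assumption. */
#define VEC_BYTES 32

/* Number of leading scalar iterations needed before p reaches a
   VEC_BYTES boundary, capped at n. Assumes p is already aligned to
   sizeof(double). */
size_t peel_count(const double *p, size_t n)
{
    size_t misalign = (uintptr_t)p % VEC_BYTES;
    size_t peel = misalign ? (VEC_BYTES - misalign) / sizeof(double) : 0;
    return peel < n ? peel : n;
}

/* a[i] = q * b[i], split into prologue / aligned body / remainder. */
void scale_peeled(double *a, const double *b, double q, size_t n)
{
    const size_t lanes = VEC_BYTES / sizeof(double);
    size_t i = 0;
    size_t peel = peel_count(a, n);

    /* 1) Scalar prologue: run until &a[i] is VEC_BYTES-aligned. */
    for (; i < peel; ++i)
        a[i] = q * b[i];

    /* 2) Main body: every block of `lanes` stores starts on an aligned
          address, so it could become one aligned vector store. */
    for (; i + lanes <= n; i += lanes)
        for (size_t j = 0; j < lanes; ++j)
            a[i + j] = q * b[i + j];

    /* 3) Scalar remainder for the leftover iterations. */
    for (; i < n; ++i)
        a[i] = q * b[i];
}
```

Note that peeling can only align one pointer. In this sketch it is the store stream, since that is the access that would carry the non-temporal hint; the loads from b may still be unaligned.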
> 
> You probably also want to ensure that you stay in the loop long enough, i.e. have some sort of a dynamic-trip count check or PGO data indicating this.
> 
> You essentially want to ensure that reads after the loop would not have hit in the cache even with regular stores.  (If you are writing a large area in the loop, a large percentage of those lines will already have been evicted by the time you exit the loop.)
> 
> Adam
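
A dynamic trip-count check of the kind mentioned above might look roughly like the following. The function name prefer_nontemporal and the choice of the last-level cache size as the cutoff are assumptions for illustration only; the right threshold is target-specific and would need tuning:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical heuristic: prefer non-temporal stores only when the loop
   writes more bytes than the last-level cache can hold, so most of the
   written lines would have been evicted before the loop exits anyway. */
bool prefer_nontemporal(size_t trip_count, size_t bytes_per_iter,
                        size_t llc_bytes)
{
    if (bytes_per_iter == 0)
        return false;
    /* Divide instead of multiplying to avoid overflow on huge trip counts. */
    return trip_count > llc_bytes / bytes_per_iter;
}
```

The result would select between a version of the loop with regular stores and one with non-temporal stores, much like the existing runtime checks select between the scalar and vector loops.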
> 
>> 
>> Any thoughts or any ongoing work that I'm missing?
>> 
>> Thanks,
>> Jonas
>> 
>> 
>> [1] https://www.cs.virginia.edu/stream/
>> 
>> --
>> Jonas Hahnfeld, MATSE-Auszubildender
>> 
>> IT Center
>> Group: High Performance Computing
>> Division: Computational Science and Engineering
>> RWTH Aachen University
>> Seffenter Weg 23
>> D 52074  Aachen (Germany)
>> Hahnfeld at itc.rwth-aachen.de
>> www.itc.rwth-aachen.de