[llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer
Adam Nemet via llvm-dev
llvm-dev at lists.llvm.org
Tue May 3 10:29:19 PDT 2016
> On May 3, 2016, at 10:21 AM, Adam Nemet <anemet at apple.com> wrote:
>
>
>> On May 3, 2016, at 3:40 AM, Hahnfeld, Jonas via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hello all,
>>
>> I've been wondering why Clang doesn't generate non-temporal stores when
>> compiling the STREAM benchmark [1] and therefore doesn't yield optimal
>> results.
>>
>> It turned out that the Loop Vectorizer correctly vectorizes the arithmetic
>> operations and also merges the loads and stores into vector operations.
>> However it doesn't add the '!nontemporal' metadata which would be needed for
>> maximal bandwidth on X86.
Also, MichaelZ introduced builtins last year to manually force the generation of non-temporal loads and stores: __builtin_nontemporal_load and __builtin_nontemporal_store. I believe these are documented.
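
As a rough sketch of how the store builtin could be applied by hand to a STREAM-style kernel: the NT_STORE macro and the stream_scale function are names made up for this illustration, and the plain-store fallback is only there so the snippet still builds on compilers that lack the builtin.

```c
#include <stddef.h>

/* Older compilers may not provide __has_builtin at all. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Use Clang's builtin when available; otherwise fall back to a regular store.
   NT_STORE is a hypothetical helper, not part of any library. */
#if __has_builtin(__builtin_nontemporal_store)
#define NT_STORE(val, ptr) __builtin_nontemporal_store((val), (ptr))
#else
#define NT_STORE(val, ptr) (*(ptr) = (val))
#endif

/* STREAM-style "scale" kernel: each element of a is written exactly once
   and never read back inside the loop. */
void stream_scale(double *a, const double *b, double scalar, size_t n) {
    for (size_t i = 0; i < n; ++i)
        NT_STORE(scalar * b[i], &a[i]);
}
```

Whether the backend actually emits a streaming store for this depends on the target and on the alignment it can prove, so treat it as a starting point for experiments rather than a guarantee.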
>> I briefly looked into this and for non-temporal memory instructions to work,
>> the memory address would have to be aligned to the vector length, which
>> currently isn't the case either.
>>
>> To summarize, the following things would be needed to give non-temporal
>> hints:
>> 1) Ensure correct alignment of merged vector memory instructions
>> This could be implemented by executing the first (scalar) loop iterations
>> until the addresses for loads and stores are aligned, similar to what already
>> happens for the remainder of the loop. The larger alignment would also allow
>> aligned vector instructions instead of the currently unaligned ones.
>>
>> 2) Give non-temporal hints when different array elements are only used once
>> per loop iteration
>> We probably need to analyze the different loads and stores per loop iteration
>> for this…
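
To make point 1 concrete, here is a hand-written sketch of the peeling idea in plain C. It is not what the Loop Vectorizer emits; VEC_BYTES, peel_count and scale_peeled are hypothetical names, the 32-byte width is an assumption, and the code assumes the destination is at least aligned to sizeof(double).

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 32 /* assumed vector width in bytes; hypothetical */

/* Number of leading scalar iterations needed before &a[i] is
   VEC_BYTES-aligned. Assumes a is aligned to sizeof(double). */
size_t peel_count(const double *a, size_t n) {
    uintptr_t misalign = (uintptr_t)a % VEC_BYTES;
    if (misalign == 0)
        return 0;
    size_t peel = (VEC_BYTES - misalign) / sizeof(double);
    return peel < n ? peel : n;
}

void scale_peeled(double *a, const double *b, double s, size_t n) {
    const size_t lanes = VEC_BYTES / sizeof(double);
    size_t i = 0;
    size_t peel = peel_count(a, n);

    /* Prologue: scalar iterations until the store address is aligned. */
    for (; i < peel; ++i)
        a[i] = s * b[i];

    /* Main body: whole vectors; every group of stores starts on an
       aligned address, so aligned or streaming stores would be legal. */
    for (; i + lanes <= n; i += lanes)
        for (size_t j = 0; j < lanes; ++j)
            a[i + j] = s * b[i + j];

    /* Epilogue: scalar remainder, as the vectorizer already does today. */
    for (; i < n; ++i)
        a[i] = s * b[i];
}
```

Note that peeling can only align one pointer in general. Here the store address is the one chosen; the loads from b stay unaligned unless b happens to share the same misalignment.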
>
> You probably also want to ensure that you stay in the loop long enough, i.e. have some sort of dynamic trip-count check or PGO data indicating this.
>
> You essentially want to ensure that reads after the loop would not have hit in the cache even with regular stores. (If you are writing a large area in the loop, a large percentage of those lines are already evicted by the time you exit the loop.)
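
A minimal sketch of what such a dynamic trip-count check might look like, assuming a fixed cache-size threshold. ASSUMED_LLC_BYTES and worth_streaming are hypothetical names, and the 8 MiB figure is only a placeholder; a real heuristic would need target information or PGO data.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: assumed last-level cache size in bytes. */
#define ASSUMED_LLC_BYTES ((size_t)8 * 1024 * 1024)

/* Returns true when the loop writes more bytes than the assumed cache
   holds, so most written lines would be evicted before any later read. */
bool worth_streaming(size_t trip_count, size_t elem_size) {
    if (elem_size == 0)
        return false;
    /* Divide instead of multiplying to avoid overflow. */
    return trip_count > ASSUMED_LLC_BYTES / elem_size;
}
```

A loop could then be versioned on this predicate, taking the non-temporal path only when it returns true and the regular-store path otherwise.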
>
> Adam
>
>>
>> Any thoughts or any ongoing work that I'm missing?
>>
>> Thanks,
>> Jonas
>>
>>
>> [1] https://www.cs.virginia.edu/stream/
>>
>> --
>> Jonas Hahnfeld, MATSE-Auszubildender
>>
>> IT Center
>> Group: High Performance Computing
>> Division: Computational Science and Engineering
>> RWTH Aachen University
>> Seffenter Weg 23
>> D 52074 Aachen (Germany)
>> Hahnfeld at itc.rwth-aachen.de
>> www.itc.rwth-aachen.de
>>
>>
>