[llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer

Hahnfeld, Jonas via llvm-dev llvm-dev at lists.llvm.org
Tue May 3 03:40:22 PDT 2016

Hello all,

I've been wondering why Clang doesn't generate non-temporal stores when
compiling the STREAM benchmark [1] and therefore doesn't yield optimal
bandwidth.

It turned out that the Loop Vectorizer correctly vectorizes the arithmetic
operations and also merges the loads and stores into vector operations.
However, it doesn't add the '!nontemporal' metadata which would be needed for
maximal bandwidth on X86.
I briefly looked into this: for non-temporal memory instructions to work,
the memory address would have to be aligned to the vector length, which
currently isn't the case either.
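
For reference, the '!nontemporal' metadata can also be requested by hand from
source code: Clang documents a __builtin_nontemporal_store builtin for this
purpose. The following is only a minimal sketch of a STREAM-style triad
kernel, not what the Loop Vectorizer does. The function name triad is made up
for illustration, and the fallback branch is an ordinary store for compilers
that do not provide the builtin.

```c
#include <stddef.h>

/* Compilers without __has_builtin simply take the plain-store path. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical STREAM-style "triad" kernel: a[i] = b[i] + s * c[i]. */
void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double v = b[i] + s * c[i];
#if __has_builtin(__builtin_nontemporal_store)
        /* Clang: emits a store carrying !nontemporal metadata. */
        __builtin_nontemporal_store(v, &a[i]);
#else
        /* Fallback: regular (temporal) store. */
        a[i] = v;
#endif
    }
}
```

Whether the backend turns the hint into an actual streaming instruction
depends on the target and on the alignment of the access.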

To summarize, the following things would be needed to give non-temporal hints:
1) Ensure correct alignment of merged vector memory instructions
This could be implemented by executing the first (scalar) loop iterations
until the addresses for loads and stores are aligned, similar to what already
happens for the remainder of the loop. The larger alignment would also allow
aligned vector instructions instead of the currently unaligned ones.

2) Give non-temporal hints when different array elements are only used once
per loop iteration
We probably need to analyze the different loads and stores per loop iteration
for this...
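
Written by hand, step 1 could look roughly like the sketch below for a
STREAM-style scale kernel. It assumes x86 with SSE2 and 16-byte vectors; the
function name scale_nt is hypothetical. _mm_stream_pd is the SSE2
non-temporal store intrinsic and requires a 16-byte aligned address. Only the
store address is aligned by the prologue; the load stays unaligned, because
peeling iterations cannot in general align two arrays at once. Without SSE2
the code falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical STREAM-style "scale" kernel: a[i] = s * b[i]. */
void scale_nt(double *a, const double *b, double s, size_t n)
{
    size_t i = 0;

    /* Prologue: scalar iterations until the store address is
       16-byte aligned (or the array is exhausted). */
    while (i < n && ((uintptr_t)(a + i) % 16) != 0) {
        a[i] = s * b[i];
        ++i;
    }
    assert(i == n || ((uintptr_t)(a + i) % 16) == 0);

#if defined(__SSE2__)
    /* Vector body: unaligned load, aligned non-temporal store. */
    __m128d vs = _mm_set1_pd(s);
    for (; i + 2 <= n; i += 2) {
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_stream_pd(a + i, _mm_mul_pd(vs, vb));
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#endif

    /* Epilogue: scalar remainder, as the vectorizer already does. */
    for (; i < n; ++i)
        a[i] = s * b[i];
}
```

The prologue mirrors the existing remainder loop: it runs at most one scalar
iteration here, since a double array is already 8-byte aligned.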

Any thoughts or any ongoing work that I'm missing?


[1] https://www.cs.virginia.edu/stream/

Jonas Hahnfeld, MATSE-Auszubildender

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Hahnfeld at itc.rwth-aachen.de
