[LLVMdev] loop vectorizer
Frank Winter
fwinter at jlab.org
Wed Oct 30 12:57:50 PDT 2013
Hi Nadav,
We are looking at a variety of target architectures. Ultimately we aim
to run on BG/Q and Intel Xeon Phi (native). However, running on those
architectures with the LLVM technology is planned for the future. As a
first step we would target vanilla x86 with SSE/AVX 128/256 as a
proof-of-concept.
Most of our generated functions implement pure data-parallel operations
which suit vector instructions. There are of course some kernels that
require scatter/gather but I don't worry about those right now.
What I don't understand: How can the loop vectorizer be good on a small
vector size but not so good on a large one? (I guess this is what you
mean by calling a SIMD vector a 'small vector'.) Isn't this functionality
completely generic in the loop vectorizer and its algorithm doesn't care
about the actual 'width' of the vector?
Why did you bring up gather/scatter instructions? The test function
doesn't make use of them. What's the role of gather/scatter in the loop
vectorizer? I know one needs to insert/extract values to/from vectors in
order to use them for scalar operations. But in the case here, there are
no scalar operations. That's what I mean when I say these functions
implement purely data-parallel/vector operations.
Regarding whether we have other problems, that's the good news:
there are no other problems. Our application already runs (and is
correct) using the LLVM JIT'er. However, only with a data layout that's
not optimal for CPU architectures. In this case the functions get
vectorized, but the application performance suffers due to cache
thrashing. Now, applying an optimized data layout, which maximizes cache
line reuse, introduces the 'rem' and 'div' instructions mentioned
earlier, which seem to make the vectorizer fail (or perhaps the scalar
evolution analysis pass).
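
For illustration only, here is a minimal C sketch of the kind of index
arithmetic a blocked, cache-friendly layout can introduce. It is not the
test function from earlier in the thread: the names blocked_offset and
add_component, the block length INNER = 4, and the layout itself are
assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical inner block length (an assumption for this sketch). */
enum { INNER = 4 };

/* Map a linear site index i to an offset in a blocked layout where,
   for each group of INNER sites, all `ncomp` components are stored
   together, INNER contiguous values per component. */
static size_t blocked_offset(size_t i, size_t comp, size_t ncomp)
{
    return (i / INNER) * (INNER * ncomp)  /* start of the block      */
         + comp * INNER                   /* start of this component */
         + (i % INNER);                   /* position inside block   */
}

/* c = a + b for a single component. The operation is purely
   data-parallel, but the div/rem in the index obscures the fact that
   consecutive i touch consecutive memory within each block. */
void add_component(float *c, const float *a, const float *b,
                   size_t nsites, size_t comp, size_t ncomp)
{
    for (size_t i = 0; i < nsites; ++i) {
        size_t k = blocked_offset(i, comp, ncomp);
        c[k] = a[k] + b[k];
    }
}
```

Within one block the accesses are unit-stride, so in principle the loop
body maps directly onto vector loads and stores without any
scatter/gather; the open question is whether the analysis can see
through the division and remainder.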
Is there fundamental functionality missing in the auto vectorizer when
the target vector size increases to 512 bits (instead of 128 for
example)? And why?
What needs to be done (on a high level) in order to have the auto
vectorizer succeed on the test function as given earlier?
Frank
On 30/10/13 15:14, Nadav Rotem wrote:
> Hi Frank,
>
>>
>> To answer Nadav's question. This kind of loop is generated by a
>> scientific library and we are in the process of evaluating whether
>> LLVM can be used for this research project. The target architectures
>> will have (very wide) vector instructions and these loops are
>> performance-critical to the application. Thus it would be important
>> that these loops can make use of the vector units.
>
> Does your CPU have a good scatter/gather support ? It will be easy to
> add support for scatter/gather operations to the LLVM Loop-Vectorizer.
> The current design focuses on SIMD vectors and it probably does not
> have all of the features that are needed for wide-vector vectorization.
>
>> Right now it seems that LLVM cannot vectorize these loops. We might
>> have some time to look into this, but that's not certain yet. However,
>> high-level guidance from LLVM pros would be very useful.
>>
>> What is the usual way of requesting an improvement or feature? Is this
>> mailing list the central place to communicate?
>>
>
> You can open bugs on bugzilla, but I think that the best way to move
> forward is to continue the discussions on the mailing list. Are there
> other workloads that are important to you ? Are there any other
> problems that you ran into ?
>
> Thanks,
> Nadav