[LLVMdev] loop vectorizer

Wed Oct 30 12:57:50 PDT 2013

Hi Nadav,

We are looking at a variety of target architectures. Ultimately we aim 
to run on BG/Q and Intel Xeon Phi (native). However, running on those 
architectures with LLVM is planned for some point in the future. As a 
first step we would target vanilla x86 with SSE/AVX (128/256-bit) as a 
proof of concept.

Most of our generated functions implement pure data-parallel operations 
which suit vector instructions. There are of course some kernels that 
require scatter/gather but I don't worry about those right now.

What I don't understand: How can the loop vectorizer be good at a small 
vector size but not so good at a large one? (I guess this is what you 
mean by treating a SIMD vector as a 'small vector'.) Isn't this 
functionality completely generic in the loop vectorizer, so that its 
algorithm doesn't care about the actual 'width' of the vector?

Why did you bring up gather/scatter instructions? The test function 
doesn't make use of them. What's the role of gather/scatter in the loop 
vectorizer? I know one needs to insert/extract values to/from vectors in 
order to use them for scalar operations. But in the case here, there are 
no scalar operations. That's what I mean when I say these functions 
implement purely data-parallel/vector operations.
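
To make the distinction concrete, here is a minimal sketch in C. It is 
not the test function from this thread; the names scale_contiguous, 
scale_indexed, idx and src_len are made up for illustration. The first 
loop reads consecutive elements, so a vector load covers several 
iterations at once. The second loop reads through an index array, so a 
vectorized version would need a gather (or per-lane scalar loads plus 
inserts).

```c
#include <assert.h>
#include <stddef.h>

/* Consecutive access: element i of src feeds element i of dst.
   A vectorizer can cover several iterations with one wide load. */
void scale_contiguous(float *dst, const float *src, float s, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = s * src[i];
}

/* Indexed access: the source address depends on idx[i], which is only
   known at run time. Vectorizing this read needs a gather instruction
   or a sequence of scalar loads and vector inserts.
   idx and src_len are hypothetical names for this sketch. */
void scale_indexed(float *dst, const float *src, const size_t *idx,
                   size_t src_len, float s, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        assert(idx[i] < src_len); /* precondition: index in bounds */
        dst[i] = s * src[idx[i]];
    }
}
```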

Regarding whether we have other problems: that's the good news. There 
are no other problems. Our application already runs (and is correct) 
using the LLVM JIT, but only with a data layout that's not optimal for 
CPU architectures. In this case the functions get vectorized, but the 
application performance suffers due to cache thrashing. Applying an 
optimized data layout, which maximizes cache line reuse, introduces the 
'rem' and 'div' instructions mentioned earlier, which seem to make the 
vectorizer fail (or perhaps the scalar evolution analysis pass).
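
As a rough illustration of where such 'div' and 'rem' instructions come 
from, here is a sketch in C of a blocked layout. It is an assumption 
for illustration only, not our actual generated code: the constants 
INNER and NCOMP and the helper blocked_index are hypothetical. The 
address is no longer a simple linear function of the loop counter, 
which may be what keeps the analysis from recognizing the accesses as 
consecutive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical blocked layout: values are stored in blocks of INNER
   consecutive elements, and each block holds NCOMP components. */
enum { INNER = 4, NCOMP = 2 };

/* Maps (element i, component comp) to a flat offset.
   The division and remainder here become 'div' and 'rem' in the IR. */
static size_t blocked_index(size_t i, size_t comp) {
    return (i / INNER) * (INNER * NCOMP) + comp * INNER + (i % INNER);
}

/* Linear layout: the index is just the loop counter. */
void add_linear(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Blocked layout: adds component 0 and component 1 of each element.
   Within one block the reads are still consecutive, but that fact is
   hidden behind the div/rem index arithmetic. */
void add_blocked(float *dst, const float *src, size_t n) {
    assert(n % INNER == 0); /* sketch assumes whole blocks only */
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[blocked_index(i, 0)] + src[blocked_index(i, 1)];
}
```

Splitting the loop by hand into an outer loop over blocks and an inner 
loop over the INNER elements would remove the div/rem from the index, 
at the cost of changing the code generator.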

Is there fundamental functionality missing in the auto vectorizer when 
the target vector size increases to 512 bits (instead of 128 for 
example)? And why?

What needs to be done (at a high level) in order to have the auto 
vectorizer succeed on the test function as given earlier?

Frank

On 30/10/13 15:14, Nadav Rotem wrote:
> Hi Frank,
>
>>
>> To answer Nadav's question. This kind of loop is generated by a 
>> scientific library and we are in the process of evaluating whether 
>> LLVM can be used for this research project. The target architectures 
>> will have (very wide) vector instructions and these loops are 
>> performance-critical to the application. Thus it would be important 
>> that these loops can make use of the vector units.
>
> Does your CPU have good scatter/gather support? It will be easy to 
> add support for scatter/gather operations to the LLVM Loop-Vectorizer. 
> The current design focuses on SIMD vectors and it probably does not 
> have all of the features that are needed for wide-vector vectorization.
>
>> Right now as it seems LLVM cannot vectorize these loops. We might 
>> have some time to look into this, but it's not sure yet. However, 
>> high-level guidance from LLVM pros would be very useful.
>>
>> What is the usual way of requesting an improvement or feature? Is 
>> this mailing list the central place to communicate?
>>
>
> You can open bugs on bugzilla, but I think that the best way to move 
> forward is to continue the discussions on the mailing list. Are there 
> other workloads that are important to you? Are there any other 
> problems that you ran into?
>
> Thanks,
> Nadav