[LLVMdev] loop vectorizer

Hal Finkel hfinkel at anl.gov
Wed Oct 30 14:17:48 PDT 2013


----- Original Message -----
> Hi Frank,
> 
> > We are looking at a variety of target architectures. Ultimately we
> > aim to run on BG/Q and Intel Xeon Phi (native). However, running
> > on those architectures with the LLVM technology is planned in some
> > future. As a first step we would target vanilla x86 with SSE/AVX
> > 128/256 as a proof-of-concept.
> 
> Great! It should be easy to support these targets. When you said
> wide vectors, I assumed you meant old-school vector processors.
> Elena Demikhovsky is working on adding AVX512 support, and once she
> is done things should just work. We will need to support some of the
> new features of AVX512, such as predication and scatter/gather, to
> make the most out of this CPU.  I don't know too much about BG/Q, but
> maybe Hal can provide more info.

I'm glad to hear that you're interested in BG/Q support. Somewhat off topic, so briefly: I've not pushed QPX vector support upstream yet (and may not end up with time before 3.4 branches -- starting now is probably not prudent). The current full BG/Q patchset (source and relocatable binary RPMs) is available from http://trac.alcf.anl.gov/projects/llvm-bgq; if you're running at ALCF, it is already installed for you. When you're interested in starting BG/Q work, please follow up with me (either directly, or on our llvm-bgq-discuss list (see the link on the web page)) if you have any questions or suggestions.

 -Hal

> 
> > 
> > Most of our generated functions implement pure data-parallel
> > operations which suit vector instructions. There are of course
> > some kernels that require scatter/gather but I don't worry about
> > those right now.
> 
> > What I don't understand: How can the loop vectorizer be good on a
> > small vector size but not so good on a large one? (I guess this is
> > what you're saying with SIMD vector as a 'small vector'). Isn't
> > this functionality completely generic in the loop vectorizer and
> > its algorithm doesn't care about the actual 'width' of the vector?
> > Why did you bring up gather/scatter instructions? The test function
> > doesn't make use of them.
> 
> If scatter/gather were free (or low cost), then it could allow
> vectorization of many more loops, because the high cost of
> non-consecutive memory operations often prevents vectorization.
> 
> > What's the role of gather/scatter in the loop vectorizer?
> 
> Simply to load/store non-consecutive memory locations.
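To make this concrete, here is a small sketch (in C, not taken from the thread) of the kind of loop that needs a gather: the loads go through an index array, so consecutive iterations touch non-consecutive addresses, and a single wide vector load cannot fetch them. The function name `gather_kernel` and the loop shape are illustrative assumptions.

```c
#include <stddef.h>

/* Hypothetical kernel: each load is indirected through idx[], so the
   addresses read in consecutive iterations are non-consecutive.
   Vectorizing this loop requires a gather (a vector load from
   arbitrary addresses) rather than one contiguous wide load. */
void gather_kernel(float *out, const float *in, const int *idx, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[idx[i]];   /* in[idx[i]]: non-consecutive -> gather */
}
```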
> 
> > I know one needs to insert/extract values to/from vectors in order
> > to use them for scalar operations. But in the case here, there are
> > no scalar operations. That's what I mean with these functions
> > implement purely data-parallel/vector operations.
> > 
> > Regarding whether we have other problems, that's the good news:
> > there are no other problems. Our application already runs (and
> > is correct) using the LLVM JIT'er, but only with a data layout
> > that's not optimal for CPU architectures. In this case the
> > functions get vectorized, but application performance suffers
> > due to cache thrashing. Now, applying an optimized data
> > layout, which maximizes cache-line reuse, introduces the 'rem'
> > and 'div' instructions mentioned earlier, which seem to make the
> > vectorizer fail (or perhaps the scalar evolution analysis pass).
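As an illustration of the pattern described above (a sketch, not Frank's actual code): the optimized layout computes each element's position with 'div' and 'rem'. Even when the resulting address happens to be consecutive in i, scalar evolution generally cannot see through the division and remainder, so the loop is left scalar. `VLEN` and the indexing scheme are assumptions made for the example.

```c
#include <stddef.h>

#define VLEN 4   /* assumed blocking factor of the cache-friendly layout */

/* Hypothetical sketch of the problematic addressing: element i's
   position is computed with div/rem. Here blk * VLEN + lane equals i,
   so the accesses are in fact consecutive, but the div/rem hide this
   from the scalar evolution analysis, and the vectorizer gives up. */
void remdiv_kernel(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        size_t blk  = i / VLEN;   /* which block i falls into   */
        size_t lane = i % VLEN;   /* position within that block */
        out[blk * VLEN + lane] = a[blk * VLEN + lane] + b[blk * VLEN + lane];
    }
}
```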
> > 
> > Is there fundamental functionality missing in the auto vectorizer
> > when the target vector size increases to 512 bits (instead of 128
> > for example)? And why?
> > 
> 
> Scatter/Gather cost model (and possibly intrinsics), support for
> predicated instructions, AVX512 cost model.
> 
> > What needs to be done (on a high level) in order to have the auto
> > vectorizer succeed on the test function as given earlier?
> 
> Maybe you could rewrite the loop in a way that will expose contiguous
> memory accesses. Is this something you could do?
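One way such a rewrite might look (a sketch under the assumption that a blocking factor `VLEN` divides n): hoist the 'div' into an outer loop over blocks and let the inner loop run with unit stride, so the vectorizer sees contiguous accesses and no div/rem remains in the loop body.

```c
#include <stddef.h>

#define VLEN 4   /* assumed blocking factor; n must be a multiple of it */

/* Possible rewrite (illustrative, not from the thread): splitting the
   single loop into a nested pair removes the div/rem from the body.
   The inner loop over 'lane' has unit stride, which the loop
   vectorizer can handle directly. */
void rewritten_kernel(float *out, const float *a, const float *b, size_t n) {
    for (size_t blk = 0; blk < n / VLEN; ++blk) {
        size_t base = blk * VLEN;
        for (size_t lane = 0; lane < VLEN; ++lane)   /* unit stride */
            out[base + lane] = a[base + lane] + b[base + lane];
    }
}
```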
> 
> Thanks,
> Nadav
> 
> 
> > Frank
> > 
> > 
> > On 30/10/13 15:14, Nadav Rotem wrote:
> >> Hi Frank,
> >> 
> >>> 
> >>> To answer Nadav's question. This kind of loop is generated by a
> >>> scientific library and we are in the process of evaluating
> >>> whether LLVM can be used for this research project. The target
> >>> architectures will have (very wide) vector instructions and
> >>> these loops are performance-critical to the application. Thus it
> >>> would be important that these loops can make use of the vector
> >>> units.
> >> 
> >> Does your CPU have good scatter/gather support?  It will be
> >> easy to add support for scatter/gather operations to the LLVM
> >> Loop-Vectorizer.  The current design focuses on SIMD vectors and
> >> it probably does not have all of the features that are needed for
> >> wide-vector vectorization.
> >> 
> >>> Right now as it seems LLVM cannot vectorize these loops. We might
> >>> have some time to look into this, but it's not sure yet.
> >>> However, high-level guidance from LLVM pros would be very
> >>> useful.
> >>> 
> >>> What is the usual way of requesting an improvement/feature? Is
> >>> this mailing list the central place to communicate?
> >>> 
> >> 
> >> You can open bugs on bugzilla, but I think that the best way to
> >> move forward is to continue the discussions on the mailing list.
> >>  Are there other workloads that are important to you?  Are there
> >> any other problems that you ran into?
> >> 
> >> Thanks,
> >> Nadav
> > 
> > 
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



