[llvm-dev] [LLVMdev] LLVM loop vectorizer

Thu Sep 17 16:49:06 PDT 2015

Resending to new llvm-dev...

> On Sep 17, 2015, at 4:48 PM, Michael Zolotukhin <mzolotukhin at apple.com> wrote:
> 
> Hi Alex,
> 
>> On Sep 17, 2015, at 4:04 PM, Alex Susu <alex.e.susu at gmail.com <mailto:alex.e.susu at gmail.com>> wrote:
>> 
>>   Hello, Michael,
>>     Thank you for the answer.
>>     I think I found the reason why the code does not get vectorized at all (none of the loops get vectorized): LLVM's LoopVectorizer does NOT work well with the inner loop doing reduction on float, even if using -ffast-math.
> Yes, LLVM Vectorizer has some problems with float reductions, but there has been some work done on this recently (and more to come). Also, diagnostics are much better than they were before. So, I tried the following case (which is very similar to yours) with a recent compiler:
> $ cat red.c
> float foo(float *A, float *B, int *C, int N) {
>   int i = 0;
>   float r = 0.0;
>   for (i = 0; i < N; i++) {
>     r += A[i] * B[C[i]];
>   }
>   return r;
> }
> 
> It's not vectorized without '-ffast-math':
> $ bin/clang -O3 red.c -Rpass=loop-vectorize -S -Rpass-analysis=loop-vectorize
> red.c:6:7: remark: loop not vectorized: cannot prove it is safe to reorder floating-point operations; allow reordering by specifying '#pragma clang loop vectorize(enable)' before the loop or by providing the compiler option '-ffast-math'.
>       [-Rpass-analysis=loop-vectorize]
>     r += A[i] * B[C[i]];
>       ^
> 
> but even with '-ffast-math' the cost heuristic says that vectorization isn't beneficial:
> $ bin/clang -O3 red.c -Rpass=loop-vectorize -S -Rpass-analysis=loop-vectorize -ffast-math
> red.c:6:10: remark: the cost-model indicates that vectorization is not beneficial [-Rpass-analysis=loop-vectorize]
>     r += A[i] * B[C[i]];
>          ^
> red.c:6:10: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
> 
> When I overrode the cost model by adding the suggested pragma before the loop, I finally got the loop vectorized:
> $ bin/clang -O3 red.c -Rpass=loop-vectorize -S -Rpass-analysis=loop-vectorize
> red.c:6:10: remark: vectorized loop (vectorization width: 2, interleaved count: 2) [-Rpass=loop-vectorize]
>     r += A[i] * B[C[i]];
>          ^
> 
> BTW, I don't expect much gains from vectorization of this particular loop. Expression B[C[i]] requires a lot of auxiliary instructions in vector version, so it mitigates most of the gains from vectorization.
> 
> 
>>     I attach sample code that shows this - however, the float NON-vectorization is not always happening - mabye it's some memory corruption in the LLVM LoopVectorizer, which sometimes results in a bad state.
> Do you mean that sometimes you got the float loop vectorized? If that’s so, it sounds really strange..
> 
>> 
>>     Let me know if I can provide more info.
>>     I'd like to mention I'm using the LLVM built from the repository - downloaded on Jul 14th, 2015.
> If you really want to vectorize this loop, I'd suggest updating LLVM and using pragma. If you have any questions, I'd be glad to help with them, if I can.
> 
> Best regards,
> Michael
> 
>> 
>> 
>>   Best regards,
>>     Alex
>> 
>> 
>> 
>> On Wed, Jul 8, 2015 at 9:17 PM, Michael Zolotukhin <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote:
>> Hi Alex,
>> 
>> Example from the link you provided looks like this:
>> for (i=0; i<M; i++ ){ 
>>     z[i]=0;
>>     for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
>>       z[i] += data[ckey]*x[colind[ckey]];
>>     }              
>>   }
>> Is it the loop you are trying to vectorize? I don’t see any ‘if’ inside the innermost loop.
>> 
>> But anyway, here vectorizer might have following troubles:
>> 1) iteration count of the innermost loop is unknown.
>> 2) Gather accesses ( a[b[i]] ). With AVX512 set of instructions it’s possible to generate efficient code for such case, but a) I think it’s not supported yet, b) if this ISA isn’t available, then vectorized code would need to ‘manually’ gather scalar values to vector, which might be slow (and thus, vectorizer might decide to leave the code scalar).
>> 
>> And here is a list of papers vectorizer is based on:
>> // The reduction-variable vectorization is based on the paper:
>> //  D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.
>> //
>> // Variable uniformity checks are inspired by:
>> //  Karrenberg, R. and Hack, S. Whole Function Vectorization.
>> //
>> // The interleaved access vectorization is based on the paper:
>> //  Dorit Nuzman, Ira Rosen and Ayal Zaks.  Auto-Vectorization of Interleaved
>> //  Data for SIMD
>> //
>> // Other ideas/concepts are from:
>> //  A. Zaks and D. Nuzman. Autovectorization in GCC-two years later.
>> //
>> //  S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua.  An Evaluation of
>> //  Vectorizing Compilers.
>> And probably, some of the parts are written from scratch with no reference to a paper.
>> 
>> The presentations you found are a good starting point, but while they’re still good from getting basics of the vectorizer, they are a bit outdated now in a sense that a lot of new features has been added since then (and bugs fixed:) ). Also, I’d recommend trying a newer LLVM version - I don’t think it’ll handle the example above, but it would be much more convenient to investigate why the loop isn’t vectorized and fix vectorizer if we figure out how.
>> 
>> Best regards,
>> Michael
>> 
>>> On Jul 8, 2015, at 10:01 AM, RCU <alex.e.susu at gmail.com <mailto:alex.e.susu at gmail.com>> wrote:
>>> 
>>>  Hello.
>>>    I am trying to vectorize a CSR SpMV (sparse matrix vector multiplication) procedure but the LLVM loop vectorizer is not able to handle such code.
>>>    I am using cland and llvm version 3.4 (on Ubuntu 12.10). I use the -fvectorize option with clang and -loop-vectorize with opt-3.4 .
>>>    The CSR SpMV function is inspired from http://stackoverflow.com/questions/13636464/slow-sparse-matrix-vector-product-csr-using-open-mp <http://stackoverflow.com/questions/13636464/slow-sparse-matrix-vector-product-csr-using-open-mp> (I can provide the exact code samples used).
>>> 
>>>    Basically the problem is the loop vectorizer does NOT work with if inside loop (be it 2 nested loops or a modification of SpMV I did with just 1 loop - I can provide the exact code) changing the value of the accumulator z. I can sort of understand why LLVM isn't able to vectorize the code.
>>>    However, at http://llvm.org/docs/Vectorizers.html#if-conversion <http://llvm.org/docs/Vectorizers.html#if-conversion> it is written:
>>>            <<The Loop Vectorizer is able to "flatten" the IF statement in the code and generate a single stream of instructions.
>>>              The Loop Vectorizer supports any control flow in the innermost loop.
>>>              The innermost loop may contain complex nesting of IFs, ELSEs and even GOTOs.>>
>>>     Could you please tell me what are these lines exactly trying to say.
>>> 
>>>    Could you please tell me what algorithm is the LLVM loop vectorizer using (maybe the algorithm is described in a paper) - I currently found only 2 presentations on this topic: http://llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf <http://llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf> and https://archive.fosdem.org/2014/schedule/event/llvmautovec/attachments/audio/321/export/events/attachments/llvmautovec/audio/321/AutoVectorizationLLVM.pdf <https://urldefense.proofpoint.com/v2/url?u=https-3A__archive.fosdem.org_2014_schedule_event_llvmautovec_attachments_audio_321_export_events_attachments_llvmautovec_audio_321_AutoVectorizationLLVM.pdf&d=BQMFaQ&c=eEvniauFctOgLOKGJOplqw&r=ygVmcuuQ1MUhRUoJm-IgPtgjmvM0byfjlHDg99vufEI&m=wB_Hvsma5X84froglc2I4UFz1HGlCGpZpM7nxKKO_B8&s=pi6FRG39lC4JeMV5Onu3CZkX34HTM_85dUhLBBr4dKI&e=>.
>>> 
>>>  Thank you very much,
>>>      Alex
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>         http://llvm.cs.uiuc.edu <https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.cs.uiuc.edu&d=BQMFaQ&c=eEvniauFctOgLOKGJOplqw&r=ygVmcuuQ1MUhRUoJm-IgPtgjmvM0byfjlHDg99vufEI&m=wB_Hvsma5X84froglc2I4UFz1HGlCGpZpM7nxKKO_B8&s=FRPFoePA5qDz4sI9_rCux2_hmquZMsQkZdiz5BsCUmI&e=>
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.cs.uiuc.edu_mailman_listinfo_llvmdev&d=BQMFaQ&c=eEvniauFctOgLOKGJOplqw&r=ygVmcuuQ1MUhRUoJm-IgPtgjmvM0byfjlHDg99vufEI&m=wB_Hvsma5X84froglc2I4UFz1HGlCGpZpM7nxKKO_B8&s=Dd31y32iLFcIPCQ4SwWGWfg1Fc0g5dGL6xRRJxnXaIY&e=>
>> 
>> 
>> <test_case_LoopVectorizer.zip>
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150917/ad063e8a/attachment.html>