[llvm-dev] [LLVMdev] LLVM loop vectorizer - changing vectorized code

Alex Susu via llvm-dev llvm-dev at lists.llvm.org
Tue Jun 21 02:30:23 PDT 2016


   Hi, Mikhail.
     Please see answers embedded in the text below.


On 6/14/2016 2:42 AM, Mikhail Zolotukhin wrote:
> Hi Alex,
>
>> On Jun 13, 2016, at 12:22 PM, Alex Susu <alex.e.susu at gmail.com> wrote:
>>
>> Hello, Mikhail. I'm planning to do source-to-source transformation for loop
>> vectorization.
> Could you please share your reasoning on why you need to do it source-to-source? While
> I recognize that there might be external reasons to do it, I do think that working on
> IR is much easier.

     I'm currently implementing a back end for BPF (the Berkeley Packet Filter) plus the 
Connex SIMD processor (register allocator, codegen, etc.). Note that Connex is a research 
processor.
     Part of this work also required us to work with LoopVectorize.cpp, since we need to 
add memory transfers at vector loads and stores, specific branch instructions, etc.
In principle, however, we would also like to generate readable C/C++ code rather than 
assembly, if possible, since we have developed an assembly library written in C++. The 
most important reason is that it lets me generate a single C++ program targeting both the 
CPU (which can be BPF, but also ARM, x86, etc.) and the Connex SIMD processor.

>> Basically I want to generate C (C++) code from C (C++) source code: - the code that
>> is not vectorized remains the same - this would be simple to achieve if we can obtain
>> precisely the source location of each statement;
> If you work completely in front-end, without generating IR, then yes, it's probably
> true. But the most complicated part though would be to check if vectorization is
     There is a project performing vectorization in the front-end: 
https://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/scout/publications . 
It performs loop unroll-and-jam for strip-mining, loop collapsing 
(http://www.t-systems-sfr.com/e/downloads/2010/vortraege/1Krzikalla.pdf), and what they 
call if-collection and register blocking.

> legal. Even in IR it's not a trivial task - if you want the same level of
> error-proofness as we have now, I'm afraid you'll end with just another IR internal to
> your transformation.
     Indeed, vectorization in the front-end requires implementing quite a few loop 
transformations that are already available at the LLVM IR level. But they should be 
simpler to implement, since they only need to target vectorization.
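     As an example of one such transformation, here is a minimal sketch (my own, on a toy 
nest) of loop collapsing/coalescing, which merges a perfect nest into a single loop and 
exposes a longer trip count to the vectorizer:
     /* Original nest */
     for (int i = 0; i < M; i++)
         for (int j = 0; j < N; j++)
             c[i][j] = a[i][j] + b[i][j];
     /* Collapsed form (sketch): one loop over M * N iterations */
     for (int k = 0; k < M * N; k++) {
         int i = k / N;
         int j = k % N;
         c[i][j] = a[i][j] + b[i][j];
     }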

> For one, think about how would you handle memory aliasing.


> If you do lower to IR first, then there is no "the code that is not vectorized remains
> the same" - it's already mutated by previous passes anyway. E.g. what if the loop was
> distributed/unrolled before vectorization?

     This is an interesting detail I had not considered.
     If the loop is unrolled essentially to perform strip-mining (and to reduce the 
control-flow overhead, as I can see in the code generated by LoopVectorize.cpp), then I 
think I still have a chance of doing what I described above, since unrolling does not 
change the rest of the code.
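     For reference, a minimal sketch (mine, not LoopVectorize.cpp output) of what I mean 
by strip-mining a simple loop by a vector factor VF:
     /* Sketch only: the inner loop of VF iterations is the part that becomes
        straight-line vector code; the last loop handles the remaining N % VF
        iterations in scalar form. */
     #define VF 16
     int i = 0;
     for (; i + VF <= N; i += VF)
         for (int j = 0; j < VF; j++)          /* candidate for vectorization */
             c[i + j] = a[i + j] + b[i + j];
     for (; i < N; i++)                        /* scalar epilogue */
         c[i] = a[i] + b[i];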
     However, for loop distribution (fission) the situation is more complicated - I need 
to understand your LoopVectorize.cpp better - but, as I understand it, a loop can contain 
several patterns that could be vectorized independently, and possibly also some parts 
that are not vectorizable (e.g., computing Fibonacci numbers). Something like:
     for (i = 2; i < N; i++) {
         c[i] = a[i] + b[i];
         reduct[i] += a[i];
         fib[i] = fib[i - 1] + fib[i - 2];
     }
     Clearly, this loop can benefit from loop fission. I have checked this piece of code 
with LLVM and it does NOT get vectorized, because of the fib computation and because 
LoopVectorize.cpp does not perform loop fission. If we take out the fib computation, it 
does get vectorized.
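     For illustration, a hand-written sketch (not something LoopVectorize.cpp produces) of 
how fission could split the loop above, leaving the fib recurrence in its own scalar loop:
     /* Sketch only: after manual fission. The first loop has no loop-carried
        dependence, so it is a candidate for vectorization; the second loop
        carries the fib recurrence and stays scalar. */
     for (i = 2; i < N; i++) {
         c[i] = a[i] + b[i];
         reduct[i] += a[i];
     }
     for (i = 2; i < N; i++)
         fib[i] = fib[i - 1] + fib[i - 2];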
     So, I could still experiment carefully with C/C++ code generation from 
LoopVectorize.cpp while I look for a better solution.

     What other transformations would be beneficial for vectorization? I guess loop 
blocking, software pipelining, loop interchange, etc.
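     For instance, loop interchange can turn a non-unit-stride innermost loop into a 
unit-stride one, which is much friendlier to the vectorizer. A small sketch (row-major C):
     /* Before: the innermost loop walks a column, i.e. with stride N */
     for (int j = 0; j < N; j++)
         for (int i = 0; i < M; i++)
             b[i][j] = 2 * a[i][j];
     /* After interchange: the innermost loop is contiguous in memory */
     for (int i = 0; i < M; i++)
         for (int j = 0; j < N; j++)
             b[i][j] = 2 * a[i][j];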




>> - the code that gets vectorized I want to translate in C code the parts that are
>> sequential and generate SIMD intrinsics for my SIMD processor where normally it would
>> generate vector instructions. I started looking at InnerLoopVectorizer::vectorize()
>> and InnerLoopVectorizer::createEmptyLoop(). Not generating LLVM code but C/C++ code
>> (with the help of LLVM intrinsics) is not trivial, but it should be reasonably simple
>> to achieve.
> What you suggest here is like writing a C backend and teach it to generate intrinsics
> for vector code (such backend existed some time ago btw). It should be doable, but I
> wouldn't call it simple:)

     I thought of using a C back end for LLVM (such as 
https://github.com/JuliaComputing/llvm-cbe or 
https://github.com/draperlaboratory/llvm-cbe), but I did not pursue it because I thought 
it was no longer maintained; actually, it seems it is.
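     As a rough idea of the kind of C output I have in mind, here is a sketch for the 
vectorizable part of c[i] = a[i] + b[i]. It uses the Clang/GCC vector-extension syntax 
purely as a stand-in for the calls into our Connex C++ library that we would actually 
emit; names and types are illustrative only:
     /* Sketch only; assumes int arrays a, b, c of length N with suitable alignment. */
     typedef int v8si __attribute__((vector_size(8 * sizeof(int))));
     int i = 0;
     for (; i + 8 <= N; i += 8) {
         v8si va = *(v8si *)&a[i];   /* vector load (would become memory transfer + vector load) */
         v8si vb = *(v8si *)&b[i];
         *(v8si *)&c[i] = va + vb;   /* vector add */
     }
     for (; i < N; i++)              /* scalar remainder */
         c[i] = a[i] + b[i];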

   Thank you,
      Alex


> Thanks, Michael
>>
>> Would you advise for such an operation as the one described above?  I guess doing
>> this as a Clang phase (working on the source code) is not really a bad idea either,
>> since I would have better control on source code, but I would need to reimplement the
>> loop vectorizer algorithm that is currently implemented on LLVM code.
>>
>> Thank you, Alex
>>
>> On 6/4/2016 4:28 AM, Mikhail Zolotukhin wrote:
>>> Hi Alex,
>>>
>>> I think the changes you want are actually not vectorizer related. Vectorizer just
>>> uses data provided by other passes.
>>>
>>> What you probably might want is to look into routine Loop::getStartLoc() (see
>>> lib/Analysis/LoopInfo.cpp). If you find a way to improve it, patches are welcome:)
>>>
>>> Thanks, Michael
>>>
>>>> On Jun 3, 2016, at 6:13 PM, Alex Susu <alex.e.susu at gmail.com> wrote:
>>>>
>>>> Hello. Mikhail, I come back to this older thread. I need to do a few changes to
>>>> LoopVectorize.cpp.
>>>>
>>>> One of them is related to figuring out the exact C source line and column number
>>>> of the loops being vectorized. I've noticed that a recent version of
>>>> LoopVectorize.cpp prints imprecise debug info for vectorized loops such as, for
>>>> example, the location of a character of an assignment statement inside the
>>>> respective loop. It would help me a lot in my project to find the exact C source
>>>> line and column number of the first and last character of the loop being
>>>> vectorized. (imprecise location would make my life more complicated). Is this
>>>> feasible? Or are there limitations at the level of clang of retrieving the exact
>>>> C source line and column number location of the beginning and end of a loop (it
>>>> can include indent chars before and after the loop)? (I've seen other examples
>>>> with imprecise location such as the "Reading diagnostics" chapter in the book
>>>> https://books.google.ro/books?isbn=1782166939 .)
>>>>
>>>> Note: to be able to retrieve the debug info from the C source file we require to
>>>> run clang with -Rpass* options, as discussed before. Otherwise, if we run clang
>>>> first, then opt on the resulting .ll file which runs LoopVectorize, we lose the C
>>>> source file debug info (DebugLoc class, etc) and obtain the debug info from the
>>>> .ll file. An example: clang -O3 3better.c -arch=mips -ffast-math -Rpass=debug
>>>> -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -S -emit-llvm -fvectorize
>>>> -mllvm -debug -mllvm -force-vector-width=16 -save-temps
>>>>
>>>> Thank you, Alex
>>>>
>>>>
>>>>
>>>> On 2/18/2016 2:17 AM, Mikhail Zolotukhin wrote:
>>>>> Hi Alex,
>>>>>
>>>>> I'm not aware of efforts on loop coalescing in LLVM, but probably polly can do
>>>>> something like this. Also, one related thought: it might be worth making it a
>>>>> separate pass, not a part of loop vectorizer. LLVM already has several
>>>>> 'utility' passes (e.g. loop rotation), which primarily aims at enabling other
>>>>> passes.
>>>>>
>>>>> Thanks, Michael
>>>>>
>>>>>> On Feb 15, 2016, at 6:44 AM, RCU <alex.e.susu at gmail.com
>>>>>> <mailto:alex.e.susu at gmail.com>> wrote:
>>>>>>
>>>>>> Hello, Michael. I come back to this older email. Sorry if you receive it
>>>>>> again.
>>>>>>
>>>>>> I am trying to implement coalescing/collapsing of nested loops. This would
>>>>>> be clearly beneficial for the loop vectorizer, also. I'm normally planning to
>>>>>> start modifying the LLVM loop vectorizer to add loop coalescing of the LLVM
>>>>>> language.
>>>>>>
>>>>>> Are you aware of a similar effort on loop coalescing in LLVM (maybe even a
>>>>>> different LLVM pass, not related to the LLVM loop vectorizer)?
>>>>>>
>>>>>> Thank you, Alex
>>>>>>
>>>>>> On 7/9/2015 10:38 AM, RCU wrote:
>>>>>>>
>>>>>>>
>>>>>>> With best regards, Alex Susu
>>>>>>>
>>>>>>> On 7/8/2015 9:17 PM, Michael Zolotukhin wrote:
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> Example from the link you provided looks like this:
>>>>>>>>
>>>>>>>> for (i = 0; i < M; i++) {
>>>>>>>>     z[i] = 0;
>>>>>>>>     for (ckey = row_ptr[i]; ckey < row_ptr[i+1]; ckey++) {
>>>>>>>>         z[i] += data[ckey] * x[colind[ckey]];
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Is it the loop you are trying to vectorize? I don’t see any ‘if’ inside
>>>>>>>> the innermost loop.
>>>>>>> I tried to simplify this code in the hope the loop vectorizer can take care
>>>>>>> of it better: I linearized...
>>>>>>>
>>>>>>>> But anyway, here vectorizer might have following troubles: 1) iteration
>>>>>>>> count of the innermost loop is unknown. 2) Gather accesses ( a[b[i]] ).
>>>>>>>> With AVX512 set of instructions it’s possible to generate efficient code
>>>>>>>> for such case, but a) I think it’s not supported yet, b) if this ISA
>>>>>>>> isn’t available, then vectorized code would need to ‘manually’ gather
>>>>>>>> scalar values to vector, which might be slow (and thus, vectorizer might
>>>>>>>> decide to leave the code scalar).
>>>>>>>>
>>>>>>>> And here is a list of papers vectorizer is based on:
>>>>>>>> // The reduction-variable vectorization is based on the paper:
>>>>>>>> //  D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.
>>>>>>>> //
>>>>>>>> // Variable uniformity checks are inspired by:
>>>>>>>> //  Karrenberg, R. and Hack, S. Whole Function Vectorization.
>>>>>>>> //
>>>>>>>> // The interleaved access vectorization is based on the paper:
>>>>>>>> //  Dorit Nuzman, Ira Rosen and Ayal Zaks. Auto-Vectorization of Interleaved
>>>>>>>> //  Data for SIMD
>>>>>>>> //
>>>>>>>> // Other ideas/concepts are from:
>>>>>>>> //  A. Zaks and D. Nuzman. Autovectorization in GCC-two years later.
>>>>>>>> //
>>>>>>>> //  S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of
>>>>>>>> //  Vectorizing Compilers.
>>>>>>>> And probably, some of the parts are written from scratch with no reference
>>>>>>>> to a paper.
>>>>>>>>
>>>>>>>> The presentations you found are a good starting point, but while they’re
>>>>>>>> still good from getting basics of the vectorizer, they are a bit outdated
>>>>>>>> now in a sense that a lot of new features has been added since then (and
>>>>>>>> bugs fixed:) ). Also, I’d recommend trying a newer LLVM version - I don’t
>>>>>>>> think it’ll handle the example above, but it would be much more
>>>>>>>> convenient to investigate why the loop isn’t vectorized and fix
>>>>>>>> vectorizer if we figure out how.
>>>>>>>>
>>>>>>>> Best regards, Michael
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for the papers - these appear to be written in the header of the
>>>>>>> file implementing the loop vect. transformation (found at
>>>>>>> "where-you-want-llvm-to-live"/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp).
>>>>>>>
>>>>>>>>> On Jul 8, 2015, at 10:01 AM, RCU <alex.e.susu at gmail.com
>>>>>>>>> <mailto:alex.e.susu at gmail.com><mailto:alex.e.susu at gmail.com>> wrote:
>>>>>>>>>
>>>>>>>>> Hello. I am trying to vectorize a CSR SpMV (sparse matrix vector
>>>>>>>>> multiplication) procedure but the LLVM loop vectorizer is not able to
>>>>>>>>> handle such code. I am using cland and llvm version 3.4 (on Ubuntu
>>>>>>>>> 12.10). I use the -fvectorize option with clang and -loop-vectorize
>>>>>>>>> with opt-3.4 . The CSR SpMV function is inspired from
>>>>>>>>> http://stackoverflow.com/questions/13636464/slow-sparse-matrix-vector-product-csr-using-open-mp
>>>>>>>>> (I can provide the exact code samples used).
>>>>>>>>>
>>>>>>>>> Basically the problem is the loop vectorizer does NOT work with if
>>>>>>>>> inside loop (be it 2 nested loops or a modification of SpMV I did with
>>>>>>>>> just 1 loop - I can provide the exact code) changing the value of the
>>>>>>>>> accumulator z. I can sort of understand why LLVM isn't able to
>>>>>>>>> vectorize the code. However,
>>>>>>>>> at http://llvm.org/docs/Vectorizers.html#if-conversion it is written:
>>>>>>>>> <<The Loop Vectorizer is able to "flatten" the IF statement in the code
>>>>>>>>> and generate a single stream of instructions. The Loop Vectorizer
>>>>>>>>> supports any control flow in the innermost loop. The innermost loop may
>>>>>>>>> contain complex nesting of IFs, ELSEs and even GOTOs.>> Could you
>>>>>>>>> please tell me what are these lines exactly trying to say.
>>>>>>>>>
>>>>>>>>> Could you please tell me what algorithm is the LLVM loop vectorizer
>>>>>>>>> using (maybe the algorithm is described in a paper) - I currently found
>>>>>>>>> only 2 presentations on this
>>>>>>>>> topic: http://llvm.org/devmtg/2013-11/slides/Rotem-Vectorization.pdf and
>>>>>>>>> https://archive.fosdem.org/2014/schedule/event/llvmautovec/attachments/audio/321/export/events/attachments/llvmautovec/audio/321/AutoVectorizationLLVM.pdf .
>>>>>>>>>
>>>>>>>>> Thank you very much, Alex
>>>>>
>>>
>>>
>
>

