[llvm-dev] Document to understand vectorized code

Mon Jun 14 06:16:36 PDT 2021

Dear Stefanos,

Sorry for getting back to you so late. Can you help with some document
(other than language reference manual) the purpose of using shufflevector ?

Thanks,
Sudakshina

On Tue, Jun 8, 2021 at 5:44 PM Sudakshina Dutta <sudakshina at iitgoa.ac.in>
wrote:

> Dear Stefanos,
>
> I shall go through this thoroughly and let you know.
>
> Thanks a lot.
> Sudakshina
>
> On Mon, Jun 7, 2021 at 10:57 PM Stefanos Baziotis <
> stefanos.baziotis at gmail.com> wrote:
>
>> I think that you're interested in a more general tutorial on how to
>> vectorize code. I'm sorry I'm going to plug in my own articles [1] [2], but
>> I've just tried to write a short version here and I basically almost
>> rewrote them so there was
>> no point. For your case, I'd read at least the first half of "Residual
>> Iterations" from Part 1 and most of part 2. The concept of residual
>> iterations and reductions is something you'll see over and over and it
>> happens
>> in your example too. Let me explain the use of shufflevector pretty quick
>> and then I'll explain some specifics in your example. Also, before we move
>> any further, I recommend that we turn off loop unrolling in your
>> example because it doesn't add educational value and just clutters the
>> code, so here's your updated version [3].
>>
>> --- Shufflevector ---
>>
>> If you are not sure whether you understand what the instruction alone
>> does, let me know. You can also play with this LLVM IR [4]. I recommend
>> that you _run_ this code and change the mask in line 24.
>> The reason it's used in your example (see [3]) is because the compiler
>> wants to create a vector which has `arr[0]` (referring to the C++ entity)
>> in all its lanes. In line 4 (see [2]) it's loaded into %0, in line 17 it
>> inserts
>> it only to the 0th (first) lane of a vector (and for the time being, we
>> don't care about the other lanes which is why the first arg to
>> `insertelement` is `poison`) and in line 18, it shuffles the value. The
>> shuffle mask is the vector [0, 0, 0, 0]
>> (in short, zeroinitializer) which means in the 0th lane of the _result_
>> vector (%minmax.ident.splat) insert the value of the 0th lane of the
>> _input_ vector (%minmax.ident.splatinsert). In the 1st lane of the _result_
>> vector, insert the value of the 0th lane of the _input_ vector etc. So,
>> all lanes of the result vector get the 0th lane of the input vector.
>>
>> --- Residual Iterations ---
>>
>> Assuming that you have understood the use of residual iterations, let's
>> see where it happens in your example. First of all, notice that in `entry`,
>> there is a check whether `n` is bigger than 1, because if it's not, there
>> are no
>> iterations to do at all and in this case, the result, i.e., `max`, is
>> `arr[0]`, which is why it's loaded so early. Check the phi at line 45.
>> Then, if we make it to `for.body.preheader`, there is a check whether we
>> have at least
>> 4 iterations (the fact that this loop starts from one makes the
>> computations somewhat more complicated but not a lot). Because if we don't,
>> then there's no point going into the vectorized version. Then, skipping
>> what the vectorized
>> code does exactly for now (but check how the iteration count is rounded
>> down to the nearest multiple of 4 in line 15), check that _after_ the
>> vectorized code, the flow moves to `middle.block`,
>> which essentially checks whether there are any more iterations to do and
>> if so, jumps to the serial version of the loop (this is where the residual
>> iterations happen).
>>
>> --- Reduction ---
>>
>> In the article, I explain reductions using accumulation as an example,
>> but in your example there's another reduction, namely, finding the maximum
>> element. Notice that this is also a reduction: new_value =
>> max_of(old_value, data).
>> The reduction variable (acting as both old and new value) is `max` in
>> your example. The operation is max_of and the continuously new data is
>> arr[i].
>>
>> Finding the maximum element is a commutative operation. For instance, to
>> find the maximum element among 4, you can split them in two pairs _any_ way
>> you want, find the maximum of each pair _independently_
>> of the other pair (those two maximums are partial results and since their
>> computation is independent of each other, it can be done in parallel), and
>> finally find the maximum among these two maximums.
>>
>> What happens in the vectorized code is that each lane finds a partial
>> maximum. In line 27 it loads 4 elements and in the next two lines, it
>> "compares" them with the previous 4 elements (%vec.phi, which keeps the 4
>> partial maximums which are updated
>> in every iteration). It computes a vector which has 4 maximums, one for
>> each pair of lanes (between the new 4 elements in %wide.load, in
>> %wide.load, and %vec.phi). You might want to run an example in paper to see
>> the effect. Consider that you want to compute
>> the maximum of 9 numbers, in this order: 2, 10, 7, 3, 1, 8, 9, 11, 4
>>
>> First, you get 2 (arr[0]) and broadcast it in a vector: <2, 2, 2, 2>.
>> This is the starting value of %vec.phi. Then, you load 4 values: %wide.load
>> = <10, 7, 3, 1>
>> Then, you do an element-wise comparison of the two and for each pair of
>> lanes, you keep the maximum. In this case, 10 > 2, 7 > 2, 3 > 2 but 1 < 2.
>> So, %5 = <10, 7, 3, 2>, which becomes the new %vec.phi.
>> Next iteration, you load another 4: %wide.load = <8, 9, 11, 4> and you
>> compare with (the new) %vec.phi: 8 < 10, 9 > 7, 11 > 3, 4 > 2. So, %5 =
>> <10, 9, 11, 4>. Now, you have computed 4 partial maximums
>> and you only need to find the maximum among those 4 to find the total
>> maximum. That happens in line 35 and you're done.
>>
>> Hope this helps! Let me know if there are any questions.
>>
>> Best,
>> Stefanos
>>
>> [1]
>> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-1.html
>> [2]
>> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-2.html
>> [3] https://godbolt.org/z/nMe9a6YWa
>> [4] https://godbolt.org/z/Wf9fTq53G
>>
>> Στις Δευ, 7 Ιουν 2021 στις 8:49 π.μ., ο/η Craig Topper <
>> craig.topper at gmail.com> έγραψε:
>>
>>> Shufflevector is a powerful instruction that can be used to copy a
>>> scalar to every elements of a vector, concatenate smaller vectors into a
>>> larger vector, rearrange elements one or two vectors, etc. Without example
>>> code I can't say for sure what the use of shufflevector in your case is for.
>>>
>>> ~Craig
>>>
>>>
>>> On Sun, Jun 6, 2021 at 10:44 PM Sudakshina Dutta <
>>> sudakshina at iitgoa.ac.in> wrote:
>>>
>>>> Dear Craig,
>>>>
>>>> Thanks for your reply. It seems the control flow of the vectorized
>>>> program is very different from that of the source code. Is there any
>>>> document describing difference in control flow, the reason for using
>>>> shufflevector, etc ?
>>>>
>>>> Sudakshina
>>>>
>>>> On Mon, 7 Jun 2021, 07:41 Craig Topper, <craig.topper at gmail.com> wrote:
>>>>
>>>>> There's some notes here https://llvm.org/docs/Vectorizers.html but I
>>>>> didn't look too closely at it.  If you compile with -fno-discard-values
>>>>> names or use a debug build, some of the basic blocks will be named similar
>>>>> to the CFG diagram in this section
>>>>> https://llvm.org/docs/Vectorizers.html#epilogue-vectorization
>>>>>
>>>>> ~Craig
>>>>>
>>>>>
>>>>> On Sun, Jun 6, 2021 at 7:04 PM Sudakshina Dutta via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> Dear Stefanos,
>>>>>>
>>>>>> I want to understand how the generated code works. For example, the
>>>>>> code that I generated using -Rpass=loop-vectorize
>>>>>> -Rpass-analysis=loop-vectorize has many shufflevector and other
>>>>>> instructions. I wanted to have a document (if there exists any) where an
>>>>>> example on loop-vectorization is done with some explanation on the workflow
>>>>>> of the code.
>>>>>>
>>>>>> Thanks,
>>>>>> Sudakshina
>>>>>>
>>>>>> On Mon, Jun 7, 2021 at 7:03 AM Stefanos Baziotis <
>>>>>> stefanos.baziotis at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sudakshina,
>>>>>>>
>>>>>>> First, it helps if you can put your code in a godbolt snippet, like
>>>>>>> this [1]. It helps people in multiple ways (e.g., they don't have to
>>>>>>> download files, they can see exactly what cmd arguments you used, they can
>>>>>>> tweak the cmd arguments without having LLVM on their machine etc.).
>>>>>>>
>>>>>>> Is there any comprehensive tutorial/document to understand generated
>>>>>>>> instructions or the semantics of the vectorized code ?
>>>>>>>
>>>>>>>
>>>>>>> This is quite generic, what is more specifically that you want to
>>>>>>> understand? Do you want to understand what each individual instruction
>>>>>>> does? Do you maybe understand that but
>>>>>>> you don't know what is the general method to generate, let's say by
>>>>>>> hand, vectorized code (or more specifically, branching vectorized code). Or
>>>>>>> maybe, you want to understand
>>>>>>> how _LLVM_ generates this code, i.e., the inner workings of the
>>>>>>> vectorization passes.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefanos
>>>>>>>
>>>>>>> [1] https://godbolt.org/z/8eKqnrMPn
>>>>>>>
>>>>>>> Στις Κυρ, 6 Ιουν 2021 στις 6:17 π.μ., ο/η Sudakshina Dutta via
>>>>>>> llvm-dev <llvm-dev at lists.llvm.org> έγραψε:
>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> Greetings. I have generated a vectorized code from a C source file
>>>>>>>> (attached). Is there any comprehensive tutorial/document to understand
>>>>>>>> generated instructions or the semantics of the vectorized code ?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sudakshina
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210614/e9011549/attachment.html>