[llvm-dev] Document to understand vectorized code

Sudakshina Dutta via llvm-dev llvm-dev at lists.llvm.org
Tue Jun 15 18:40:31 PDT 2021


Dear Stefanos,

I understood it.

Thanks a lot for your help.
Sudakshina

On Mon, Jun 14, 2021 at 7:26 PM Stefanos Baziotis <
stefanos.baziotis at gmail.com> wrote:

> Hi Sudakshina,
>
> I'm not sure what you're looking for. Assuming that you have understood
> _what_ the instruction does, the _purpose_ of using it in this specific
> case is what I explained above; broadcasting `arr[0]` to all the lanes of a
> vector.
> The _purpose_ of using it in general is not a single one, as Crag said
> (quoting):
> > Shufflevector is a powerful instruction that can be used to copy a
> scalar to every elements of a vector, concatenate smaller vectors into a
> larger vector, rearrange elements one or two vectors, etc.
>
> You can see that your example falls in the first case he described. If you
> want a document showcasing examples of other cases, then I'm sorry, I don't
> have such a thing.
>
> But are you sure you understand _what_ the instruction does (again, not
> _why_ someone decided to use it, but _what_ it does) ? If not, I can try to
> explain along with the supporting godbolt I sent previously.
>
> Best,
> Stefanos
>
> Στις Δευ, 14 Ιουν 2021 στις 4:16 μ.μ., ο/η Sudakshina Dutta <
> sudakshina at iitgoa.ac.in> έγραψε:
>
>> Dear Stefanos,
>>
>> Sorry for getting back to you so late. Can you help with some document
>> (other than language reference manual) the purpose of using shufflevector ?
>>
>> Thanks,
>> Sudakshina
>>
>> On Tue, Jun 8, 2021 at 5:44 PM Sudakshina Dutta <sudakshina at iitgoa.ac.in>
>> wrote:
>>
>>> Dear Stefanos,
>>>
>>> I shall go through this thoroughly and let you know.
>>>
>>> Thanks a lot.
>>> Sudakshina
>>>
>>> On Mon, Jun 7, 2021 at 10:57 PM Stefanos Baziotis <
>>> stefanos.baziotis at gmail.com> wrote:
>>>
>>>> I think that you're interested in a more general tutorial on how to
>>>> vectorize code. I'm sorry I'm going to plug in my own articles [1] [2], but
>>>> I've just tried to write a short version here and I basically almost
>>>> rewrote them so there was
>>>> no point. For your case, I'd read at least the first half of "Residual
>>>> Iterations" from Part 1 and most of part 2. The concept of residual
>>>> iterations and reductions is something you'll see over and over and it
>>>> happens
>>>> in your example too. Let me explain the use of shufflevector pretty
>>>> quick and then I'll explain some specifics in your example. Also, before we
>>>> move any further, I recommend that we turn off loop unrolling in your
>>>> example because it doesn't add educational value and just clutters the
>>>> code, so here's your updated version [3].
>>>>
>>>> --- Shufflevector ---
>>>>
>>>> If you are not sure whether you understand what the instruction alone
>>>> does, let me know. You can also play with this LLVM IR [4]. I recommend
>>>> that you _run_ this code and change the mask in line 24.
>>>> The reason it's used in your example (see [3]) is because the compiler
>>>> wants to create a vector which has `arr[0]` (referring to the C++ entity)
>>>> in all its lanes. In line 4 (see [2]) it's loaded into %0, in line 17 it
>>>> inserts
>>>> it only to the 0th (first) lane of a vector (and for the time being, we
>>>> don't care about the other lanes which is why the first arg to
>>>> `insertelement` is `poison`) and in line 18, it shuffles the value. The
>>>> shuffle mask is the vector [0, 0, 0, 0]
>>>> (in short, zeroinitializer) which means in the 0th lane of the _result_
>>>> vector (%minmax.ident.splat) insert the value of the 0th lane of the
>>>> _input_ vector (%minmax.ident.splatinsert). In the 1st lane of the _result_
>>>> vector, insert the value of the 0th lane of the _input_ vector etc. So,
>>>> all lanes of the result vector get the 0th lane of the input vector.
>>>>
>>>> --- Residual Iterations ---
>>>>
>>>> Assuming that you have understood the use of residual iterations, let's
>>>> see where it happens in your example. First of all, notice that in `entry`,
>>>> there is a check whether `n` is bigger than 1, because if it's not, there
>>>> are no
>>>> iterations to do at all and in this case, the result, i.e., `max`, is
>>>> `arr[0]`, which is why it's loaded so early. Check the phi at line 45.
>>>> Then, if we make it to `for.body.preheader`, there is a check whether we
>>>> have at least
>>>> 4 iterations (the fact that this loop starts from one makes the
>>>> computations somewhat more complicated but not a lot). Because if we don't,
>>>> then there's no point going into the vectorized version. Then, skipping
>>>> what the vectorized
>>>> code does exactly for now (but check how the iteration count is rounded
>>>> down to the nearest multiple of 4 in line 15), check that _after_ the
>>>> vectorized code, the flow moves to `middle.block`,
>>>> which essentially checks whether there are any more iterations to do
>>>> and if so, jumps to the serial version of the loop (this is where the
>>>> residual iterations happen).
>>>>
>>>> --- Reduction ---
>>>>
>>>> In the article, I explain reductions using accumulation as an example,
>>>> but in your example there's another reduction, namely, finding the maximum
>>>> element. Notice that this is also a reduction: new_value =
>>>> max_of(old_value, data).
>>>> The reduction variable (acting as both old and new value) is `max` in
>>>> your example. The operation is max_of and the continuously new data is
>>>> arr[i].
>>>>
>>>> Finding the maximum element is a commutative operation. For instance,
>>>> to find the maximum element among 4, you can split them in two pairs _any_
>>>> way you want, find the maximum of each pair _independently_
>>>> of the other pair (those two maximums are partial results and since
>>>> their computation is independent of each other, it can be done in
>>>> parallel), and finally find the maximum among these two maximums.
>>>>
>>>> What happens in the vectorized code is that each lane finds a partial
>>>> maximum. In line 27 it loads 4 elements and in the next two lines, it
>>>> "compares" them with the previous 4 elements (%vec.phi, which keeps the 4
>>>> partial maximums which are updated
>>>> in every iteration). It computes a vector which has 4 maximums, one for
>>>> each pair of lanes (between the new 4 elements in %wide.load, in
>>>> %wide.load, and %vec.phi). You might want to run an example in paper to see
>>>> the effect. Consider that you want to compute
>>>> the maximum of 9 numbers, in this order: 2, 10, 7, 3, 1, 8, 9, 11, 4
>>>>
>>>> First, you get 2 (arr[0]) and broadcast it in a vector: <2, 2, 2, 2>.
>>>> This is the starting value of %vec.phi. Then, you load 4 values: %wide.load
>>>> = <10, 7, 3, 1>
>>>> Then, you do an element-wise comparison of the two and for each pair of
>>>> lanes, you keep the maximum. In this case, 10 > 2, 7 > 2, 3 > 2 but 1 < 2.
>>>> So, %5 = <10, 7, 3, 2>, which becomes the new %vec.phi.
>>>> Next iteration, you load another 4: %wide.load = <8, 9, 11, 4> and you
>>>> compare with (the new) %vec.phi: 8 < 10, 9 > 7, 11 > 3, 4 > 2. So, %5 =
>>>> <10, 9, 11, 4>. Now, you have computed 4 partial maximums
>>>> and you only need to find the maximum among those 4 to find the total
>>>> maximum. That happens in line 35 and you're done.
>>>>
>>>> Hope this helps! Let me know if there are any questions.
>>>>
>>>> Best,
>>>> Stefanos
>>>>
>>>> [1]
>>>> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-1.html
>>>> [2]
>>>> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-2.html
>>>> [3] https://godbolt.org/z/nMe9a6YWa
>>>> [4] https://godbolt.org/z/Wf9fTq53G
>>>>
>>>> Στις Δευ, 7 Ιουν 2021 στις 8:49 π.μ., ο/η Craig Topper <
>>>> craig.topper at gmail.com> έγραψε:
>>>>
>>>>> Shufflevector is a powerful instruction that can be used to copy a
>>>>> scalar to every elements of a vector, concatenate smaller vectors into a
>>>>> larger vector, rearrange elements one or two vectors, etc. Without example
>>>>> code I can't say for sure what the use of shufflevector in your case is for.
>>>>>
>>>>> ~Craig
>>>>>
>>>>>
>>>>> On Sun, Jun 6, 2021 at 10:44 PM Sudakshina Dutta <
>>>>> sudakshina at iitgoa.ac.in> wrote:
>>>>>
>>>>>> Dear Craig,
>>>>>>
>>>>>> Thanks for your reply. It seems the control flow of the vectorized
>>>>>> program is very different from that of the source code. Is there any
>>>>>> document describing difference in control flow, the reason for using
>>>>>> shufflevector, etc ?
>>>>>>
>>>>>> Sudakshina
>>>>>>
>>>>>> On Mon, 7 Jun 2021, 07:41 Craig Topper, <craig.topper at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> There's some notes here https://llvm.org/docs/Vectorizers.html but
>>>>>>> I didn't look too closely at it.  If you compile with -fno-discard-values
>>>>>>> names or use a debug build, some of the basic blocks will be named similar
>>>>>>> to the CFG diagram in this section
>>>>>>> https://llvm.org/docs/Vectorizers.html#epilogue-vectorization
>>>>>>>
>>>>>>> ~Craig
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jun 6, 2021 at 7:04 PM Sudakshina Dutta via llvm-dev <
>>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>> Dear Stefanos,
>>>>>>>>
>>>>>>>> I want to understand how the generated code works. For example, the
>>>>>>>> code that I generated using -Rpass=loop-vectorize
>>>>>>>> -Rpass-analysis=loop-vectorize has many shufflevector and other
>>>>>>>> instructions. I wanted to have a document (if there exists any) where an
>>>>>>>> example on loop-vectorization is done with some explanation on the workflow
>>>>>>>> of the code.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sudakshina
>>>>>>>>
>>>>>>>> On Mon, Jun 7, 2021 at 7:03 AM Stefanos Baziotis <
>>>>>>>> stefanos.baziotis at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sudakshina,
>>>>>>>>>
>>>>>>>>> First, it helps if you can put your code in a godbolt snippet,
>>>>>>>>> like this [1]. It helps people in multiple ways (e.g., they don't have to
>>>>>>>>> download files, they can see exactly what cmd arguments you used, they can
>>>>>>>>> tweak the cmd arguments without having LLVM on their machine etc.).
>>>>>>>>>
>>>>>>>>> Is there any comprehensive tutorial/document to understand
>>>>>>>>>> generated instructions or the semantics of the vectorized code ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is quite generic, what is more specifically that you want to
>>>>>>>>> understand? Do you want to understand what each individual instruction
>>>>>>>>> does? Do you maybe understand that but
>>>>>>>>> you don't know what is the general method to generate, let's say
>>>>>>>>> by hand, vectorized code (or more specifically, branching vectorized code).
>>>>>>>>> Or maybe, you want to understand
>>>>>>>>> how _LLVM_ generates this code, i.e., the inner workings of the
>>>>>>>>> vectorization passes.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefanos
>>>>>>>>>
>>>>>>>>> [1] https://godbolt.org/z/8eKqnrMPn
>>>>>>>>>
>>>>>>>>> Στις Κυρ, 6 Ιουν 2021 στις 6:17 π.μ., ο/η Sudakshina Dutta via
>>>>>>>>> llvm-dev <llvm-dev at lists.llvm.org> έγραψε:
>>>>>>>>>
>>>>>>>>>> Dear all,
>>>>>>>>>>
>>>>>>>>>> Greetings. I have generated a vectorized code from a C source
>>>>>>>>>> file (attached). Is there any comprehensive tutorial/document to understand
>>>>>>>>>> generated instructions or the semantics of the vectorized code ?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Sudakshina
>>>>>>>>>> _______________________________________________
>>>>>>>>>> LLVM Developers mailing list
>>>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>
>>>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210616/b2c595d1/attachment.html>


More information about the llvm-dev mailing list