[llvm-dev] Document to understand vectorized code

Stefanos Baziotis via llvm-dev llvm-dev at lists.llvm.org
Mon Jun 14 06:56:03 PDT 2021


Hi Sudakshina,

I'm not sure what you're looking for. Assuming that you have understood
_what_ the instruction does, the _purpose_ of using it in this specific
case is what I explained above; broadcasting `arr[0]` to all the lanes of a
vector.
The _purpose_ of using it in general is not a single one, as Crag said
(quoting):
> Shufflevector is a powerful instruction that can be used to copy a scalar
to every elements of a vector, concatenate smaller vectors into a larger
vector, rearrange elements one or two vectors, etc.

You can see that your example falls in the first case he described. If you
want a document showcasing examples of other cases, then I'm sorry, I don't
have such a thing.

But are you sure you understand _what_ the instruction does (again, not
_why_ someone decided to use it, but _what_ it does) ? If not, I can try to
explain along with the supporting godbolt I sent previously.

Best,
Stefanos

Στις Δευ, 14 Ιουν 2021 στις 4:16 μ.μ., ο/η Sudakshina Dutta <
sudakshina at iitgoa.ac.in> έγραψε:

> Dear Stefanos,
>
> Sorry for getting back to you so late. Can you help with some document
> (other than language reference manual) the purpose of using shufflevector ?
>
> Thanks,
> Sudakshina
>
> On Tue, Jun 8, 2021 at 5:44 PM Sudakshina Dutta <sudakshina at iitgoa.ac.in>
> wrote:
>
>> Dear Stefanos,
>>
>> I shall go through this thoroughly and let you know.
>>
>> Thanks a lot.
>> Sudakshina
>>
>> On Mon, Jun 7, 2021 at 10:57 PM Stefanos Baziotis <
>> stefanos.baziotis at gmail.com> wrote:
>>
>>> I think that you're interested in a more general tutorial on how to
>>> vectorize code. I'm sorry I'm going to plug in my own articles [1] [2], but
>>> I've just tried to write a short version here and I basically almost
>>> rewrote them so there was
>>> no point. For your case, I'd read at least the first half of "Residual
>>> Iterations" from Part 1 and most of part 2. The concept of residual
>>> iterations and reductions is something you'll see over and over and it
>>> happens
>>> in your example too. Let me explain the use of shufflevector pretty
>>> quick and then I'll explain some specifics in your example. Also, before we
>>> move any further, I recommend that we turn off loop unrolling in your
>>> example because it doesn't add educational value and just clutters the
>>> code, so here's your updated version [3].
>>>
>>> --- Shufflevector ---
>>>
>>> If you are not sure whether you understand what the instruction alone
>>> does, let me know. You can also play with this LLVM IR [4]. I recommend
>>> that you _run_ this code and change the mask in line 24.
>>> The reason it's used in your example (see [3]) is because the compiler
>>> wants to create a vector which has `arr[0]` (referring to the C++ entity)
>>> in all its lanes. In line 4 (see [2]) it's loaded into %0, in line 17 it
>>> inserts
>>> it only to the 0th (first) lane of a vector (and for the time being, we
>>> don't care about the other lanes which is why the first arg to
>>> `insertelement` is `poison`) and in line 18, it shuffles the value. The
>>> shuffle mask is the vector [0, 0, 0, 0]
>>> (in short, zeroinitializer) which means in the 0th lane of the _result_
>>> vector (%minmax.ident.splat) insert the value of the 0th lane of the
>>> _input_ vector (%minmax.ident.splatinsert). In the 1st lane of the _result_
>>> vector, insert the value of the 0th lane of the _input_ vector etc. So,
>>> all lanes of the result vector get the 0th lane of the input vector.
>>>
>>> --- Residual Iterations ---
>>>
>>> Assuming that you have understood the use of residual iterations, let's
>>> see where it happens in your example. First of all, notice that in `entry`,
>>> there is a check whether `n` is bigger than 1, because if it's not, there
>>> are no
>>> iterations to do at all and in this case, the result, i.e., `max`, is
>>> `arr[0]`, which is why it's loaded so early. Check the phi at line 45.
>>> Then, if we make it to `for.body.preheader`, there is a check whether we
>>> have at least
>>> 4 iterations (the fact that this loop starts from one makes the
>>> computations somewhat more complicated but not a lot). Because if we don't,
>>> then there's no point going into the vectorized version. Then, skipping
>>> what the vectorized
>>> code does exactly for now (but check how the iteration count is rounded
>>> down to the nearest multiple of 4 in line 15), check that _after_ the
>>> vectorized code, the flow moves to `middle.block`,
>>> which essentially checks whether there are any more iterations to do and
>>> if so, jumps to the serial version of the loop (this is where the residual
>>> iterations happen).
>>>
>>> --- Reduction ---
>>>
>>> In the article, I explain reductions using accumulation as an example,
>>> but in your example there's another reduction, namely, finding the maximum
>>> element. Notice that this is also a reduction: new_value =
>>> max_of(old_value, data).
>>> The reduction variable (acting as both old and new value) is `max` in
>>> your example. The operation is max_of and the continuously new data is
>>> arr[i].
>>>
>>> Finding the maximum element is a commutative operation. For instance, to
>>> find the maximum element among 4, you can split them in two pairs _any_ way
>>> you want, find the maximum of each pair _independently_
>>> of the other pair (those two maximums are partial results and since
>>> their computation is independent of each other, it can be done in
>>> parallel), and finally find the maximum among these two maximums.
>>>
>>> What happens in the vectorized code is that each lane finds a partial
>>> maximum. In line 27 it loads 4 elements and in the next two lines, it
>>> "compares" them with the previous 4 elements (%vec.phi, which keeps the 4
>>> partial maximums which are updated
>>> in every iteration). It computes a vector which has 4 maximums, one for
>>> each pair of lanes (between the new 4 elements in %wide.load, in
>>> %wide.load, and %vec.phi). You might want to run an example in paper to see
>>> the effect. Consider that you want to compute
>>> the maximum of 9 numbers, in this order: 2, 10, 7, 3, 1, 8, 9, 11, 4
>>>
>>> First, you get 2 (arr[0]) and broadcast it in a vector: <2, 2, 2, 2>.
>>> This is the starting value of %vec.phi. Then, you load 4 values: %wide.load
>>> = <10, 7, 3, 1>
>>> Then, you do an element-wise comparison of the two and for each pair of
>>> lanes, you keep the maximum. In this case, 10 > 2, 7 > 2, 3 > 2 but 1 < 2.
>>> So, %5 = <10, 7, 3, 2>, which becomes the new %vec.phi.
>>> Next iteration, you load another 4: %wide.load = <8, 9, 11, 4> and you
>>> compare with (the new) %vec.phi: 8 < 10, 9 > 7, 11 > 3, 4 > 2. So, %5 =
>>> <10, 9, 11, 4>. Now, you have computed 4 partial maximums
>>> and you only need to find the maximum among those 4 to find the total
>>> maximum. That happens in line 35 and you're done.
>>>
>>> Hope this helps! Let me know if there are any questions.
>>>
>>> Best,
>>> Stefanos
>>>
>>> [1]
>>> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-1.html
>>> [2]
>>> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-2.html
>>> [3] https://godbolt.org/z/nMe9a6YWa
>>> [4] https://godbolt.org/z/Wf9fTq53G
>>>
>>> Στις Δευ, 7 Ιουν 2021 στις 8:49 π.μ., ο/η Craig Topper <
>>> craig.topper at gmail.com> έγραψε:
>>>
>>>> Shufflevector is a powerful instruction that can be used to copy a
>>>> scalar to every elements of a vector, concatenate smaller vectors into a
>>>> larger vector, rearrange elements one or two vectors, etc. Without example
>>>> code I can't say for sure what the use of shufflevector in your case is for.
>>>>
>>>> ~Craig
>>>>
>>>>
>>>> On Sun, Jun 6, 2021 at 10:44 PM Sudakshina Dutta <
>>>> sudakshina at iitgoa.ac.in> wrote:
>>>>
>>>>> Dear Craig,
>>>>>
>>>>> Thanks for your reply. It seems the control flow of the vectorized
>>>>> program is very different from that of the source code. Is there any
>>>>> document describing difference in control flow, the reason for using
>>>>> shufflevector, etc ?
>>>>>
>>>>> Sudakshina
>>>>>
>>>>> On Mon, 7 Jun 2021, 07:41 Craig Topper, <craig.topper at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> There's some notes here https://llvm.org/docs/Vectorizers.html but I
>>>>>> didn't look too closely at it.  If you compile with -fno-discard-values
>>>>>> names or use a debug build, some of the basic blocks will be named similar
>>>>>> to the CFG diagram in this section
>>>>>> https://llvm.org/docs/Vectorizers.html#epilogue-vectorization
>>>>>>
>>>>>> ~Craig
>>>>>>
>>>>>>
>>>>>> On Sun, Jun 6, 2021 at 7:04 PM Sudakshina Dutta via llvm-dev <
>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> Dear Stefanos,
>>>>>>>
>>>>>>> I want to understand how the generated code works. For example, the
>>>>>>> code that I generated using -Rpass=loop-vectorize
>>>>>>> -Rpass-analysis=loop-vectorize has many shufflevector and other
>>>>>>> instructions. I wanted to have a document (if there exists any) where an
>>>>>>> example on loop-vectorization is done with some explanation on the workflow
>>>>>>> of the code.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sudakshina
>>>>>>>
>>>>>>> On Mon, Jun 7, 2021 at 7:03 AM Stefanos Baziotis <
>>>>>>> stefanos.baziotis at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Sudakshina,
>>>>>>>>
>>>>>>>> First, it helps if you can put your code in a godbolt snippet, like
>>>>>>>> this [1]. It helps people in multiple ways (e.g., they don't have to
>>>>>>>> download files, they can see exactly what cmd arguments you used, they can
>>>>>>>> tweak the cmd arguments without having LLVM on their machine etc.).
>>>>>>>>
>>>>>>>> Is there any comprehensive tutorial/document to understand
>>>>>>>>> generated instructions or the semantics of the vectorized code ?
>>>>>>>>
>>>>>>>>
>>>>>>>> This is quite generic, what is more specifically that you want to
>>>>>>>> understand? Do you want to understand what each individual instruction
>>>>>>>> does? Do you maybe understand that but
>>>>>>>> you don't know what is the general method to generate, let's say by
>>>>>>>> hand, vectorized code (or more specifically, branching vectorized code). Or
>>>>>>>> maybe, you want to understand
>>>>>>>> how _LLVM_ generates this code, i.e., the inner workings of the
>>>>>>>> vectorization passes.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefanos
>>>>>>>>
>>>>>>>> [1] https://godbolt.org/z/8eKqnrMPn
>>>>>>>>
>>>>>>>> Στις Κυρ, 6 Ιουν 2021 στις 6:17 π.μ., ο/η Sudakshina Dutta via
>>>>>>>> llvm-dev <llvm-dev at lists.llvm.org> έγραψε:
>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> Greetings. I have generated a vectorized code from a C source file
>>>>>>>>> (attached). Is there any comprehensive tutorial/document to understand
>>>>>>>>> generated instructions or the semantics of the vectorized code ?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Sudakshina
>>>>>>>>> _______________________________________________
>>>>>>>>> LLVM Developers mailing list
>>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> llvm-dev at lists.llvm.org
>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210614/3c8e9005/attachment-0001.html>


More information about the llvm-dev mailing list