[llvm-dev] Document to understand vectorized code

Sudakshina Dutta via llvm-dev llvm-dev at lists.llvm.org
Tue Jun 8 05:14:50 PDT 2021


Dear Stefanos,

I shall go through this thoroughly and let you know.

Thanks a lot.
Sudakshina

On Mon, Jun 7, 2021 at 10:57 PM Stefanos Baziotis <
stefanos.baziotis at gmail.com> wrote:

> I think that you're interested in a more general tutorial on how to
> vectorize code. I'm sorry I'm going to plug in my own articles [1] [2], but
> I've just tried to write a short version here and I basically almost
> rewrote them so there was
> no point. For your case, I'd read at least the first half of "Residual
> Iterations" from Part 1 and most of part 2. The concept of residual
> iterations and reductions is something you'll see over and over and it
> happens
> in your example too. Let me explain the use of shufflevector pretty quick
> and then I'll explain some specifics in your example. Also, before we move
> any further, I recommend that we turn off loop unrolling in your
> example because it doesn't add educational value and just clutters the
> code, so here's your updated version [3].
>
> --- Shufflevector ---
>
> If you are not sure whether you understand what the instruction alone
> does, let me know. You can also play with this LLVM IR [4]. I recommend
> that you _run_ this code and change the mask in line 24.
> The reason it's used in your example (see [3]) is because the compiler
> wants to create a vector which has `arr[0]` (referring to the C++ entity)
> in all its lanes. In line 4 (see [2]) it's loaded into %0, in line 17 it
> inserts
> it only to the 0th (first) lane of a vector (and for the time being, we
> don't care about the other lanes which is why the first arg to
> `insertelement` is `poison`) and in line 18, it shuffles the value. The
> shuffle mask is the vector [0, 0, 0, 0]
> (in short, zeroinitializer) which means in the 0th lane of the _result_
> vector (%minmax.ident.splat) insert the value of the 0th lane of the
> _input_ vector (%minmax.ident.splatinsert). In the 1st lane of the _result_
> vector, insert the value of the 0th lane of the _input_ vector etc. So,
> all lanes of the result vector get the 0th lane of the input vector.
>
> --- Residual Iterations ---
>
> Assuming that you have understood the use of residual iterations, let's
> see where it happens in your example. First of all, notice that in `entry`,
> there is a check whether `n` is bigger than 1, because if it's not, there
> are no
> iterations to do at all and in this case, the result, i.e., `max`, is
> `arr[0]`, which is why it's loaded so early. Check the phi at line 45.
> Then, if we make it to `for.body.preheader`, there is a check whether we
> have at least
> 4 iterations (the fact that this loop starts from one makes the
> computations somewhat more complicated but not a lot). Because if we don't,
> then there's no point going into the vectorized version. Then, skipping
> what the vectorized
> code does exactly for now (but check how the iteration count is rounded
> down to the nearest multiple of 4 in line 15), check that _after_ the
> vectorized code, the flow moves to `middle.block`,
> which essentially checks whether there are any more iterations to do and
> if so, jumps to the serial version of the loop (this is where the residual
> iterations happen).
>
> --- Reduction ---
>
> In the article, I explain reductions using accumulation as an example, but
> in your example there's another reduction, namely, finding the maximum
> element. Notice that this is also a reduction: new_value =
> max_of(old_value, data).
> The reduction variable (acting as both old and new value) is `max` in your
> example. The operation is max_of and the continuously new data is arr[i].
>
> Finding the maximum element is a commutative operation. For instance, to
> find the maximum element among 4, you can split them in two pairs _any_ way
> you want, find the maximum of each pair _independently_
> of the other pair (those two maximums are partial results and since their
> computation is independent of each other, it can be done in parallel), and
> finally find the maximum among these two maximums.
>
> What happens in the vectorized code is that each lane finds a partial
> maximum. In line 27 it loads 4 elements and in the next two lines, it
> "compares" them with the previous 4 elements (%vec.phi, which keeps the 4
> partial maximums which are updated
> in every iteration). It computes a vector which has 4 maximums, one for
> each pair of lanes (between the new 4 elements in %wide.load, in
> %wide.load, and %vec.phi). You might want to run an example in paper to see
> the effect. Consider that you want to compute
> the maximum of 9 numbers, in this order: 2, 10, 7, 3, 1, 8, 9, 11, 4
>
> First, you get 2 (arr[0]) and broadcast it in a vector: <2, 2, 2, 2>. This
> is the starting value of %vec.phi. Then, you load 4 values: %wide.load =
> <10, 7, 3, 1>
> Then, you do an element-wise comparison of the two and for each pair of
> lanes, you keep the maximum. In this case, 10 > 2, 7 > 2, 3 > 2 but 1 < 2.
> So, %5 = <10, 7, 3, 2>, which becomes the new %vec.phi.
> Next iteration, you load another 4: %wide.load = <8, 9, 11, 4> and you
> compare with (the new) %vec.phi: 8 < 10, 9 > 7, 11 > 3, 4 > 2. So, %5 =
> <10, 9, 11, 4>. Now, you have computed 4 partial maximums
> and you only need to find the maximum among those 4 to find the total
> maximum. That happens in line 35 and you're done.
>
> Hope this helps! Let me know if there are any questions.
>
> Best,
> Stefanos
>
> [1]
> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-1.html
> [2]
> http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-2.html
> [3] https://godbolt.org/z/nMe9a6YWa
> [4] https://godbolt.org/z/Wf9fTq53G
>
> Στις Δευ, 7 Ιουν 2021 στις 8:49 π.μ., ο/η Craig Topper <
> craig.topper at gmail.com> έγραψε:
>
>> Shufflevector is a powerful instruction that can be used to copy a scalar
>> to every elements of a vector, concatenate smaller vectors into a larger
>> vector, rearrange elements one or two vectors, etc. Without example code I
>> can't say for sure what the use of shufflevector in your case is for.
>>
>> ~Craig
>>
>>
>> On Sun, Jun 6, 2021 at 10:44 PM Sudakshina Dutta <sudakshina at iitgoa.ac.in>
>> wrote:
>>
>>> Dear Craig,
>>>
>>> Thanks for your reply. It seems the control flow of the vectorized
>>> program is very different from that of the source code. Is there any
>>> document describing difference in control flow, the reason for using
>>> shufflevector, etc ?
>>>
>>> Sudakshina
>>>
>>> On Mon, 7 Jun 2021, 07:41 Craig Topper, <craig.topper at gmail.com> wrote:
>>>
>>>> There's some notes here https://llvm.org/docs/Vectorizers.html but I
>>>> didn't look too closely at it.  If you compile with -fno-discard-values
>>>> names or use a debug build, some of the basic blocks will be named similar
>>>> to the CFG diagram in this section
>>>> https://llvm.org/docs/Vectorizers.html#epilogue-vectorization
>>>>
>>>> ~Craig
>>>>
>>>>
>>>> On Sun, Jun 6, 2021 at 7:04 PM Sudakshina Dutta via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> Dear Stefanos,
>>>>>
>>>>> I want to understand how the generated code works. For example, the
>>>>> code that I generated using -Rpass=loop-vectorize
>>>>> -Rpass-analysis=loop-vectorize has many shufflevector and other
>>>>> instructions. I wanted to have a document (if there exists any) where an
>>>>> example on loop-vectorization is done with some explanation on the workflow
>>>>> of the code.
>>>>>
>>>>> Thanks,
>>>>> Sudakshina
>>>>>
>>>>> On Mon, Jun 7, 2021 at 7:03 AM Stefanos Baziotis <
>>>>> stefanos.baziotis at gmail.com> wrote:
>>>>>
>>>>>> Hi Sudakshina,
>>>>>>
>>>>>> First, it helps if you can put your code in a godbolt snippet, like
>>>>>> this [1]. It helps people in multiple ways (e.g., they don't have to
>>>>>> download files, they can see exactly what cmd arguments you used, they can
>>>>>> tweak the cmd arguments without having LLVM on their machine etc.).
>>>>>>
>>>>>> Is there any comprehensive tutorial/document to understand generated
>>>>>>> instructions or the semantics of the vectorized code ?
>>>>>>
>>>>>>
>>>>>> This is quite generic, what is more specifically that you want to
>>>>>> understand? Do you want to understand what each individual instruction
>>>>>> does? Do you maybe understand that but
>>>>>> you don't know what is the general method to generate, let's say by
>>>>>> hand, vectorized code (or more specifically, branching vectorized code). Or
>>>>>> maybe, you want to understand
>>>>>> how _LLVM_ generates this code, i.e., the inner workings of the
>>>>>> vectorization passes.
>>>>>>
>>>>>> Best,
>>>>>> Stefanos
>>>>>>
>>>>>> [1] https://godbolt.org/z/8eKqnrMPn
>>>>>>
>>>>>> Στις Κυρ, 6 Ιουν 2021 στις 6:17 π.μ., ο/η Sudakshina Dutta via
>>>>>> llvm-dev <llvm-dev at lists.llvm.org> έγραψε:
>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> Greetings. I have generated a vectorized code from a C source file
>>>>>>> (attached). Is there any comprehensive tutorial/document to understand
>>>>>>> generated instructions or the semantics of the vectorized code ?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sudakshina
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> llvm-dev at lists.llvm.org
>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210608/1ed92efe/attachment.html>


More information about the llvm-dev mailing list