[llvm-dev] Question about llvm vectors

Fri Aug 21 12:08:49 PDT 2020

__builtin_shufflevector was supposed to be linked here
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
but due to a mistake in the source file it's generated from, a link was made
to __builtin_shufflevector instead. I've fixed that and it should hopefully
update in the next day or two.

We have internal intrinsics for reduce_add that are used by the
autovectorizers. I could see it making sense to expose those to C as a
builtin. For X86 I think we always reduce at each stage by moving the upper
half of the vector to the lower half with a shuffle and then adding it to
the lower half. I think on some CPUs we use haddps/haddpd to do the last
stage of combining element 1 with element 0, but on most CPUs we use a
shuffle and an addps/addpd. Intel CPUs use 2 shuffles and an addps/addpd
internally to implement haddps/haddpd, and on Intel CPUs there's only one
execution unit that can do the 2 shuffles, so they execute serially before
the addps/addpd. So for reductions it is better to just emit a single
shuffle and an add in assembly than to use haddps/pd.
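
To make the staged reduction concrete, here is a minimal sketch for a 4 x float vector. The names v4f and reduce_add_v4f are made up for illustration and are not an existing builtin. The scalar branch is only a fallback for compilers that lack __builtin_shufflevector. Which instructions actually get selected still depends on the target and compiler version.

```c
/* 4 x float vector via the GCC/Clang vector_size extension.
   v4f and reduce_add_v4f are hypothetical names for this sketch. */
typedef float v4f __attribute__((vector_size(16)));

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Sum all four lanes in two stages: fold the upper half onto the lower
   half, then fold element 1 onto element 0. */
static float reduce_add_v4f(v4f v)
{
#if __has_builtin(__builtin_shufflevector)
    /* Stage 1: bring lanes 2,3 down to lanes 0,1 and add. */
    v4f hi = __builtin_shufflevector(v, v, 2, 3, 2, 3);
    v4f s = v + hi;                 /* s[0] = v0+v2, s[1] = v1+v3 */
    /* Stage 2: bring lane 1 down to lane 0 and add. */
    v4f odd = __builtin_shufflevector(s, s, 1, 1, 1, 1);
    s = s + odd;                    /* s[0] = (v0+v2) + (v1+v3) */
    return s[0];
#else
    /* Scalar fallback with the same association order. */
    return (v[0] + v[2]) + (v[1] + v[3]);
#endif
}
```

Only lane 0 of the final vector is meaningful; the other lanes hold partial sums that are discarded.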

~Craig

On Thu, Aug 20, 2020 at 2:17 AM Alexandre Bique <bique.alexandre at gmail.com>
wrote:

> Hi Craig,
>
> Thank you very much for your answer.
>
> I did not want to discuss the exact semantics and name of one operation
> but instead raise the question "would it be beneficial to have more vector
> builtins?".
>
> You wrote that the compiler will recognize a pattern and replace it by
> __builtin_ia32_haddps when possible, but how can I be sure of that? I would
> have to disassemble the generated code, right? That is very
> impractical, isn't it? And it leads me to understand that each CPU target has
> a bank of patterns which it can recognize, but wouldn't it be very similar
> to have advanced generic vector operations and CPU specific implementations
> for those builtins?
>
> Regarding hadd: I agree, the name does not describe very well what it is
> doing. And yes hadd could be summing all the vector elements, but I think
> that the usual terminology for that is reduce_add.
>
> In my case I use it for computing the mono signal of a stereo interleaved
> signal:
>
> a = load(in);
> b = load(in + K);
> l = shuffle(a, b, 0, 2, 4, 6, ...); // l and r have the same size as a
> r = shuffle(a, b, 1, 3, 5, 7, ...);
> // m has the same size as a and b, which is maybe optimal for memory I/O?
> m = .5 * (l + r);
> store(m, out);
>
> As you said, I could have m be half the size of a, and I would
> not need to load b. Which approach would deliver the best performance? Does
> the compiler recognize both? Maybe there is another valid approach, will
> the compiler recognize it?
>
> I would also like to discuss reduce_add: there might be multiple correct
> ways of doing it, but is there one that is faster? Is the same approach
> always the best, or does it depend on the CPU? I believe that those
> questions are best answered by the compiler.
>
> Then a side note regarding the clang documentation: __builtin_shufflevector
> is not referenced there:
> https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
>
> Best regards,
> Alexandre Bique
>
>
> On Wed, Aug 19, 2020 at 8:34 PM Craig Topper <craig.topper at gmail.com>
> wrote:
>
>> I'm not sure everyone would agree that a
>> __builtin_vector_hadd should do what the X86 instruction does. It takes two
>> vectors and produces a result with elements from both vectors. Someone
>> might argue that a horizontal add should just take one source and produce a
>> vector with half the number of elements. Someone else might argue that a
>> horizontal add should sum all the elements to a single scalar value. With
>> different implementation choices like that, it's hard to say it should be a
>> generic operation when the behavior might only make sense for one target's
>> instruction set.
>>
>> The behavior of the 256-bit vhaddps instruction on X86 is also weird
>> since it treats the upper and lower 128-bits of the sources and destination
>> independently. That quirk wouldn't make sense in a generic operation.
>>
>> You can emulate __builtin_ia32_haddps generically using
>> __builtin_shufflevector and the + operator.  The X86 backend should
>> recognize it and use haddps.
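For illustration, the emulation described in the quoted paragraph above could be sketched roughly as follows. The names v4f and hadd_v4f are hypothetical. The scalar branch is just a fallback for compilers without __builtin_shufflevector. Whether the backend really emits haddps for this pattern depends on the target features and compiler version, so checking the generated assembly is still the only way to be sure.

```c
/* 4 x float vector via the GCC/Clang vector_size extension.
   v4f and hadd_v4f are hypothetical names for this sketch. */
typedef float v4f __attribute__((vector_size(16)));

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Pairwise horizontal add with the same lane layout as haddps:
   result = { a0+a1, a2+a3, b0+b1, b2+b3 }. */
static v4f hadd_v4f(v4f a, v4f b)
{
#if __has_builtin(__builtin_shufflevector)
    /* Indices 0..3 select from a, indices 4..7 select from b. */
    v4f even = __builtin_shufflevector(a, b, 0, 2, 4, 6); /* a0 a2 b0 b2 */
    v4f odd  = __builtin_shufflevector(a, b, 1, 3, 5, 7); /* a1 a3 b1 b3 */
    return even + odd;
#else
    /* Scalar fallback producing the same lanes. */
    v4f r = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    return r;
#endif
}
```

The even/odd shuffles are the same de-interleaving step as in the stereo-to-mono pseudocode quoted earlier in this thread.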
>>
>> ~Craig
>>
>>
>> On Wed, Aug 19, 2020 at 10:54 AM Alexandre Bique via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hi,
>>>
>>> I love llvm vectors, yet I wonder why some advanced vector operations
>>> are specific to some CPU targets?
>>>
>>> Let me take an example:
>>>
>>> /// Horizontally adds the adjacent pairs of values contained in two
>>> ///    128-bit vectors of [4 x float].
>>> ///
>>> /// \headerfile <x86intrin.h>
>>> ///
>>> /// This intrinsic corresponds to the <c> VHADDPS </c> instruction.
>>> ///
>>> /// \param __a
>>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>>> ///    The horizontal sums of the values are stored in the lower bits of the
>>> ///    destination.
>>> /// \param __b
>>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>>> ///    The horizontal sums of the values are stored in the upper bits of the
>>> ///    destination.
>>> /// \returns A 128-bit vector of [4 x float] containing the horizontal sums of
>>> ///    both operands.
>>> static __inline__ __m128 __DEFAULT_FN_ATTRS
>>> _mm_hadd_ps(__m128 __a, __m128 __b)
>>> {
>>>   return __builtin_ia32_haddps((__v4sf)__a, (__v4sf)__b);
>>> }
>>>
>>> Here clang will translate _mm_hadd_ps to a CPU specific feature.
>>> Why not create __builtin_vector_hadd(a, b) which would select the CPU
>>> specific instruction or a fallback generic implementation?
>>>
>>> Many thanks,
>>> Alex
>>