[llvm-dev] Question about llvm vectors

Wed Aug 19 11:34:19 PDT 2020

I'm not sure everyone would agree that a __builtin_vector_hadd should
behave the way the X86 instruction does. The X86 instruction takes two
vectors and produces a result with elements from both vectors. Someone
might argue that a horizontal add should just take one source and produce a
vector with half the number of elements. Someone else might argue that a
horizontal add should sum all the elements to a single scalar value. With
different implementation choices like that, it's hard to say it should be a
generic operation when the behavior might only make sense for one target's
instruction set.
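
To make the three readings concrete, here is a scalar sketch in plain C.
The function names are hypothetical and exist only for illustration; none
of them is an existing builtin or LLVM API.

```c
#include <stddef.h>

/* Reading 1: the X86 haddps shape. Two 4-element sources, one 4-element
   result: pair sums of a in the low half, pair sums of b in the high half. */
static void hadd_two_sources(const float a[4], const float b[4], float out[4]) {
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}

/* Reading 2: one source of n elements (n even), result of n/2 elements. */
static void hadd_halving(const float *a, size_t n, float *out) {
    for (size_t i = 0; i < n / 2; ++i)
        out[i] = a[2 * i] + a[2 * i + 1];
}

/* Reading 3: reduce every element to a single scalar. */
static float hadd_reduce(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```

All three are reasonable things to call a "horizontal add", yet they differ
in the number of operands, the result type, or both.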

The behavior of the 256-bit vhaddps instruction on X86 is also weird, since
it treats the upper and lower 128-bit halves of the sources and destination
independently. That quirk wouldn't make sense in a generic operation.
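
As an illustration of that per-lane behavior, here is a scalar model of the
element layout the 256-bit vhaddps produces. The function name
hadd256_model is hypothetical; this is a sketch of the semantics, not of
how the instruction is implemented.

```c
/* Scalar model of the 256-bit vhaddps element layout.
   Each 128-bit lane (4 floats) is processed independently. */
static void hadd256_model(const float a[8], const float b[8], float out[8]) {
    for (int lane = 0; lane < 2; ++lane) {
        const int base = 4 * lane; /* first element of this 128-bit lane */
        out[base + 0] = a[base + 0] + a[base + 1];
        out[base + 1] = a[base + 2] + a[base + 3];
        out[base + 2] = b[base + 0] + b[base + 1];
        out[base + 3] = b[base + 2] + b[base + 3];
    }
}
```

Note that the sums from a and b end up interleaved lane by lane, rather
than all of a's sums followed by all of b's.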

You can emulate __builtin_ia32_haddps generically using
__builtin_shufflevector and the + operator.  The X86 backend should
recognize it and use haddps.

~Craig

On Wed, Aug 19, 2020 at 10:54 AM Alexandre Bique via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi,
>
> I love llvm vectors, yet I wonder why some advanced vector operations are
> specific to some CPU targets?
>
> Let me take an example:
>
> /// Horizontally adds the adjacent pairs of values contained in two
> ///    128-bit vectors of [4 x float].
> ///
> /// \headerfile <x86intrin.h>
> ///
> /// This intrinsic corresponds to the <c> VHADDPS </c> instruction.
> ///
> /// \param __a
> ///    A 128-bit vector of [4 x float] containing one of the source operands.
> ///    The horizontal sums of the values are stored in the lower bits of the
> ///    destination.
> /// \param __b
> ///    A 128-bit vector of [4 x float] containing one of the source operands.
> ///    The horizontal sums of the values are stored in the upper bits of the
> ///    destination.
> /// \returns A 128-bit vector of [4 x float] containing the horizontal sums of
> ///    both operands.
> static __inline__ __m128 __DEFAULT_FN_ATTRS
> _mm_hadd_ps(__m128 __a, __m128 __b)
> {
>   return __builtin_ia32_haddps((__v4sf)__a, (__v4sf)__b);
> }
>
> Here clang will translate _mm_hadd_ps to a CPU-specific builtin.
> Why not create __builtin_vector_hadd(a, b), which would select the
> CPU-specific instruction or fall back to a generic implementation?
>
> Many thanks,
> Alex