[llvm-dev] Question about llvm vectors

Fri Aug 21 13:31:34 PDT 2020

Thank you very much for the explanation.

I have one more question: in LLVM IR it is possible to call a math function
such as sin() on a vector, yet I could not find how to do the same with clang.
I have tried various things (shown here with exp()):

#include <cmath>

using vec = float __attribute__((__vector_size__(4 * 4)));

vec fct(vec a)
{
  vec b = std::exp(a);
  //vec b = __builtin_exp(a);
  //vec b{std::exp(a[0]), std::exp(a[1]), std::exp(a[2]), std::exp(a[3])};
  //vec b{__builtin_expf(a[0]), __builtin_expf(a[1]), __builtin_expf(a[2]),
  //      __builtin_expf(a[3])};
  return b;
}

Do you know how to do that?
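
For reference, the lane-by-lane variant from the commented-out lines above can
be written as a loop. This is only a minimal sketch: the helper name exp_lanes
is hypothetical, and whether the optimizer turns the loop into a single vector
call depends on the target and on the math library available, which is not
something this sketch guarantees.

```cpp
#include <cmath>

using vec = float __attribute__((__vector_size__(4 * 4)));

// Hypothetical helper: applies std::exp to each of the 4 lanes in turn.
// The vector extension allows lanes to be read and written with [].
vec exp_lanes(vec a)
{
  vec b = a;  // start from a copy so every lane is initialized
  for (int i = 0; i < 4; ++i)
    b[i] = std::exp(a[i]);
  return b;
}
```

Calling exp_lanes(a) in place of std::exp(a) in fct() above compiles, because
std::exp is only ever applied to scalar floats.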

Regards,
Alexandre Bique

On Fri, Aug 21, 2020 at 9:09 PM Craig Topper <craig.topper at gmail.com> wrote:

> __builtin_shufflevector was supposed to be linked here
> https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
> but due to a mistake in the source file it's generated from, a link was made
> to __builtin_shufflevector instead. I've fixed that and it should hopefully
> update in the next day or two.
>
> We have internal intrinsics for reduce_add that are used by the
> autovectorizers. I could see it making sense to expose those to C as a
> builtin. For X86 I think we always reduce at each stage by moving the upper
> half of the vector to the lower half with a shuffle and then adding it to
> the lower half. I think on some CPUs we use haddps/haddpd to do the last
> stage of combining element 1 with element 0. But on most CPUs we use a
> shuffle and an addps/addpd. Intel CPUs use 2 shuffles and addps/addpd
> internally to implement haddps/haddpd. And on Intel CPUs there's only one
> execution unit that can do the 2 shuffles, so they execute serially before
> the addps/addpd. So for reductions it is better to just emit a single
> shuffle in assembly than to use haddps/pd.
>
> ~Craig
>
>
> On Thu, Aug 20, 2020 at 2:17 AM Alexandre Bique <bique.alexandre at gmail.com>
> wrote:
>
>> Hi Craig,
>>
>> Thank you very much for your answer.
>>
>> I did not want to discuss the exact semantics and name of one operation
>> but instead raise the question "would it be beneficial to have more vector
>> builtins?".
>>
>> You wrote that the compiler will recognize a pattern and replace it by
>> __builtin_ia32_haddps when possible, but how can I be sure of that? I would
>> have to disassemble the generated code, right? That is very impractical,
>> isn't it? And it leads me to understand that each CPU target has a bank of
>> patterns which it can recognize, but wouldn't it be very similar to have
>> advanced generic vector operations and CPU specific implementations for
>> those builtins?
>>
>> Regarding hadd: I agree, the name does not describe very well what it is
>> doing. And yes, hadd could be summing all the vector elements, but I think
>> that the usual terminology for that is reduce_add.
>>
>> In my case I use it for computing the mono signal of a stereo interleaved
>> signal:
>>
>> a = load(in);
>> b = load(in + K);
>> l = shuffle(a, b, 0, 2, 4, 6, ...); // l and r have the same size as a
>> r = shuffle(a, b, 1, 3, 5, 7, ...);
>> m = .5 * (l + r); // m has the same size as a and b, which is maybe
>>                   // optimal for memory I/O?
>> store(m, out);
>>
>> As you said, I could have m be half the size of a, and then I would not
>> need to load b. Which approach would deliver the best performance? Does
>> the compiler recognize both? Maybe there is another valid approach; will
>> the compiler recognize it?
>>
>> I would also like to discuss reduce_add: there might be multiple correct
>> ways of doing it, but is one of them faster? Is the same approach always
>> the best, or does it depend on the CPU? I believe that those questions are
>> best answered by the compiler.
>>
>> Then a side note regarding the clang documentation:
>> __builtin_shufflevector is not referenced there:
>> https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors
>>
>> Best regards,
>> Alexandre Bique
>>
>>
>> On Wed, Aug 19, 2020 at 8:34 PM Craig Topper <craig.topper at gmail.com>
>> wrote:
>>
>>> I'm not sure everyone would agree that a __builtin_vector_hadd should
>>> behave the way the X86 instruction does. It takes two vectors and produces
>>> a result with elements from both vectors. Someone might argue that a
>>> horizontal add should just take one source and produce a vector with half
>>> the number of elements. Someone else might argue that a horizontal add
>>> should sum all the elements to a single scalar value. With different
>>> implementation choices like that, it's hard to say it should be a generic
>>> operation when the behavior might only make sense for one target's
>>> instruction set.
>>>
>>> The behavior of the 256-bit vhaddps instruction on X86 is also weird
>>> since it treats the upper and lower 128 bits of the sources and destination
>>> independently. That quirk wouldn't make sense in a generic operation.
>>>
>>> You can emulate __builtin_ia32_haddps generically using
>>> __builtin_shufflevector and the + operator.  The X86 backend should
>>> recognize it and use haddps.
>>>
>>> ~Craig
>>>
>>>
>>> On Wed, Aug 19, 2020 at 10:54 AM Alexandre Bique via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> I love llvm vectors, yet I wonder why some advanced vector operations
>>>> are specific to some CPU targets?
>>>>
>>>> Let me take an example:
>>>>
>>>> /// Horizontally adds the adjacent pairs of values contained in two
>>>> ///    128-bit vectors of [4 x float].
>>>> ///
>>>> /// \headerfile <x86intrin.h>
>>>> ///
>>>> /// This intrinsic corresponds to the <c> VHADDPS </c> instruction.
>>>> ///
>>>> /// \param __a
>>>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>>>> ///    The horizontal sums of the values are stored in the lower bits of the
>>>> ///    destination.
>>>> /// \param __b
>>>> ///    A 128-bit vector of [4 x float] containing one of the source operands.
>>>> ///    The horizontal sums of the values are stored in the upper bits of the
>>>> ///    destination.
>>>> /// \returns A 128-bit vector of [4 x float] containing the horizontal sums of
>>>> ///    both operands.
>>>> static __inline__ __m128 __DEFAULT_FN_ATTRS
>>>> _mm_hadd_ps(__m128 __a, __m128 __b)
>>>> {
>>>>   return __builtin_ia32_haddps((__v4sf)__a, (__v4sf)__b);
>>>> }
>>>>
>>>> Here clang will translate _mm_hadd_ps to a CPU specific feature.
>>>> Why not create __builtin_vector_hadd(a, b) which would select the CPU
>>>> specific instruction or a fallback generic implementation?
>>>>
>>>> Many thanks,
>>>> Alex
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>