[cfe-dev] An issue about re-implementing the AVX2 intrinsic using inline ASM

Sun Sep 20 22:42:10 PDT 2015

Greetings, everyone. Please allow me to describe my problem.
First, consider the following sample code:

#include <stdio.h>
#include <iostream>
#include <vector>
#include <immintrin.h>
using namespace std;
int main(int argc, char const *argv[])
{
    __m256i x, y;
    __m256i res = _mm256_and_si256(x, y);
    return 0;
}

It can be compiled easily using clang -mavx2 source.cc

Now we want to rewrite this _mm256_and_si256 function using inline ASM, as follows:
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
typedef float __m256 __attribute__ ((__vector_size__ (32)));
typedef double __m256d __attribute__((__vector_size__(32)));
typedef long long __m256i __attribute__((__vector_size__(32)));

typedef long long __v4di __attribute__ ((__vector_size__ (32)));
typedef int __v8si __attribute__ ((__vector_size__ (32)));
typedef short __v16hi __attribute__ ((__vector_size__ (32)));
typedef char __v32qi __attribute__ ((__vector_size__ (32)));
__attribute__((always_inline)) inline
__m256i _my_mm256_and_si256(__m256i s1, __m256i s2)
{
    __m256i result;
    __asm__ ("vpand %2, %1, %0" : "=x"(result) : "x"(s1), "xm"(s2) );
    return result;
}

int main(int argc, char const *argv[])
{
    __m256i x, y;
    __m256i res = _my_mm256_and_si256(x, y);
    return 0;
}

This new code also compiles with clang -mavx2 source.cc
However, if we remove the -mavx2 flag, clang emits the following error:

fatal error: error in backend: Do not know how to split the result of this operator!
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)

Someone gave me the following explanation:
If we omit the -mavx2 flag, clang/llvm is unable to determine the right machine target to bind the input memory parameter to the input register required by the vpand operator, since the vpand operator requires ymm[0..7] as its input/output.

This makes some sense, and I guess the gcc error output (20 : error: impossible constraint in 'asm') is complaining about a similar thing. However, it doesn't explain why the SSE4.2 asm code can be compiled without the -msse4.2 flag. So please allow me to show you one more example:

#include <stdio.h>
#include <iostream>
#include <vector>
#include <stdint.h>
#include <emmintrin.h>
using namespace std;

static  inline __attribute__ ((__always_inline__))
int new_cmpestri(
    __m128i str1, int len1, __m128i str2, int len2, const int mode) {
  int result;
  __asm__("pcmpestri %5, %2, %1"
      : "=c"(result) : "x"(str1), "xm"(str2), "a"(len1), "d"(len2), "i"(mode) : "cc");
  return result;
}
int main(int argc, char const *argv[])
{
    __m128i str1;
    int len1 = 0;
    __m128i str2;
    int len2 = 0;
    const int mode = 0;
    uint32_t result = new_cmpestri(str1, len1, str2, len2, mode);

    return 0;
}

The CPUID feature flag for pcmpestri is SSE4.2, yet this code compiles without the -msse4.2 flag.
I have conducted experiments with both gcc 4.9.2 and clang 3.3.

So in brief, I have two questions:

1.      Is it possible to compile the AVX2 ASM without the -mavx2 flag using clang?

2.      If the answer to question 1 is NO, then why can we do that for the SSE4.2 ASM without the -msse4.2 flag?
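
For question 1, one thing that may be worth trying is to enable AVX2 for a single function rather than for the whole translation unit. The following is only a sketch under assumptions: it relies on the compiler accepting the target("avx2") function attribute and __builtin_cpu_supports (GCC and newer clang releases do; whether clang 3.3 does is not verified here). The names and256_asm, and256_scalar and and256 are hypothetical. Operands are passed through pointers so that no 256-bit value crosses a call boundary from code built without AVX, and always_inline is deliberately left off because the caller is not compiled for AVX2.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) || defined(__i386__)
// Generic 256-bit vector type, same shape as __m256i in the code above.
typedef long long v4di __attribute__((__vector_size__(32)));

// AVX2 is enabled for this one function only (assumption: the compiler
// supports the target attribute), so the "x" constraint can be bound to
// a ymm register even though -mavx2 is not on the command line.
__attribute__((target("avx2")))
static void and256_asm(const int64_t *a, const int64_t *b, int64_t *out)
{
    v4di s1, s2, result;
    std::memcpy(&s1, a, sizeof s1);
    std::memcpy(&s2, b, sizeof s2);
    __asm__("vpand %2, %1, %0" : "=x"(result) : "x"(s1), "xm"(s2));
    std::memcpy(out, &result, sizeof result);
}
#endif

// Plain fallback for CPUs (or architectures) without AVX2.
static void and256_scalar(const int64_t *a, const int64_t *b, int64_t *out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] & b[i];
}

// Hypothetical dispatcher: take the asm path only if the CPU reports AVX2.
void and256(const int64_t *a, const int64_t *b, int64_t *out)
{
    assert(a && b && out);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        and256_asm(a, b, out);
        return;
    }
#endif
    and256_scalar(a, b, out);
}
```

Calling and256(a, b, out) on three arrays of four int64_t values writes the element-wise AND of a and b into out. If this builds without -mavx2, it would suggest that the failure above comes from the 256-bit "x" operand not being a legal register type for the default target, rather than from the vpand mnemonic itself.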

Thank you very much for taking the time to read this letter!