[llvm-dev] X86 Intrinsics : _mm_storel_epi64/ _mm_loadl_epi64 with -m32
Bharathi Seshadri via llvm-dev
llvm-dev at lists.llvm.org
Thu May 24 11:06:03 PDT 2018
Hi,
I’m using _mm_storel_epi64/_mm_loadl_epi64 in the test case below
and compiling for 32-bit (using -m32 and -msse4.2). The 64-bit load
and 64-bit store operations are replaced with two 32-bit mov
instructions, presumably due to the use of uint64_t type. If I use
__m128i instead of uint64_t everywhere, then the read and write happen
as 64-bit operations using the xmm registers as expected.
void indvbl_write64(volatile void *p, uint64_t v)
{
    __m128i tmp = _mm_loadl_epi64((__m128i const *)&v);
    _mm_storel_epi64((__m128i *)p, tmp);
}

uint64_t indvbl_read64(volatile void *p)
{
    __m128i tmp = _mm_loadl_epi64((__m128i const *)p);
    return *(uint64_t *)&tmp;
}
Options used to compile: clang -O2 -c -msse4.2 -m32 test.c
Generated code:
00000000 <indvbl_write64>:
0: 8b 44 24 08 mov 0x8(%esp),%eax
4: 8b 54 24 04 mov 0x4(%esp),%edx
8: 8b 4c 24 0c mov 0xc(%esp),%ecx
c: 89 4a 04 mov %ecx,0x4(%edx)
f: 89 02 mov %eax,(%edx)
11: c3 ret
12: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%eax,%eax,1)
19: 00 00 00
1c: 0f 1f 40 00 nopl 0x0(%eax)
00000020 <indvbl_read64>:
20: 8b 4c 24 04 mov 0x4(%esp),%ecx
24: 8b 01 mov (%ecx),%eax
26: 8b 51 04 mov 0x4(%ecx),%edx
29: c3 ret
The front end generates insertelement <2 x i64> and extractelement <2
x i64> for the loads and stores as expected, the optimizer turns these
into load i64 and store i64, and instruction selection then lowers
those into the two 32-bit mov instructions shown above.
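For comparison, here is a sketch of an alternative I could fall back on, assuming C11 atomics are acceptable in this code base (function names are illustrative, and on some 32-bit targets these may lower to cmpxchg8b or a libatomic call rather than an SSE move):

```c
#include <stdatomic.h>
#include <stdint.h>

/* C11 atomics request a genuine single 64-bit access; with -m32 the
   compiler must not split it into two independent 32-bit moves. */
static void atomic_write64(_Atomic uint64_t *p, uint64_t v)
{
    atomic_store_explicit(p, v, memory_order_relaxed);
}

static uint64_t atomic_read64(_Atomic uint64_t *p)
{
    return atomic_load_explicit(p, memory_order_relaxed);
}
```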
Would it be possible and safe to generate a single 64-bit load/store
in this case with -m32? If so, could I have some pointers to the
parts of the code I should be looking at to make this improvement?
Thanks,
Bharathi