[PATCH] D18566: [x86] use SSE/AVX ops for non-zero memsets (PR27100)

Zia Ansari via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 29 13:11:09 PDT 2016


zansari added a comment.

  The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.

Of all the ugliness here, this would be my biggest concern.
I agree that the imul is ugly, but so is all of the other extra code generated to broadcast the byte into an xmm. I'm guessing that this is why the "zeromemset" guard was put there specifically to allow memsets with cheap immediates through.
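To illustrate the cost difference, here's a minimal sketch of my own (not code from the patch): a zero memset gets its splat for free, while a variable byte has to be widened first, e.g. with a multiply by 0x01010101, which is exactly where the imul in the dump below comes from:

  #include <string.h>
  #include <stdint.h>

  /* Hypothetical reductions, not the actual tests from memset-2.ll: */
  void zero_fill(char *p)         { memset(p, 0, 16); }  /* splat is free (pxor) */
  void byte_fill(char *p, char v) { memset(p, v, 16); }  /* byte must be broadcast */

  /* Widening the byte into a 32-bit lane with one multiply:
     v * 0x01010101 replicates v into all four bytes; 0x1010101 is
     the 16843009 immediate on the imul/mull in the dump below. */
  uint32_t splat32(uint8_t v) { return (uint32_t)v * 0x01010101u; }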

It looks like the code that expands the memset is pretty inefficient. This is also what I see with a memset(m, v, 16):

  movzbl	4(%esp), %ecx
  movl	$16843009, %edx         # imm = 0x1010101
  movl	%ecx, %eax
  mull	%edx
  movd	%eax, %xmm0
  imull	$16843009, %ecx, %eax   # imm = 0x1010101
  addl	%edx, %eax
  movd	%eax, %xmm1
  punpckldq	%xmm1, %xmm0    # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
  movq	%xmm0, a+8
  movq	%xmm0, a

... in addition to all the gackiness, notice that we're only doing 8-byte stores after all of that.
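For reference, the dump above is roughly what you get from a reduction like this (my own sketch; the actual memset-2.ll tests may differ):

  #include <string.h>

  char a[16];

  void splat_store(char v) {
    memset(a, v, 16);  /* produces the stores to 'a' and 'a+8' above */
  }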

I like the change, but is there any chance we could fix this issue before committing? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16-byte stores.
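What I'd hope the expansion to look like is something along these lines (a sketch with SSE2 intrinsics; the function name is mine):

  #include <stdint.h>
  #include <emmintrin.h>

  void splat16(char *p, uint8_t v) {
    uint32_t w = v | ((uint32_t)v << 8);    /* 2 copies of v              */
    w |= w << 16;                           /* 4 copies: the shifts/ors   */
    __m128i x = _mm_cvtsi32_si128((int)w);  /* movd into xmm              */
    x = _mm_shuffle_epi32(x, 0);            /* pshufd $0, the one shuffle */
    _mm_storeu_si128((__m128i *)p, x);      /* one full 16-byte store     */
  }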

Thanks,
Zia.


http://reviews.llvm.org/D18566




