[PATCH] D18566: [x86] use SSE/AVX ops for non-zero memsets (PR27100)

Tue Mar 29 16:11:52 PDT 2016

spatel added a comment.

In http://reviews.llvm.org/D18566#386084, @zansari wrote:

> .. in addition to all the gackiness, also notice that we're only doing 8B stores after all of that.
>
> I like the change, but any chance we could fix this issue before committing this change? We should really only be generating a couple of shifts/ors and a shuffle, followed by full 16B stores.

Clearly, the one-line patch was too ambitious. :)

What we're seeing in some of these changes is that we're hitting what I hope is a weird corner case: a slow unaligned SSE store implementation (i.e., before SSE4.2) with a 32-bit OS. On second thought, maybe that's not so weird.

In any case, I will fix the patch to preserve the existing behavior for that case. By loosening the restriction on non-zero memsets only for fast CPUs, we'll avoid the strange codegen and still get the benefits shown in PR27100.
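
As a rough illustration of the codegen requested in the quoted comment (a byte splat built from a couple of shifts/ors plus a shuffle, followed by full 16B stores), here is a minimal C sketch. It is not the LLVM lowering code: the helper name memset_nonzero_sketch is hypothetical, and it assumes the SSE2 intrinsics from <emmintrin.h> when the compiler defines __SSE2__, falling back to plain byte stores otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not an LLVM or libc function):
 * fill n bytes at dst with the byte value c. */
static void memset_nonzero_sketch(void *dst, uint8_t c, size_t n) {
    unsigned char *p = (unsigned char *)dst;

#if defined(__SSE2__)
    /* Splat the byte across 32 bits with two shift/or pairs. */
    uint32_t v = c;
    v |= v << 8;
    v |= v << 16;

    /* Broadcast the 32-bit pattern to all four lanes; compilers
     * typically emit a movd plus a shuffle (pshufd) for this. */
    __m128i pattern = _mm_set1_epi32((int)v);

    /* Full 16-byte unaligned stores for the bulk of the buffer. */
    while (n >= 16) {
        _mm_storeu_si128((__m128i *)p, pattern);
        p += 16;
        n -= 16;
    }
#endif

    /* Tail bytes (or the whole buffer when SSE2 is unavailable). */
    for (; n > 0; --n) {
        *p++ = c;
    }
}
```

Whether the unaligned 16B store is a win depends on the target, which is the point above: on CPUs with slow unaligned SSE stores, the existing narrower stores may still be preferable.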

http://reviews.llvm.org/D18566