r205436 - Extend the SSE2 comment lexing to AVX2. Only 16byte align when not on AVX2.
Roman Divacky
rdivacky at freebsd.org
Thu Apr 3 10:38:20 PDT 2014
On Thu, Apr 03, 2014 at 10:13:15AM +0100, Jay Foad wrote:
> Hi Roman,
>
> On 2 April 2014 18:27, Roman Divacky <rdivacky at freebsd.org> wrote:
> > #ifdef __SSE2__
> > - __m128i Slashes = _mm_set1_epi8('/');
> > - while (CurPtr+16 <= BufferEnd) {
> > - int cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(*(const __m128i*)CurPtr,
> > - Slashes));
> > +#define VECTOR_TYPE __m128i
> > +#define SET1_EPI8(v) _mm_set1_epi8(v)
> > +#define CMPEQ_EPI8(v1,v2) _mm_cmpeq_epi8(v1,v2)
> > +#define MOVEMASK_EPI8(v) _mm_movemask_epi8(v)
> > +#define STEP 16
> > +#elif __AVX2__
> > +#define VECTOR_TYPE __m256i
> > +#define SET1_EPI8(v) _mm256_set1_epi8(v)
> > +#define CMPEQ_EPI8(v1,v2) _mm256_cmpeq_epi8(v1,v2)
> > +#define MOVEMASK_EPI8(v) _mm256_movemask_epi8(v)
> > +#define STEP 32
> > +#endif
>
> Surely any machine with AVX2 also has SSE2, and if both are defined
> then your code will prefer to use the SSE2 intrinsics. This doesn't
> seem right. Am I missing something?
You're absolutely right. I fixed that, rebenchmarked, and now there's
no difference at all. I have no explanation for the earlier 3% speedup
(though it was measured at 95% significance over 10 samples).
I'll revert my commit. Sorry for the noise.
Roman
More information about the cfe-commits mailing list