[cfe-dev] Inefficient code generation for _mm_test{z, c, nzc} (SSE4.1)

Wed Apr 11 02:41:35 PDT 2012

Hi

I've stumbled over a deficiency in clang's codegen for the SSE4.1 _mm_test* intrinsics. These intrinsics are supposed to map to the PTEST instruction, which sets ZF (zero flag) if the bitwise AND of its two SSE register operands is all zero, and CF (carry flag) if the bitwise AND of the second operand with the complement of the first (ANDNOT) is all zero. The construct

  if (_mm_test{z,c,nzc}_si128(v, m))
    …

should thus produce a PTEST instruction followed by a single branch instruction (JZ for _mm_testz_si128, JC for _mm_testc_si128, JNBE for _mm_testnzc_si128). Clang, however, instead produces something like

  PTEST …
  SETE %al
  MOVZBL %al, %eax
  TEST %eax, %eax
  JNE ...
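
For reference, here is how I read the PTEST flag semantics, written as a scalar C model. This is only a sketch: vec128, ptest_flags and the *_model functions are made-up names for illustration, and a real __m128i is stood in for by two 64-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit SSE register: two 64-bit halves. */
typedef struct { uint64_t lo, hi; } vec128;

/* The two flags PTEST writes. */
typedef struct { int zf, cf; } ptest_flags;

/* Scalar model of PTEST, using the _mm_test*_si128(v, m) argument order. */
ptest_flags ptest_model(vec128 v, vec128 m)
{
    ptest_flags f;
    f.zf = ((v.lo & m.lo) | (v.hi & m.hi)) == 0;   /* (v AND m) is all zero */
    f.cf = ((~v.lo & m.lo) | (~v.hi & m.hi)) == 0; /* (NOT v AND m) is all zero */
    return f;
}

/* What each intrinsic returns, expressed in terms of the flags. */
int testz_model(vec128 v, vec128 m)   { return ptest_model(v, m).zf; }
int testc_model(vec128 v, vec128 m)   { return ptest_model(v, m).cf; }
int testnzc_model(vec128 v, vec128 m)
{
    ptest_flags f = ptest_model(v, m);
    return !f.zf && !f.cf;                         /* the JNBE condition */
}

/* Small usage example: disjoint bits give ZF=1, CF=0. */
void ptest_model_example(void)
{
    vec128 v = { 0x0F, 0 }, m = { 0xF0, 0 };
    assert(testz_model(v, m) == 1);
    assert(testc_model(v, m) == 0);
}
```

Seen this way, each intrinsic's result is already sitting in the flags after PTEST, so a branch on ZF and/or CF is all that should be needed; the SETE/MOVZBL/TEST sequence only materializes the flag as an int and then tests it again.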

Also, the LLVM IR looks a tad strange. For

  if (_mm_testz_si128(v,v))
    body();

Clang generates

  %2 = tail call i32 @llvm.x86.sse41.ptestz(<4 x float> %1, <4 x float> %1) nounwind
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %5, label %4
; <label>:4                                       ; preds = %0
  tail call void (...)* @body() nounwind
  br label %5
; <label>:5                                       ; preds = %4, %0
  ret void

Since _mm_testz_si128 takes __m128i (the integer SSE type), *not* __m128 (the single-precision float SSE type), it seems strange that the corresponding LLVM intrinsic takes parameters of type <4 x float>.

I'm not sure whether fixing this involves changing Clang or LLVM (or both?), which is why I'm posting here first instead of filing a bug report.

Funnily enough, GCC 4.2 (at least the OSX version) has the same problem. Later GCC versions get it right, though.

best regards,
Florian Pflug