[PATCH] D13776: [x86] Fix wrong prototypes for AVX mask load/store intrinsics.

Andrea Di Biagio via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 16 04:56:11 PDT 2015


andreadb added a comment.

In http://reviews.llvm.org/D13776#268072, @bruno wrote:

> Hi Andrea,
>
> I don't recall off hand but my guess here is that the mask is a vector float type because VMASKMOV* uses two FP ports. In practice I'm not sure what the side effects of changing this are, but it might be something worth considering.


Hi Bruno,

You are right, VMASKMOV* is definitely in the floating-point domain. Depending on the subtarget, and on whether the VMASKMOV is a load or a store, it may use one or more ports.

That said, in practice the only side effect of changing those intrinsics is that we end up with an extra bitcast when the input mask is a vector of packed floats. However, that bitcast would be a bitconvert between types of the same register class, so it would be a no-op and no extra instructions would be generated (tested on small examples using both constant and non-constant mask values).
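To illustrate, here is a minimal C-level sketch (the function and variable names are mine, not from the patch): if the mask was computed in the float domain, the integer-mask prototype requires an explicit cast, and that cast lowers to exactly the no-op bitcast described above.

  #include <immintrin.h>

  /* Illustrative only: a mask produced in the float domain must be
     cast to __m128i before the call. _mm_castps_si128 is a pure
     bitcast and emits no instruction. */
  void store_masked(float *dst, __m128 val, __m128 fmask) {
    _mm_maskstore_ps(dst, _mm_castps_si128(fmask), val);
  }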

Since VMASKMOV is in the floating-point domain, AMD chips could suffer a stall if the mask value originated in the integer domain: a domain crossing caused by data movement (for example, VInt -> VFP) incurs a 1-cycle penalty on AMD chips.
However, the domain-crossing issue is not a problem introduced by this change; it already affected the previous intrinsic definition (i.e. the source of this "problem" has nothing to do with this change).
The backend already knows how to mitigate it by running the 'execution dependency fix' pass.

Example:

  define void @foo(<4 x float>* %dst, <4 x float> %InVec, <4 x i32> %Mask1, <4 x i32> %Mask2) {
    %xor = xor <4 x i32> %Mask1, %Mask2
    %0 = bitcast <4 x float>* %dst to i8*
    tail call void @llvm.x86.avx.maskstore.ps(i8* %0, <4 x i32> %xor, <4 x float> %InVec)
    ret void
  }

  ; declaration matching the patched (integer-mask) signature
  declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>)

In this example, ISel would select a VPXORrr for the 'xor' operation.
Before code emission, we end up with the sequence:

  %XMM1<def> = VPXORrr %XMM1<kill>, %XMM2<kill>
  VMASKMOVPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM1<kill>, %XMM0<kill>

After 'exedep-fix' we have:

  %XMM1<def> = VXORPSrr %XMM1<kill>, %XMM2<kill>
  VMASKMOVPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM1<kill>, %XMM0<kill>

As a developer, it is more natural to think of the mask as a vector of integers rather than floats. For example, we may want to manipulate the mask with logical operators before passing it to the (V)MASKMOV* instructions. Having to pass it as a float forces us to insert an explicit (no-op) cast.
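A minimal sketch of that workflow (the helper name and in-range check are illustrative assumptions, not from the patch): the integer compares and logic ops all yield __m128i, which the integer-mask prototype accepts directly, whereas the float-mask prototype would force a _mm_castsi128_ps on the result first.

  #include <immintrin.h>

  /* Illustrative only: build a "lo < x < hi" mask with integer
     compares and a logical AND, then feed it to the masked store. */
  void store_in_range(float *dst, __m128 val, __m128i x,
                      __m128i lo, __m128i hi) {
    __m128i gt = _mm_cmpgt_epi32(x, lo);  /* lanes where x > lo */
    __m128i lt = _mm_cmpgt_epi32(hi, x);  /* lanes where x < hi */
    _mm_maskstore_ps(dst, _mm_and_si128(gt, lt), val);
  }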

That said, I don't know how much we care about being consistent with GCC, which defines those builtins differently from us (i.e. always with vector int/long mask values). This "problem" was found internally when testing the codegen of intrinsic calls; we spotted the discrepancy, and that is the main reason why I uploaded this patch.
At the very least, I suggest that we fix our intrinsic definitions in avxintrin.h just to be consistent with what the Intel documentation says.
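For reference, these are the 'ps' variants as the Intel documentation defines them (the 'pd' variants are analogous and also take an integer mask):

  __m128  _mm_maskload_ps(float const *mem_addr, __m128i mask);
  void    _mm_maskstore_ps(float *mem_addr, __m128i mask, __m128 a);
  __m256  _mm256_maskload_ps(float const *mem_addr, __m256i mask);
  void    _mm256_maskstore_ps(float *mem_addr, __m256i mask, __m256 a);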

What do you think?


http://reviews.llvm.org/D13776




