[llvm] r262269 - [x86, InstCombine] transform x86 AVX masked loads to LLVM intrinsics

Sanjay Patel via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 12 08:53:01 PDT 2016


Yes, I misread the manual. Working on the fix now. Thanks for letting me
know!
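
For reference, here is a rough sketch of the kind of change I have in mind
for the zero-mask special case (untested; the related CHECK lines in
x86-masked-memops.ll will need to be updated to match):

  // Special case a zero mask since that's not a ConstantDataVector.
  // With a zero mask, no elements are loaded and every field of the result
  // is zeroed, so fold to a zero vector instead of undef.
  if (isa<ConstantAggregateZero>(Mask))
    return IC.replaceInstUsesWith(II, Constant::getNullValue(II.getType()));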

On Tue, Apr 12, 2016 at 8:49 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> Hi Sanjay,
>
> We noticed a regression after this commit which seems to be caused by an
> incorrect folding of an x86 masked load intrinsic with a zero mask.
>
> According to the Intel documentation:
> "
> The mask is calculated from the most significant bit of each dword of the
> mask register. If any of the bits of the mask is set to zero, the
> corresponding value from the memory location is not loaded, and the
> corresponding field of the destination vector is set to zero.
> "
>
> I suspect that the problem is caused by the following incorrect check:
>
> +  // Special case a zero mask since that's not a ConstantDataVector.
> +  // This masked load instruction does nothing, so return an undef.
> +  if (isa<ConstantAggregateZero>(Mask))
> +    return IC.replaceInstUsesWith(II, UndefValue::get(II.getType()));
> +
>
> The behavior of an AVX masked load intrinsic with a zero mask is well
> defined. Instead of propagating undef, you should have propagated a zero
> vector.
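>
> For example, with an all-zero mask the intrinsic is documented to behave
> like _mm_setzero_ps(); it never produces an unspecified value. Using the
> mload_zeros example from your commit message:
>
>   #include <immintrin.h>
>
>   __m128 mload_zeros(float *f) {
>     // No element is loaded and every field of the result is zeroed,
>     // so this should be equivalent to _mm_setzero_ps().
>     return _mm_maskload_ps(f, _mm_set1_epi32(0));
>   }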
>
> Can you please have a look at it?
>
> Thanks in advance,
> -Andrea
>
> On Mon, Feb 29, 2016 at 11:16 PM, Sanjay Patel via llvm-commits <llvm-commits at lists.llvm.org> wrote:
>
>> Author: spatel
>> Date: Mon Feb 29 17:16:48 2016
>> New Revision: 262269
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=262269&view=rev
>> Log:
>> [x86, InstCombine] transform x86 AVX masked loads to LLVM intrinsics
>>
>> The intended effect of this patch in conjunction with:
>> http://reviews.llvm.org/rL259392
>> http://reviews.llvm.org/rL260145
>>
>> is that customers using the AVX intrinsics in C will benefit from combines
>> when the load mask is constant:
>>
>> __m128 mload_zeros(float *f) {
>>   return _mm_maskload_ps(f, _mm_set1_epi32(0));
>> }
>>
>> __m128 mload_fakeones(float *f) {
>>   return _mm_maskload_ps(f, _mm_set1_epi32(1));
>> }
>>
>> __m128 mload_ones(float *f) {
>>   return _mm_maskload_ps(f, _mm_set1_epi32(0x80000000));
>> }
>>
>> __m128 mload_oneset(float *f) {
>>   return _mm_maskload_ps(f, _mm_set_epi32(0x80000000, 0, 0, 0));
>> }
>>
>> ...so none of the above will actually generate a masked load for
>> optimized code.
>>
>> This is the masked load counterpart to:
>> http://reviews.llvm.org/rL262064
>>
>>
>> Modified:
>>     llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp
>>     llvm/trunk/test/Transforms/InstCombine/x86-masked-memops.ll
>>
>> Modified: llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp
>> URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp?rev=262269&r1=262268&r2=262269&view=diff
>>
>> ==============================================================================
>> --- llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp (original)
>> +++ llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp Mon Feb 29 17:16:48 2016
>> @@ -831,6 +831,39 @@ static Instruction *simplifyMaskedScatte
>>  // TODO: If the x86 backend knew how to convert a bool vector mask back to an
>>  // XMM register mask efficiently, we could transform all x86 masked intrinsics
>>  // to LLVM masked intrinsics and remove the x86 masked intrinsic defs.
>> +static Instruction *simplifyX86MaskedLoad(IntrinsicInst &II, InstCombiner &IC) {
>> +  Value *Ptr = II.getOperand(0);
>> +  Value *Mask = II.getOperand(1);
>> +
>> +  // Special case a zero mask since that's not a ConstantDataVector.
>> +  // This masked load instruction does nothing, so return an undef.
>> +  if (isa<ConstantAggregateZero>(Mask))
>> +    return IC.replaceInstUsesWith(II, UndefValue::get(II.getType()));
>> +
>> +  auto *ConstMask = dyn_cast<ConstantDataVector>(Mask);
>> +  if (!ConstMask)
>> +    return nullptr;
>> +
>> +  // The mask is constant. Convert this x86 intrinsic to the LLVM intrinsic
>> +  // to allow target-independent optimizations.
>> +
>> +  // First, cast the x86 intrinsic scalar pointer to a vector pointer to match
>> +  // the LLVM intrinsic definition for the pointer argument.
>> +  unsigned AddrSpace = cast<PointerType>(Ptr->getType())->getAddressSpace();
>> +  PointerType *VecPtrTy = PointerType::get(II.getType(), AddrSpace);
>> +  Value *PtrCast = IC.Builder->CreateBitCast(Ptr, VecPtrTy, "castvec");
>> +
>> +  // Second, convert the x86 XMM integer vector mask to a vector of bools based
>> +  // on each element's most significant bit (the sign bit).
>> +  Constant *BoolMask = getNegativeIsTrueBoolVec(ConstMask);
>> +
>> +  CallInst *NewMaskedLoad = IC.Builder->CreateMaskedLoad(PtrCast, 1, BoolMask);
>> +  return IC.replaceInstUsesWith(II, NewMaskedLoad);
>> +}
>> +
>> +// TODO: If the x86 backend knew how to convert a bool vector mask back to an
>> +// XMM register mask efficiently, we could transform all x86 masked intrinsics
>> +// to LLVM masked intrinsics and remove the x86 masked intrinsic defs.
>>  static bool simplifyX86MaskedStore(IntrinsicInst &II, InstCombiner &IC) {
>>    Value *Ptr = II.getOperand(0);
>>    Value *Mask = II.getOperand(1);
>> @@ -854,7 +887,6 @@ static bool simplifyX86MaskedStore(Intri
>>    // the LLVM intrinsic definition for the pointer argument.
>>    unsigned AddrSpace = cast<PointerType>(Ptr->getType())->getAddressSpace();
>>    PointerType *VecPtrTy = PointerType::get(Vec->getType(), AddrSpace);
>> -
>>    Value *PtrCast = IC.Builder->CreateBitCast(Ptr, VecPtrTy, "castvec");
>>
>>    // Second, convert the x86 XMM integer vector mask to a vector of bools based
>> @@ -1630,6 +1662,12 @@ Instruction *InstCombiner::visitCallInst
>>        return replaceInstUsesWith(*II, V);
>>      break;
>>
>> +  case Intrinsic::x86_avx_maskload_ps:
>> +  // TODO: Add the other masked load variants.
>> +    if (Instruction *I = simplifyX86MaskedLoad(*II, *this))
>> +      return I;
>> +    break;
>> +
>>    case Intrinsic::x86_avx_maskstore_ps:
>>    case Intrinsic::x86_avx_maskstore_pd:
>>    case Intrinsic::x86_avx_maskstore_ps_256:
>>
>> Modified: llvm/trunk/test/Transforms/InstCombine/x86-masked-memops.ll
>> URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/InstCombine/x86-masked-memops.ll?rev=262269&r1=262268&r2=262269&view=diff
>>
>> ==============================================================================
>> --- llvm/trunk/test/Transforms/InstCombine/x86-masked-memops.ll (original)
>> +++ llvm/trunk/test/Transforms/InstCombine/x86-masked-memops.ll Mon Feb 29 17:16:48 2016
>> @@ -1,5 +1,64 @@
>>  ; RUN: opt < %s -instcombine -S | FileCheck %s
>>
>> +;; MASKED LOADS
>> +
>> +; If the mask isn't constant, do nothing.
>> +
>> +define <4 x float> @mload(i8* %f, <4 x i32> %mask) {
>> +  %ld = tail call <4 x float> @llvm.x86.avx.maskload.ps(i8* %f, <4 x i32> %mask)
>> +  ret <4 x float> %ld
>> +
>> +; CHECK-LABEL: @mload(
>> +; CHECK-NEXT:  %ld = tail call <4 x float> @llvm.x86.avx.maskload.ps(i8* %f, <4 x i32> %mask)
>> +; CHECK-NEXT:  ret <4 x float> %ld
>> +}
>> +
>> +; Zero mask is a nop.
>> +
>> +define <4 x float> @mload_zeros(i8* %f) {
>> +  %ld = tail call <4 x float> @llvm.x86.avx.maskload.ps(i8* %f, <4 x i32> zeroinitializer)
>> +  ret <4 x float> %ld
>> +
>> +; CHECK-LABEL: @mload_zeros(
>> +; CHECK-NEXT:  ret <4 x float> undef
>> +}
>> +
>> +; Only the sign bit matters.
>> +
>> +define <4 x float> @mload_fake_ones(i8* %f) {
>> +  %ld = tail call <4 x float> @llvm.x86.avx.maskload.ps(i8* %f, <4 x i32> <i32 1, i32 2, i32 3, i32 2147483647>)
>> +  ret <4 x float> %ld
>> +
>> +; CHECK-LABEL: @mload_fake_ones(
>> +; CHECK-NEXT:  ret <4 x float> undef
>> +}
>> +
>> +; All mask bits are set, so this is just a vector load.
>> +
>> +define <4 x float> @mload_real_ones(i8* %f) {
>> +  %ld = tail call <4 x float> @llvm.x86.avx.maskload.ps(i8* %f, <4 x i32> <i32 -1, i32 -2, i32 -3, i32 2147483648>)
>> +  ret <4 x float> %ld
>> +
>> +; CHECK-LABEL: @mload_real_ones(
>> +; CHECK-NEXT:  %castvec = bitcast i8* %f to <4 x float>*
>> +; CHECK-NEXT:  %unmaskedload = load <4 x float>, <4 x float>* %castvec
>> +; CHECK-NEXT:  ret <4 x float> %unmaskedload
>> +}
>> +
>> +; It's a constant mask, so convert to an LLVM intrinsic. The backend
>> should optimize further.
>> +
>> +define <4 x float> @mload_one_one(i8* %f) {
>> +  %ld = tail call <4 x float> @llvm.x86.avx.maskload.ps(i8* %f, <4 x i32> <i32 0, i32 0, i32 0, i32 -1>)
>> +  ret <4 x float> %ld
>> +
>> +; CHECK-LABEL: @mload_one_one(
>> +; CHECK-NEXT:  %castvec = bitcast i8* %f to <4 x float>*
>> +; CHECK-NEXT:  %1 = call <4 x float> @llvm.masked.load.v4f32(<4 x float>* %castvec, i32 1, <4 x i1> <i1 false, i1 false, i1 false, i1 true>, <4 x float> undef)
>> +; CHECK-NEXT:  ret <4 x float> %1
>> +}
>> +
>> +;; MASKED STORES
>> +
>>  ; If the mask isn't constant, do nothing.
>>
>>  define void @mstore(i8* %f, <4 x i32> %mask, <4 x float> %v) {
>> @@ -131,6 +190,10 @@ define void @mstore_v4i64(i8* %f, <4 x i
>>  ; CHECK-NEXT:  ret void
>>  }
>>
>> +declare <4 x float> @llvm.x86.avx.maskload.ps(i8*, <4 x i32>)
>> +declare <2 x double> @llvm.x86.avx.maskload.pd(i8*, <2 x i64>)
>> +declare <8 x float> @llvm.x86.avx.maskload.ps.256(i8*, <8 x i32>)
>> +declare <4 x double> @llvm.x86.avx.maskload.pd.256(i8*, <4 x i64>)
>>
>>  declare void @llvm.x86.avx.maskstore.ps(i8*, <4 x i32>, <4 x float>)
>>  declare void @llvm.x86.avx.maskstore.pd(i8*, <2 x i64>, <2 x double>)
>>
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>
>
>