[PATCH] [InstCombine][X86] Improved folding of calls to Intrinsic::x86_sse4a_insertqi.

Thu Dec 11 05:53:37 PST 2014

Hi David, Quentin,

I have just uploaded a new version of the patch here:
http://reviews.llvm.org/D6583

Unfortunately Phabricator didn't send an update email on the mailing
list. So, I am attaching the raw diff on this email as well.

Please let me know if ok to submit.

Thanks again for your time.
Andrea


On Wed, Dec 10, 2014 at 12:06 PM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi David and Quentin,
>
> thanks for the reviews.
> For some reasons I am currently unable to log on Phabricator. Please
> find my answer to your questions below.
>
> On Wed, Dec 10, 2014 at 2:08 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>> ================
>> Comment at: lib/Transforms/InstCombine/InstCombineCalls.cpp:784
>> @@ +783,3 @@
>> +        // results are undefined.
>> +        if ((Index + Length) > 64)
>> +          return ReplaceInstUsesWith(CI, UndefValue::get(II->getType()));
>> ----------------
>> majnemer wrote:
>>> Is there anything that prevents `Index + Length` from wrapping around? If not, what happens?
>> I think there is nothing that would prevent that, but in that case we fall into this condition:
>> "If the sum of the bit index + length field is greater than 64, the results are undefined.”
>>
>> So I’d say nothing to be done here but maybe I misunderstood something :).
>>
> Hi David,
>
> My understanding is that x86 builtin '__builtin_ia32_insertqi' forces
> 'Index' and 'Length' to be 8-bit quantities.
> In Clang we have (BuiltinsX86.def)  BUILTIN(__builtin_ia32_insertqi,
> "V2LLiV2LLiV2LLiIcIc");
> Clang definition of intrinsic _mm_inserti_si64 (in ammintrin.h)
> explicitly casts 'len' and 'idx' to 'char' before calling
> __builtin_ia32_insertqi.
> GCC also forces __builtin_ia32_insertqi to use 8-bit immediate values
> for 'len' and 'idx'.
>
> If for example I write:
>
>   __m128i foo(__m128i A, __m128i B) {
>     return __builtin_ia32_insertqi((__v2di) A, (__v2di) B, 255, 255);
>   }
>
> with this patch, the call to builtin __builtin_ia32_insertqi gets
> folded to Undef. That's because of the rule that Quentin mentioned in
> his last post.
>
> If I instead write:
>
>   __m128i foo(__m128i A, __m128i B) {
>     return __builtin_ia32_insertqi((__v2di) A, (__v2di) B, 255, 256);
>   }
>
> The 256 number would be truncated to 0 (Clang would also raise a
> warning for this "int to char" conversion).
> gcc would raise instead an error: "the last argument must be an 8-bit
> immediate".
>
> Back to my code: it is true that I don't check if `Index + Length` may
> wrap. However, CIStart and CIWidth are expected to be 8-bit
> quantities. So, unless I forgot something important, the wrapping
> shouldn't ever occur. Both Index and Length are obtained from
> zero-extending CIStart and Length (two 8-bit quantities) to
> 'unsigned'.
>
> If you want, I can add an assert to check that both Index and Length
> are never bigger than 255.
>
> @Quentin
> I will add a "Per AMD64 manual" in the comment.
>
> Thanks!
> Andrea
>
>> http://reviews.llvm.org/D6583
>>
>>
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
-------------- next part --------------
Index: lib/Transforms/InstCombine/InstCombineCalls.cpp
===================================================================

--- lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -733,7 +733,22 @@
     // TODO: eventually we should lower this intrinsic to IR
     if (auto CIWidth = dyn_cast<ConstantInt>(II->getArgOperand(2))) {
       if (auto CIStart = dyn_cast<ConstantInt>(II->getArgOperand(3))) {
-        if (CIWidth->equalsInt(64) && CIStart->isZero()) {
+        unsigned Index = CIStart->getZExtValue();
+        // From AMD documentation: "a value of zero in the field length is
+        // defined as length of 64".
+        unsigned Length = CIWidth->equalsInt(0) ? 64 : CIWidth->getZExtValue();
+
+        // From AMD documentation: "If the sum of the bit index + length field
+        // is greater than 64, the results are undefined".
+
+        // Note that both field index and field length are 8-bit quantities.
+        // Since variables 'Index' and 'Length' are unsigned values
+        // obtained from zero-extending field index and field length
+        // respectively, their sum should never wrap around.
+        if ((Index + Length) > 64)
+          return ReplaceInstUsesWith(CI, UndefValue::get(II->getType()));
+
+        if (Length == 64 && Index == 0) {
           Value *Vec = II->getArgOperand(1);
           Value *Undef = UndefValue::get(Vec->getType());
           const uint32_t Mask[] = { 0, 2 };
Index: test/Transforms/InstCombine/vec_demanded_elts.ll
===================================================================
--- test/Transforms/InstCombine/vec_demanded_elts.ll
+++ test/Transforms/InstCombine/vec_demanded_elts.ll
@@ -303,6 +303,33 @@
   ret <2 x i64> %2
 }
 
+; CHECK: define <2 x i64> @testZeroLength(<2 x i64> %v, <2 x i64> %i)
+define <2 x i64> @testZeroLength(<2 x i64> %v, <2 x i64> %i) {
+; CHECK: ret <2 x i64> %i
+  %1 = tail call <2 x i64> @llvm.x86.sse4a.insertqi(<2 x i64> %v, <2 x i64> %i, i8 0, i8 0)
+  ret <2 x i64> %1
+}
+
+; CHECK: define <2 x i64> @testUndefinedInsertq_1(<2 x i64> %v, <2 x i64> %i)
+define <2 x i64> @testUndefinedInsertq_1(<2 x i64> %v, <2 x i64> %i) {
+; CHECK: ret <2 x i64> undef
+  %1 = tail call <2 x i64> @llvm.x86.sse4a.insertqi(<2 x i64> %v, <2 x i64> %i, i8 0, i8 16)
+  ret <2 x i64> %1
+}
+
+; CHECK: define <2 x i64> @testUndefinedInsertq_2(<2 x i64> %v, <2 x i64> %i)
+define <2 x i64> @testUndefinedInsertq_2(<2 x i64> %v, <2 x i64> %i) {
+; CHECK: ret <2 x i64> undef
+  %1 = tail call <2 x i64> @llvm.x86.sse4a.insertqi(<2 x i64> %v, <2 x i64> %i, i8 48, i8 32)
+  ret <2 x i64> %1
+}
+
+; CHECK: define <2 x i64> @testUndefinedInsertq_3(<2 x i64> %v, <2 x i64> %i)
+define <2 x i64> @testUndefinedInsertq_3(<2 x i64> %v, <2 x i64> %i) {
+; CHECK: ret <2 x i64> undef
+  %1 = tail call <2 x i64> @llvm.x86.sse4a.insertqi(<2 x i64> %v, <2 x i64> %i, i8 64, i8 16)
+  ret <2 x i64> %1
+}
 
 ; CHECK: declare <2 x i64> @llvm.x86.sse4a.insertqi
 declare <2 x i64> @llvm.x86.sse4a.insertqi(<2 x i64>, <2 x i64>, i8, i8) nounwind