[PATCH] D22840: Split the store of an int value merged from a pair of smaller values into multiple stores.

Fri Aug 26 22:36:28 PDT 2016

wmi added inline comments.

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:12208
@@ +12207,3 @@
+/// are bundled together as an i64 value before being stored into memory.
+/// Quite often it is more efficent to generate separate stores for F and I,
+/// which can remove the bitwise instructions or sink them to colder places.
----------------
chandlerc wrote:
> I would just say "Sometimes" here to avoid people defaulting a new target into splitting the stores without measuring.
Fixed.

================
Comment at: lib/Target/X86/X86ISelLowering.h:769-771
@@ +768,5 @@
+                                           SDValue Hi) const override {
+      // If the pair to store is a mixture of float and int values, we will
+      // save two bitwise instructions and one float-to-int instruction and
+      // increase one store instruction. It is more likely a win.
+      if (Lo.getOpcode() == ISD::BITCAST || Hi.getOpcode() == ISD::BITCAST) {
----------------
chandlerc wrote:
> It's probably also worth noting that beyond the instruction count difference, there is potentially a more significant benefit because it avoids the float->int domain switch for input value. I have a suspicion that is a significant part of the savings here.
Comment added

================
Comment at: lib/Target/X86/X86ISelLowering.h:778-780
@@ +777,5 @@
+      }
+      // If the pair only contains int values, we will save two bitwise
+      // instructions and increase one store instruction. We leave the case
+      // out for now and wait until we find a case showing it is beneficial.
+      return false;
----------------
chandlerc wrote:
> And here it is probably worth mentioning the other upside of only doing a single memory operation in addition to having minimal instruction count overhead.
I am not sure here. The added memory operation is a store. Store usually will not stall the pipeline because there is store buffer, so I feel the cost of the extra store will be low? 

================
Comment at: test/Transforms/InstCombine/split-store.ll:16-31
@@ +15,18 @@
+; CHECK: leaq (%rsp), %rdi
+define void @int32_float_pair(i32 %tmp1, float %tmp2) local_unnamed_addr {
+entry:
+  %ref.tmp = alloca i64, align 8
+  %tmpcast = bitcast i64* %ref.tmp to %"pair1"*
+  %t0 = bitcast i64* %ref.tmp to i8*
+  call void @llvm.lifetime.start(i64 8, i8* %t0)
+  %t1 = bitcast float %tmp2 to i32
+  %retval.sroa.2.0.insert.ext.i = zext i32 %t1 to i64
+  %retval.sroa.2.0.insert.shift.i = shl nuw i64 %retval.sroa.2.0.insert.ext.i, 32
+  %retval.sroa.0.0.insert.ext.i = zext i32 %tmp1 to i64
+  %retval.sroa.0.0.insert.insert.i = or i64 %retval.sroa.2.0.insert.shift.i, %retval.sroa.0.0.insert.ext.i
+  store i64 %retval.sroa.0.0.insert.insert.i, i64* %ref.tmp, align 8
+  call void @goo1(%"pair1"* dereferenceable(8) %tmpcast)
+  call void @llvm.lifetime.end(i64 8, i8* %t0)
+  ret void
+}
+
----------------
chandlerc wrote:
> It would be good to minimize this test case rather than keeping it so close to what happens to come out of Clang for a particular C++ input.
> 
> I'd just write direct tests in LLVM IR. Something like:
> 
>   define void @test1(i32 %i, float %f, i64* %ptr) {
>   entry:
>     ... = zext ...
>     ... = bitcast ...
>     ... = zext ...
>     ... = shl ...
>     ... = or ...
>     store ..., i64* %ptr
>     ret void
>   }
> 
> 
That makes the test much shorter! 

Repository:
  rL LLVM

https://reviews.llvm.org/D22840