[PATCH] D22840: Split the store of an int value merged from a pair of smaller values into multiple stores.

Thu Aug 25 20:44:55 PDT 2016

chandlerc added a comment.

Sorry, I had lost track of this, but I can help finish up the review here.

In https://reviews.llvm.org/D22840#507217, @wmi wrote:

> Some more changes:
>
> 1. add x86_64-unknown-linux-gnu triple for test.
> 2. I found it was too limited for the optimization to say the value of store can only had one use so I removed it. It is possible that it has more than one use but the other uses are in cold blocks. After store splitting and machineSinking, those bitwise instructions will be moved to colder places.  However, existing ISel cannot look beyond BB boundary so it is impossible to check other uses. Can we blindly do this (splitting an int64 store to two on x86 may not be very bad for performance)? I really hope we have globalISel here, or is there existing better solution for it? I have done some internal testing for it. With the extension of more than one uses of store value, I see 3% improvement for one internal benchmark (with https://reviews.llvm.org/D23210) and no regressions.


I agree that the one-use test is too restrictive.

I think for x86, doing this blindly when it is a float and an int is a clear and unambiguous win.

I think it is much less clear for pairs of integers... Does restricting this to only cases with a mixture of floating point and integers still capture all of the improvements? If so, I'd start there.

To implement this, I'd actually detect the two elements merged into a larger integer, and pass those two values to the predicate routine so that targets can inspect them as part of determining whether split stores would be better. Then you can have the x86 target check for a mixture of FP and int types to trigger split stores.

Even if we end up wanting more cases to be split for x86, I think this is probably a good way to structure the predicate so that targets can make more detailed decisions about this kind of thing.


================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:12159-12162
@@ -12157,1 +12158,6 @@
 
+  if (TLI.isMultiStoresCheaperThanBitsMerge()) {
+    if (SDValue NewSt = splitMergedValStore(ST))
+      return NewSt;
+  }
+
----------------
No need for braces around the outer if here, although it probably won't matter based on the refactoring I suggested...

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:12173-12184
@@ +12172,14 @@
+/// Instruction sequence of i64 Store:
+///         t1: i32 = bitcast float_tmp
+///       t2: i64 = zero_extend t1
+///         t3: i64 = zero_extend i32_tmp
+///       t4: i64 = shl t3, Constant:i8<32>
+///     t5: i64 = or i64 t2, t4
+///   t6: ch = store<ST8[addr]> t0, t5, FrameIndex:i64<0>, undef:i64
+///
+/// Instruction sequence of splitted i32 stores:
+///     t1: i32 = bitcast float_tmp
+///   t2: ch = store<ST4[addr](align=8)> t0, t1, FrameIndex:i64<0>, undef:i64
+///     t3: i64 = add FrameIndex:i64<0>, Constant:i64<4>
+///   t4: ch = store<ST4[addr+4]> t0, i32_tmp, t3, undef:i64
+///
----------------
You can write DAG examples using a more brief form that is pretty common:

  /// (store (or (zext (bitcast F to i32) to i64),
  ///            (shl (zext I to i64), 32)), addr)

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:12217-12221
@@ +12216,7 @@
+  SDValue Shl, Lo, Hi;
+  if (Op1.getOpcode() == ISD::SHL) {
+    Shl = Op1;
+    Lo = Op2;
+    Hi = Shl.getOperand(0);
+  } else if (Op2.getOpcode() == ISD::SHL) {
+    Shl = Op2;
----------------
A common slightly shorter idiom than this that we use is to test if Op1 is *not* SHL, and then swap, test again and bail, and then you know Op1 is the SHL and you can directly extract the Hi value.

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:12247
@@ +12246,3 @@
+    return SDValue();
+
+  // Start to split store.
----------------
FWIW, here is where I would do the target query, passing in the inputs to the zero extends.

Then in the x86 implementation of the target query, I would look for a bitcast from a float, or whatever we end up with for the heuristic.


Repository:
  rL LLVM

https://reviews.llvm.org/D22840