[PATCH] D70223: [DAGCombine] Split vector load-update-store into single element stores

Nemanja Ivanovic via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Fri Nov 22 06:14:58 PST 2019


nemanjai requested changes to this revision.
nemanjai added a comment.
This revision now requires changes to proceed.

This changeset is a rather complicated and hard-to-follow piece of code. It appears to solve the problem in the motivating test case, but it is unclear whether the transformation is a good idea or a bad idea overall.
There are a number of things that need to be clarified for this patch to proceed:

1. Needs empirical performance data. This is straightforward - see if it affects the performance of any benchmarks.
2. Needs more thorough testing. We need to cover more types, more ways we may end up with these "merge-and-store" idioms, different numbers of elements changed, etc.
3. An overview of the algorithm should be provided. The code as written is not easy to follow, so an outline would greatly aid readability.
4. I still think we should do this in `InstCombine` rather than on the SDAG. It seems like that would be a more natural place to do this.
5. If we **are** doing it in `InstCombine`, why not simply produce a masked store? If we converted the IR for the test case in the description to this:

  define dso_local void @foo(<4 x float>* nocapture %a, float %b) local_unnamed_addr #0 {
  entry:
    call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> <float 1.000000e+00, float undef, float undef, float 2.000000e+00>, <4 x float>* %a, i32 1, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
    ret void
  }
  declare void @llvm.masked.store.v4f32.p0v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)

We will get the desired codegen:

  lis 4, 16256
  lis 5, 16384
  stw 4, 0(3)
  stw 5, 12(3)

And there is target-independent handling for masked stores, so it is not at all clear to me why we'd go through the trouble of implementing this complex handling in the SDAG.

@spatel I know you were initially against doing this in `InstCombine`, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, `lib/CodeGen/ScalarizeMaskedMemIntrin.cpp` should handle this quite well. And on the subject of the cost model, I don't really think we need a target-specific cost model for this - simply weigh the count of `load/store/insertelement` operations we save with the masked intrinsic against the likely number of stores the masked intrinsic will expand to.
For the attached example, we are getting rid of a `load` and two `insertelement` instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a `load` and three `insertelement` instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.



================
Comment at: llvm/include/llvm/CodeGen/TargetLowering.h:3413
 
+  /// If it's profitable to split vector store into individual stores.
+  virtual bool isCheapToSplitStore(StoreSDNode *N, unsigned NumSplit,
----------------
This should be more descriptive. Perhaps:
```
/// Determine whether it is profitable to split a single vector store
/// into \p NumSplit scalar stores.
```

Furthermore, I don't think this query is useful as implemented. For most targets, it is almost guaranteed that this should return `false` when `NumSplit > 2`, and quite likely even splitting into `NumSplit == 2` pieces is not cheaper than a single vector store.

The problem is that there is no context to determine what we would be saving by splitting the store. If we have some sequence of operations that produces a vector and we then need to store that vector, either with a single vector store or split into `NumSplit` pieces, the answer is clearly: don't split it (one store is better than many).


================
Comment at: llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:6616
 
+/// getVectorUpdateChain - Detect a load-insert/shuffle-store chain. Returns
+/// head LOAD if we met such pattern. Otherwise, this function returns null
----------------
Nit: name here does not match the function name.


================
Comment at: llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:6756
+  SmallVector<std::pair<SDValue, unsigned>, 16> Buf;
+  SmallVector<std::pair<SDValue, int>, 16> States;
+
----------------
For readability, move these declarations past the early exits.


================
Comment at: llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:6762
+
+  if (!N->isSimple() || VecType.isScalableVector() || !VecType.isVector())
+    return SDValue();
----------------
shchenz wrote:
> Can we handle indexed store too? I guess you also need to exclude that kind of store. 
+1


================
Comment at: llvm/lib/Target/PowerPC/PPCISelLowering.cpp:1609
+
+  return true;
+}
----------------
Really? So it is cheaper to split a `v16i8` vector into 15 pieces than it is to store it with a single store? I really doubt that.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70223/new/

https://reviews.llvm.org/D70223




