[PATCH] [InstCombine] Combine adjacent i8 loads.

Arnold Schwaighofer aschwaighofer at apple.com
Fri May 2 09:32:05 PDT 2014


There are four parts to the problem of widening chains when viewed from the SLP vectorizer's perspective (staying in vector types); an IR sketch of the pattern in question follows the list:

* Recognizing adjacent memory operations: this is obviously similar to what the SLP vectorizer already does
* Widening operations: we would widen only the load in this example
* Building chains: there is no real chain of widened operations here: only the load is widened. One could imagine examples where we perform operations on the loaded i8 type before we build the i64.
* Finding starting points of chains: this would mean recognizing the reduction into the wider type in this example.
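For concreteness, here is a hand-written two-byte sketch of such a chain (hypothetical IR in the typed-pointer syntax of the day, not the test case from the patch): adjacent i8 loads extended, shifted, and or'ed into a wider integer. The zext/shl/or tail is the "reduction", and the loads are where widening would happen:

  define i64 @combine8(i8* %p) {
    %p1 = getelementptr i8* %p, i64 1   ; adjacent byte addresses
    %b0 = load i8* %p
    %b1 = load i8* %p1
    %e0 = zext i8 %b0 to i64
    %e1 = zext i8 %b1 to i64
    %s1 = shl i64 %e1, 8
    %r  = or i64 %s1, %e0               ; reduction into the wide i64
    ret i64 %r
  }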

If we were to model the example given in the patch in the SLP vectorizer, I think we would recognize (I changed the example to i32 so that there is less to type) that
 (I64 OR (I64 << (SEXT (I32 ...) to I64), 0)
         (I64 << (SEXT (I32 ...) to I64), 32))

is a reduction into an I64 value, which we can model as <2 x i32>:

(I64 CAST (<2 x i32> SHUFFLE (...)))

And then start a chain from the <2 x i32> (...) root, which in this case would only be a <2 x i32> load (or, in the real example, an <8 x i8> load).
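Spelled out as (again hypothetical) IR, that model might look like the following; the shuffle mask encodes which lane feeds which shift amount, and the identity mask here assumes little-endian lane order:

  define i64 @model(<2 x i32>* %p) {
    %v  = load <2 x i32>* %p            ; the widened <2 x i32> load
    %sh = shufflevector <2 x i32> %v, <2 x i32> undef, <2 x i32> <i32 0, i32 1>
    %r  = bitcast <2 x i32> %sh to i64  ; stands in for the OR-of-shifts
    ret i64 %r
  }

A swapped lane order (shifts 32 and 0 instead of 0 and 32) would simply change the mask to <i32 1, i32 0>.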

Do we expect that there would be longer chains that would benefit from widening and that start off such a pattern? I.e., do we expect to be able to do some isomorphic operations in the smaller type (i8, or i32 in my example) before the reduction, as in the sketch below? If the answer is yes, I think it makes sense to think about doing this in the SLP vectorizer.
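To illustrate what such a chain could look like (a made-up example with hypothetical names): two isomorphic adds in the narrow type feeding the reduction, which would become a pair of <2 x i32> loads and one <2 x i32> add in front of the shuffle:

  define i64 @narrow_ops(i32* %p, i32* %q) {
    %p1 = getelementptr i32* %p, i64 1
    %q1 = getelementptr i32* %q, i64 1
    %a0 = load i32* %p
    %a1 = load i32* %p1
    %b0 = load i32* %q
    %b1 = load i32* %q1
    %s0 = add i32 %a0, %b0              ; isomorphic ops in the narrow type
    %s1 = add i32 %a1, %b1
    %e0 = sext i32 %s0 to i64
    %e1 = sext i32 %s1 to i64
    %h1 = shl i64 %e1, 32
    %r  = or i64 %h1, %e0               ; reduction into i64
    ret i64 %r
  }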

I am not sure that CodeGen deals well with such contortions, though: (I64 CAST (<8 x i8> SHUFFLE (<8 x i8> LOAD))) => (I64 (BSWAP (I64 LOAD)))? That could be fixed. What does our cost model say about such operations?
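In full (and again hypothetical) IR, the question is whether the first function below gets lowered as cheaply as the second:

  define i64 @swapped(i64* %p) {
    %vp = bitcast i64* %p to <8 x i8>*
    %v  = load <8 x i8>* %vp
    %sh = shufflevector <8 x i8> %v, <8 x i8> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
    %r  = bitcast <8 x i8> %sh to i64   ; byte-reversed i64
    ret i64 %r
  }

  ; what we would hope the backend selects, morally:
  define i64 @swapped_ideal(i64* %p) {
    %w = load i64* %p
    %r = call i64 @llvm.bswap.i64(i64 %w)
    ret i64 %r
  }

  declare i64 @llvm.bswap.i64(i64)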

Teaching the SLP vectorizer to widen scalar types is a whole different complexity beast (I am not sure we want to model lanes of smaller types in a large type without using vectors).

I don't think the above transformation (building a value of a bigger type from smaller types) is going to interfere with regular SLP vectorization, because we start bottom-up (sink to source) and we don't have patterns that would start at an "or reduction" (bigger than two operations). If we implement a second transformation that starts from loads and widens operations top-down, then I agree with Andy: we would have to be careful about phase ordering.

If, however, all we want to catch is byte swap, then this feels like a DAG combine to me (with the gotcha, mentioned below, of losing analysis information during lowering). But it seems to me there is potential to catch longer chains leading to the loads.

Doing this at the IR level has the benefit that our memory analysis (BasicAA) is better in the current framework. Inlining can cause us to lose information about aliasing (lost noalias parameters; we should really fix this :), Hal had a patch, but I digress ...).
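A contrived illustration of the noalias point (hypothetical functions): inside @copy1, the noalias attributes let BasicAA prove that the load and the store don't conflict; once the body is inlined into @caller, those attributes are gone and the analysis has to be conservative:

  define void @copy1(i8* noalias %dst, i8* noalias %src) {
    %v = load i8* %src        ; noalias proves this doesn't alias the store
    store i8 %v, i8* %dst
    ret void
  }

  define void @caller(i8* %a, i8* %b) {
    ; after inlining, the copied load/store lose the noalias attributes
    call void @copy1(i8* %a, i8* %b)
    ret void
  }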

http://reviews.llvm.org/D3580





