[PATCH] D67645: [aarch64] add def-pats for dot product

Wed Sep 18 07:49:18 PDT 2019

sebpop added a comment.

To catch more dot product cases, we need to fix the passes above instruction selection.

I looked at the basic dot product loop:

  int dot_product1(char *a, char *b, int sum) {
    for (int i = 0; i < 16 * K; i += 1)
      sum += a[i] * b[i];
    return sum;
  }

for different values of K:

- for K = 1, we do generate a dot instruction
- for K = 2 and K = 3:
  - the loop is unrolled
  - SLP vectorizes the straight-line code with a vectorization factor of 32
  - type legalization kicks in and destroys the pattern
  - we end up generating very poor code
- for K >= 4: no unrolling, no SLP, and no loop vectorization, so we get scalar byte loop code.

It looks like if we want to catch more dot product patterns, we'll need to fix the SLP and loop vectorizers.
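For reference, here is a scalar sketch of the shape the vectorizers would have to hand to instruction selection for the pattern to match: a 128-bit dot instruction accumulates, into each of four 32-bit lanes, the sum of four adjacent byte products. The names sdot_model and dot_product_model are made up for illustration, and the sketch uses int8_t (the signed form) where the loop above uses plain char, whose signedness depends on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 128-bit signed dot product step: each of the four
   32-bit accumulator lanes gains the sum of four adjacent byte products. */
static void sdot_model(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
  for (int lane = 0; lane < 4; lane++)
    for (int j = 0; j < 4; j++)
      acc[lane] += (int32_t)a[4 * lane + j] * (int32_t)b[4 * lane + j];
}

/* Dot product of n bytes (n a multiple of 16) built from the model above.
   Hypothetical helper, not code from the patch. */
int dot_product_model(const int8_t *a, const int8_t *b, int n, int sum) {
  assert(n % 16 == 0);
  int32_t acc[4] = {0, 0, 0, 0};
  for (int i = 0; i < n; i += 16)
    sdot_model(acc, a + i, b + i);
  /* Horizontal reduction of the four lanes into the scalar result. */
  return sum + acc[0] + acc[1] + acc[2] + acc[3];
}
```

With n = 16 * K, the K = 1 case is a single sdot_model step followed by the reduction; larger K only repeats the step, which is why the K >= 2 cases should in principle map to the same instruction.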

I am also looking at some code that comes from TVM, a higher-level compiler that generates LLVM IR.
I have seen that there is a missing pattern in the interleaved load pass and a missing instruction in arm64: an ld8,
that is, an interleaved load for an 8 by 8 byte matrix.
I think we can generate an i16 ld4 and then generate the low/high byte extracts in each lane.
This will simplify the DAG on which we do instruction selection and enable generation of the dot product.
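A scalar sketch of that decomposition, assuming little-endian 16-bit elements: a 4-way interleaved load of i16 followed by low/high byte extracts yields the same eight byte columns as the hypothetical 8-way load. The function names and the fixed 8 by 8 size are illustrative assumptions, not existing LLVM or AArch64 entities.

```c
#include <stdint.h>

enum { GROUPS = 8 }; /* rows of the 8 by 8 byte matrix */

/* Model of a 4-way interleaved load of 16-bit elements: register j
   receives the j-th little-endian halfword of every 8-byte row. */
static void ld4_i16_model(const uint8_t src[64], uint16_t regs[4][GROUPS]) {
  for (int g = 0; g < GROUPS; g++)
    for (int j = 0; j < 4; j++) {
      const uint8_t *p = src + 8 * g + 2 * j;
      regs[j][g] = (uint16_t)(p[0] | (p[1] << 8));
    }
}

/* Hypothetical "ld8": column k receives byte k of every row, obtained
   here from the i16 ld4 model plus low/high byte extracts per lane. */
void ld8_model(const uint8_t src[64], uint8_t cols[8][GROUPS]) {
  uint16_t regs[4][GROUPS];
  ld4_i16_model(src, regs);
  for (int j = 0; j < 4; j++)
    for (int g = 0; g < GROUPS; g++) {
      cols[2 * j][g] = (uint8_t)(regs[j][g] & 0xff); /* low byte  */
      cols[2 * j + 1][g] = (uint8_t)(regs[j][g] >> 8); /* high byte */
    }
}
```

Each 16-bit register splits into two byte columns, so four registers cover all eight columns of the matrix.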

https://reviews.llvm.org/D67645