[all-commits] [llvm/llvm-project] 8f7f9d: [X86] Machine combine vnni instruction.

Thu Apr 27 01:42:59 PDT 2023

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 8f7f9d86a7555263ef08fded15a6b778d796ec3f
      https://github.com/llvm/llvm-project/commit/8f7f9d86a7555263ef08fded15a6b778d796ec3f
  Author: Luo, Yuanke <yuanke.luo at intel.com>
  Date:   2023-04-27 (Thu, 27 Apr 2023)

  Changed paths:
    M llvm/include/llvm/CodeGen/MachineCombinerPattern.h
    M llvm/include/llvm/CodeGen/TargetInstrInfo.h
    M llvm/lib/CodeGen/MachineCombiner.cpp
    M llvm/lib/Target/X86/X86InstrInfo.cpp
    M llvm/lib/Target/X86/X86InstrInfo.h
    M llvm/test/CodeGen/X86/avx512vnni-combine.ll
    M llvm/test/CodeGen/X86/avxvnni-combine.ll

  Log Message:
  -----------
  [X86] Machine combine vnni instruction.

"vpmaddwd + vpaddd" can be combined to vpdpwssd and the latency is
reduced after combination. However when vpdpwssd is in a critical path
the combination get less ILP. It happens when vpdpwssd is in a loop, the
vpmaddwd can be executed in parallel in multi-iterations while vpdpwssd
has data dependency for each iterations. If vpaddd is in a critical path
while vpmaddwd is not, it is profitable to split vpdpwssd into "vpmaddwd
+ vpaddd".
This patch is based on the machine combiner framework to acheive decision
on "vpmaddwd + vpaddd" combination. The typical example code is as
below.
```
__m256i foo(int cnt, __m256i c, __m256i b, __m256i *p) {

    for (int i = 0; i < cnt; ++i) {
        __m256i a = p[i];
        __m256i m = _mm256_madd_epi16 (b, a);
        c = _mm256_add_epi32(m, c);
    }

    return c;
}
```

Differential Revision: https://reviews.llvm.org/D148980