[all-commits] [llvm/llvm-project] 50af82: [AArch64] Cost all perfect shuffles entries as cost 1

Tue Apr 19 04:05:18 PDT 2022

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 50af82701c169e3ca2cf31a89fe7927e0ecb9b35
      https://github.com/llvm/llvm-project/commit/50af82701c169e3ca2cf31a89fe7927e0ecb9b35
  Author: David Green <david.green at arm.com>
  Date:   2022-04-19 (Tue, 19 Apr 2022)

  Changed paths:
    M llvm/lib/Target/AArch64/AArch64PerfectShuffle.h
    M llvm/test/CodeGen/AArch64/aarch64-wide-shuffle.ll
    M llvm/test/CodeGen/AArch64/arm64-dup.ll
    M llvm/test/CodeGen/AArch64/arm64-rev.ll
    M llvm/test/CodeGen/AArch64/build-vector-extract.ll
    M llvm/test/CodeGen/AArch64/insert-extend.ll
    M llvm/test/CodeGen/AArch64/neon-wide-splat.ll
    M llvm/test/CodeGen/AArch64/shuffle-tbl34.ll
    M llvm/test/CodeGen/AArch64/shuffles.ll
    M llvm/test/CodeGen/AArch64/sinksplat.ll
    M llvm/utils/PerfectShuffle/PerfectShuffle.cpp

  Log Message:
  -----------
  [AArch64] Cost all perfect shuffles entries as cost 1

A brief introduction to perfect shuffles - AArch64 NEON has a number of
shuffle operations - dups, zips, exts, movs etc that can in some way
shuffle around the lanes of a vector. Given a shuffle of size 4 with 2
inputs, some shuffle masks can be easily codegen'd to a single
instruction. A <0,0,1,1> mask for example is a zip LHS, LHS. This is
great, but some masks are not so simple, like a <0,0,1,2>. It turns out
we can generate that from zip LHS, <0,2,0,2>, having generated
<0,2,0,2> from uzp LHS, LHS, producing the result in 2 instructions.

It is not obvious from a given mask how to get there though. So we have
a simple program (PerfectShuffle.cpp in the util folder) that can scan
through all combinations of 4-element vectors and generate the perfect
combination of results needed for each shuffle mask (for some definition
of perfect). This is run offline to generate a table that is queried for
generating shuffle instructions. (Because the table could get quite big,
it is limited to 4 element vectors).

In the perfect shuffle tables zip, unz and trn shuffles were being cost
as 2, which is higher than needed and skews the perfect shuffle tables
to create inefficient combinations. This sets them to 1 and regenerates
the tables. The codegen will usually be better and the costs should be
more precise (but it can get less second-order re-use of values from
multiple shuffles, these cases should be fixed up in subsequent patches.

Differential Revision: https://reviews.llvm.org/D123379