[PATCH] D131047: [AArch64] Add a tablegen pattern to transform duplane(scalar_to_vector(x),0) to dup(x), and vectorize scalar operands for aarch64.neon.pmull64 intrinsic

Mon Aug 15 22:02:04 PDT 2022

mingmingl planned changes to this revision.
mingmingl added a comment.

In D131047#3723365 <https://reviews.llvm.org/D131047#3723365>, @dmgreen wrote:

> Can you upload with full context?

Ah, I should have made it more explicit that the context above was all my findings.

> If this is relying on DUPLANE(SCALAR_TO_VEC) not simplifying into DUP, that might not always be true as more optimizations are added in the future.

I didn't realize the change relies on the miss of `duplane(scalar_to_vector)->dup`. Thanks for the catch!

> It may be better to canonicalize all PMULL64 to use v1i64 vectors, and use EXTRACT_SUBVECTOR from vectors. It may require adding a new PMULL64 node if the intrinsics require i64 inputs, but it should allow ld1r to be created as opposed to load+dup.

Re "canonicalize all PMULL64 to use v1i64 and adding a PMULL64 node"  does this mean adding a aarch64-specific SelectionDAG node for PMULL64 (llvm doc <https://github.com/llvm/llvm-project/blob/556efdba8569f748fd805e30beeae8afa2d6a343/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L5238>), like what SMULL/UMULL does? This sounds promising and I will make a change. Let me know if I misunderstand anything, thanks for all the guidance!

Re `aarch64_neon_pmull64` relies on i64 inputs, yeah this means [1] won't compile, and [2] will require a `v1i64 -> i64` change, which is not desired (details in [3])

[1]

  // with "Type set is empty for each HW mode" error from tablegen 
  def : Pat<(int_aarch64_neon_pmull64 (v1i64 (extract_subvector V128:$Rn, (i64 1))),
                                      (v1i64 (extract_subvector V128:$Rm, (i64 1)))),
                  (PMULLv2i64 V128:$Rn, V128:$Rm)>

[2]

  def : Pat<(int_aarch64_neon_pmull64 (i64 (bitconvert (v1i64 (extract_subvector V128:$Rn, (i64 1))))),
                                      (i64 (bitconvert (v1i64 (extract_subvector V128:$Rm, (i64 1)))))),
            (PMULLv2i64 V128:$Rn, V128:$Rm)>;

[3]
First, `v1i64->i64`  reverses `i64->v1i64`, so having both of them either gives indefinite "combine->legalization->combine" loop ( nodes created by dag-combiner will be re-legalized <https://github.com/llvm/llvm-project/blob/556efdba8569f748fd805e30beeae8afa2d6a343/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L1603>  ) , or makes the steps super stateful and complex.
Secondly, PreprocessISelDAG <https://github.com/llvm/llvm-project/blob/556efdba8569f748fd805e30beeae8afa2d6a343/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp#L1061>, the per-target hook right before ISel begins, isn't implemented in aarch64 backend. And implementing it (for this use case) comes with a non-eligible cost to iterate all instructions.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131047/new/

https://reviews.llvm.org/D131047