[PATCH] D131047: [AArch64] Add a tablegen pattern for aarch64.neon.pmull64

Sun Aug 14 00:04:59 PDT 2022

mingmingl added a comment.

In D131047#3698635 <https://reviews.llvm.org/D131047#3698635>, @mingmingl wrote:

> In D131047#3696069 <https://reviews.llvm.org/D131047#3696069>, @dmgreen wrote:
>
>> The problem with tablegen patterns that produce multiple values is that they don't allow the other nodes to combine as they naturally should. For example in this case a `dup(load` could be a `ld1r`. And I'm pretty sure a `dup v1, x1` should be better than a 'fmov;dup' pair.
>>
>> There is some code that already handles dup of other mull instructions, including pmull in tryCombineLongOpWithDup. It doesn't look like it gets called for a pmull64 though, could it be used here?
>
> Thanks for pointing out `tryCombineLongOpWithDup` (wasn't aware of this approach before); it looks reasonable to me to follow the dag-combiner path. Plan changes.

After a study (details elaborated below), update the tablegen pattern to use `DUPv2i64gpr` to dup from GPR directly.

Some contexts on why updating tablegen pattern:

1. Ultimately, we want `dup V128, GPR64` as opposed to `fmov V64 GPR` (or `dup V64 GPR`) followed by a `duplane64 V128, 0`; so tablegen pattern use `DUPv2i64gpr` (dup from main to SIMD register) to express the intention.
2. Alternatively, this two-step approach works
  - First step is to transform `GPR64` to `extractelt (duplane64 (scalar_to_vector V128, GPR64), 0), 1` (in AArch64ISelLowering.cpp) so `pmull2` is used, and second step is to transform `(duplane64 (scalar_to_vector V128, GPR64), 0)` to `dup V128, GPR` (in AArch64InstrInfo.td)
  - https://reviews.llvm.org/differential/diff/452482/ shows the {cpp, td, test} changes. Note the tests are updated in the same way as this patch.
3. Both this patch and the alternative option in 2, retain  `duplane64` throughout the legalize/combine steps, until instruction selection really begins. This is because `extactelt(dup x)` will be combined into x (code where this happens <https://github.com/llvm/llvm-project/blob/2a4625530fb68bd5b7cf0d61372e93beaf053444/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp#L15242-L15245>)
4. `tryCombineLongOpWithDup` is suitable when operands are vector, but intrinsic `llvm.aarch64.neon.pmull64` has scalar operands.
  - Vectorizing a scalar operand during dag-combiner is not preferred here, since new operands of a combined node won't be appended to worklist for further combination (only the combined node and its users will be added to the worklist).
  - Even if new operands could be added to worklist for further combination opportunities, we want to retain `duplane64` (rather than combining it into `dup`) as mentioned in 3.

(Sorry for the delay; I did some study for the findings and then took a one-week vacation)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131047/new/

https://reviews.llvm.org/D131047