[PATCH] D135102: [AArch64] Compare BFI and ORR with left-shifted operand for OR instruction selection.

Wed Nov 2 00:47:07 PDT 2022

mingmingl added a comment.

In D135102#3899360 <https://reviews.llvm.org/D135102#3899360>, @dmgreen wrote:

> Hmm. I had not considered am ORR with shift to be cheaper than a BFM before. From what I can tell it doesn't seem to be universal across all cpus, but does look like it will be faster or equal.

Yes it's true that ORR with shift could be the same as BFM (e.g. Cortex A57 <https://developer.arm.com/documentation/uan0015/b>), or faster (e.g. NeoverseN1 <https://developer.arm.com/documentation/swog309707/latest>, CortexA77 <https://en.wikichip.org/w/images/6/6d/arm_cortex_a77_sog.pdf>)

>> A side question, is it a typical use case to convey metadata (e.g., op0 and op1 inside ORR op0, op1 contributes bits that doesn't overlap) in the SDNode class?
>
> That sounds like it would usually be calculated with KnownBits, like in haveNoCommonBitsSet. Unfortunately post-isel the amount of information we can extract is much less than from the generic DAG nodes.

Yes, `SelectionDAG::computeKnownBits` handles generic DAG nodes but doesn't handle <https://github.com/llvm/llvm-project/blob/76fc5fe64a23edc4574e31d6af4de8e7808b43a7/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp#L3805-L3808> machine-op-code (with NodeType < 0 <https://github.com/llvm/llvm-project/blob/76fc5fe64a23edc4574e31d6af4de8e7808b43a7/llvm/include/llvm/CodeGen/SelectionDAGNodes.h#L698>); as a result, computing something similar to 'haveNoCommonBitSet' won't work out of the box.

>> [About the two/three options]
>
> Is the motivating pattern just the one from the commit message, or any bfm that could be a shifted orr?

Yes, the motivating test case is just the one from commit message; and the other updated tests are results of other lines (that actually look simpler, and added to show ORR-not-BFM is a more generic question to solve).

> aarch64-mi-peephole-opt is an option - we always run into problems implementing things there but if it is easier to write that is always an option. (The machine combiner too, if scheduling info is useful). Larger patterns might be more difficult though. The (existing) code in DAG2DAG doesn't feel like it scales super well. But equally like you say adding ISel nodes has downsides. What would this look like from GlobalISel? How much code would need to be added to make aarch64-mi-peephole-opt work?

BFM is not used in GlobalISel (four instructions generated https://godbolt.org/z/MMvMe34zv), so a BFM pattern matcher (inside peephole or machine-combiner) won't optimize GlobalISel output in this case.

Regarding the amount of work inside peephole (or machine-combiner), I don't have a demo at hand but the number of lines should be within a few hundred (not thousand) just for the motivating test case.

However, without the context that this BFI is from `ISD::OR`, building up this context (that it's correct to convert BFI back to ORR) and fixing the other affected tests in this patch would require some implementation.

> Are the only regressions on uaddo_v4i1 and umulo_v4i1? I'm not against ignoring those, if they are just overflowing nodes on i1 types being awkwardly expanded and it doesn't come up in other places.

In the affected test cases, only uaddo_v4i1 and umulo_v4i1 regressed  -> more generally, useful-bit info (from BFM, lost in ORR) simplifies away one AND node <https://github.com/llvm/llvm-project/blob/76fc5fe64a23edc4574e31d6af4de8e7808b43a7/llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp#L2900> from `Dst` (as shown in code link, when AND zeros exactly the bits that are going to be inserted from `Src`) -> in this sense, other cases might show up (not type-extended small integers)

Maybe I could write a working demo in peephole or machine-combiner for one motivating case as a start? For the rest of tests, I could file Github PR to track them.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135102/new/

https://reviews.llvm.org/D135102