[all-commits] [llvm/llvm-project] 00f239: [MLIR][TOSA] Add --tosa-reduce-transposes pass (#1...

Fri Sep 13 19:17:17 PDT 2024

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 00f239e48ab9761a963839e118ee6cc4ee42e531
      https://github.com/llvm/llvm-project/commit/00f239e48ab9761a963839e118ee6cc4ee42e531
  Author: Arteen Abrishami <114886331+arteen1000 at users.noreply.github.com>
  Date:   2024-09-13 (Fri, 13 Sep 2024)

  Changed paths:
    M mlir/include/mlir/Dialect/Tosa/IR/TosaOpBase.td
    M mlir/include/mlir/Dialect/Tosa/IR/TosaOps.h
    M mlir/include/mlir/Dialect/Tosa/Transforms/Passes.td
    M mlir/lib/Dialect/Tosa/Transforms/CMakeLists.txt
    A mlir/lib/Dialect/Tosa/Transforms/TosaReduceTransposes.cpp
    A mlir/test/Dialect/Tosa/tosa-reduce-transposes.mlir

  Log Message:
  -----------
  [MLIR][TOSA] Add --tosa-reduce-transposes pass (#108260)

----------
Motivation:
----------

Some legalization pathways introduce redundant tosa.TRANSPOSE
operations that result in avoidable data movement. For example,
PyTorch -> TOSA contains a lot of unnecessary transposes due
to conversions between NCHW and NHWC.

We wish to remove all the ones that we can, since in general
it is possible to remove the overwhelming majority.

------------
Changes Made:
------------

- Add the --tosa-reduce-transposes pass
- Add TosaElementwiseOperator trait.

-------------------
High-Level Overview:
-------------------

The pass works through the transpose operators in the program. It begins
at some
transpose operator with an associated permutations tensor. It traverses
upwards
through the dependencies of this transpose and verifies that we
encounter only
operators with the TosaElementwiseOperator trait and terminate in either
constants, reshapes, or transposes.

We then evaluate whether there are any additional restrictions (the
transposes
it terminates in must invert the one we began at, and the reshapes must
be ones
in which we can fold the transpose into), and then we hoist the
transpose through
the intervening operators, folding it at the constants, reshapes, and
transposes.

Finally, we ensure that we do not need both the transposed form (the
form that
had the transpose hoisted through it) and the untransposed form (which
it was prior),
by analyzing the usages of those dependent operators of a given
transpose we are
attempting to hoist and replace.

If they are such that it would require both forms to be necessary, then
we do not
replace the hoisted transpose, causing the new chain to be dead.
Otherwise, we do
and the old chain (untransposed form) becomes dead. Only one chain will
ever then
be live, resulting in no duplication.

We then perform a simple one-pass DCE, so no canonicalization is
necessary.

--------------
Impact of Pass:
--------------

Patching the dense_resource artifacts (from PyTorch) with dense
attributes to
permit constant folding, we receive the following results.

Note that data movement represents total transpose data movement,
calculated
by noting which dimensions moved during the transpose.

///////////
MobilenetV3:
///////////

BEFORE total data movement: 11798776 B (11.25 MiB)
AFTER total data movement: 2998016 B (2.86 MiB)
74.6% of data movement removed.

BEFORE transposes: 82
AFTER transposes: 20
75.6% of transposes removed.

////////
ResNet18:
////////

BEFORE total data movement: 20596556 B (19.64 MiB)
AFTER total data movement: 1003520 B (0.96 MiB)
95.2% of data movement removed.

BEFORE transposes: 56
AFTER transposes: 5
91.1% of transposes removed.

////////
ResNet50:
////////

BEFORE total data movement: 83236172 B (79.3 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
96.4% of data movement removed

BEFORE transposes: 120
AFTER transposes: 7
94.2% of transposes removed.

/////////
ResNet101:
/////////

BEFORE total data movement: 124336460 B (118.58 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
97.6% of data movement removed

BEFORE transposes: 239
AFTER transposes: 7
97.1% of transposes removed.

/////////
ResNet152:
/////////

BEFORE total data movement: 175052108 B (166.94 MiB)
AFTER total data movement: 3010560 B (2.87 MiB)
98.3% of data movement removed

BEFORE transposes: 358
AFTER transposes: 7
98.0% of transposes removed.

////////
Overview:
////////

We see that we remove up to 98% of transposes and eliminate
up to 98.3% of redundant transpose data movement.

In the context of ResNet50, with 120 inferences per second,
we reduce dynamic transpose data bandwidth from 9.29 GiB/s
to 344.4 MiB/s.

-----------
Future Work:
-----------

(1) Evaluate tradeoffs with permitting ConstOp to be duplicated across
hoisted
    transposes with different permutation tensors.

(2) Expand the class of foldable upstream ReshapeOp we permit beyond
    N -> 1x1x...x1xNx1x...x1x1.

(3) Enchance the pass to permit folding arbitrary transpose pairs,
beyond
    those that form the identity.

(4) Add support for more instructions besides TosaElementwiseOperator as
    the intervening ones (for example, the reduce_* operators).

(5) Support hoisting transposes up to an input parameter.

Signed-off-by: Arteen Abrishami <arteen.abrishami at arm.com>

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications