[Mlir-commits] [mlir] [mlir] Improvements to the 'quant' dialect (PR #100667)
Rafael Ubal
llvmlistbot at llvm.org
Tue Aug 27 11:46:39 PDT 2024
================
@@ -0,0 +1,297 @@
+//===- QuantBase.td - Quantization dialect base ------------*- tablegen -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Predicates for types in the Quantization dialect.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef QUANT_BASE
+#define QUANT_BASE
+
+include "mlir/IR/OpBase.td"
+
+def Quant_Dialect : Dialect {
+ let name = "quant";
+ let description = [{
+ The `quant` dialect offers a framework for defining and manipulating
+ quantized values. Central to this framework is the `!quant.uniform` data
+ type, used to represent quantized values. This dialect also provides a
+ suite of operations to handle and convert quantized values between their
+ original floating-point representations and the optimized, lower bit-width
+ integer representations. The `quant` dialect also provides transformation
+ passes that lower these operations into other core MLIR dialects, while
+ flattening all occurrences of quantized types into their integer
+ counterparts.
+
+
+ ## The `!quant.uniform` type
+
+ The quantization process establishes a relationship between two types of
+ values: an *expressed value* and a *stored value*. The former refers to the
+ floating-point representation used in an original machine learning model,
+ capturing the precise numerical characteristics needed for accurate
+ calculations. The latter is the simplified integer representation that
+ resides in memory after quantization. The `!quant.uniform` data type
+ encodes the necessary information for (lossy) round-trip conversion between
+ an expressed and a stored value.
+
+ The `!quant.uniform` type has two variants: per-layer quantization and
+ per-channel (or per-axis) quantization. In per-layer quantization, the
+ quantization information affects an entire tensor uniformly. Conversely, in
+ per-channel quantization, the data type encodes the specific tensor axis
+ that serves as the channel and includes quantization information for each
+ individual channel within the tensor. Below are the specific syntactic and
+ semantic considerations for each modality.
+
+
+ ### Per-layer quantization
+
+ This is the general syntax of the `!quant.uniform` type representing
+ per-layer quantization:
+
+ ```
+ `!quant.uniform` `<`
+ storedType (`<` storageMin `:` storageMax `>`)? `:`
+ expressedType `,`
+ scale (`:` zeroPoint)?
+ `>`
+ ```
+
+ The type contains the following parameters:
+
+ - `storedType`: Integer type of the value stored in memory. This type
+ conveys the bit width and signedness of the quantized stored value.
+ Signed integer types are represented as `'i' bitWidth` (e.g., `i8`),
+ while unsigned integer types are represented as `'u' bitWidth` (e.g.,
+ `u8`).
+
+ - `storageMin`, `storageMax`: Optional bounds for the stored value. If
+ given, they must be within the range of `storedType`. If omitted, the
+ entire range of `storedType` is allowed (e.g., `-128...127` for `i8` or
+ `0...255` for `u8`).
+
+ - `expressedType`: Floating-point type of the value expressed by this
+ quantized type.
+
+ - `scale`: Floating-point value of type `expressedType` used in the
----------------
rafaelubalmw wrote:
It does only accommodate floating-point scaling, but here I'm simply documenting existing behavior. If there is interest in supporting other scaling mechanisms, that's something we could propose in a follow-up PR. See my previous reply regarding the supported floating-point types.
I believe that defining a translation into a multiplier-shift scaling system is indeed out of the scope of the `quant` dialect definition. Here is how we envision a best-effort TFL-to-TOSA translation proceeding (to cite our most immediate application of this work) when the source TFL op uses a `!quant.uniform` operand:
1) Attempt to reproduce the TFL execution engine behavior through `tosa.rescale` ops and integer arithmetic for as many corner cases of the TFL op as possible.
2) If a specific combination of parameters in the `!quant.uniform` type is not supported by an existing TFL-to-TOSA rewrite pattern (e.g., specific storage type bit width or per-channel quantization), resort to a dequantization solution based on emitting `quant.dcast` + standard non-quantized lowering for floating-point inputs + `quant.qcast`.
This will guarantee that all valid TFL input arguments are properly supported in a TFL lowering pass, in a workflow where TOSA is the preferred, but not the exclusive, target dialect.
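To make the scheme concrete: the affine mapping that `!quant.uniform` encodes (`expressed = scale * (stored - zeroPoint)`), and the kind of multiplier-plus-shift approximation that integer-only rescaling schemes such as `tosa.rescale` rely on, can be sketched in plain Python. This is an illustrative sketch only; the helper names are hypothetical and not part of any dialect or library:

```python
# Hypothetical helpers sketching the affine quantization scheme that
# !quant.uniform describes. Not dialect code; names are illustrative.

def quantize(x, scale, zero_point, storage_min, storage_max):
    """Expressed (float) -> stored (int): round(x / scale) + zeroPoint,
    clamped to the [storageMin, storageMax] range of the stored type."""
    q = round(x / scale) + zero_point
    return max(storage_min, min(storage_max, q))

def dequantize(q, scale, zero_point):
    """Stored (int) -> expressed (float): scale * (q - zeroPoint)."""
    return scale * (q - zero_point)

# Round trip for something like !quant.uniform<i8:f32, 0.1:5>
# (scale 0.1, zero point 5, full i8 storage range).
q = quantize(2.05, 0.1, 5, -128, 127)   # stored integer value
x = dequantize(q, 0.1, 5)               # lossy: within scale/2 of 2.05

# Fixed-point (multiplier + right-shift) approximation of a float scale,
# as used by integer-only rescaling. Assumes 0 < scale < 1 for brevity.
def quantize_multiplier(scale, bits=31):
    shift = 0
    while scale < 0.5:                  # normalize scale into [0.5, 1)
        scale *= 2.0
        shift += 1
    multiplier = round(scale * (1 << bits))
    return multiplier, bits + shift     # scale ~= multiplier * 2**-(bits+shift)

m, s = quantize_multiplier(0.1)
approx = m / (1 << s)                   # close to 0.1
```

The second helper is the bridge the comment above alludes to: a lowering that targets integer arithmetic replaces the floating-point `scale` with an integer multiply followed by an arithmetic right shift, while the `dcast`/`qcast` fallback keeps the floating-point formulas in the first two helpers.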
https://github.com/llvm/llvm-project/pull/100667