[llvm] [APFloat] Add APFloat support for E8M0 type (PR #107127)

Sergey Kozub via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 18 00:09:06 PDT 2024


================
@@ -195,6 +195,12 @@ struct APFloatBase {
     // improved range compared to half (16-bit) formats, at (potentially)
     // greater throughput than single precision (32-bit) formats.
     S_FloatTF32,
+    // 8-bit floating point number with (all the) 8 bits for the exponent
+    // like in FP32. There are no zeroes, no infinities, and no denormal values.
+    // NaN is represented with all bits set to 1. Bias is 127.
+    // This represents the scale data type in the MX specification from
+    // https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
+    S_Float8E8M0FN,
----------------
sergey-kozub wrote:

Found a description of suffixes here: https://github.com/jax-ml/ml_dtypes/blob/main/README.md#float8_e4m3fnuz

`F` stands for "finite" (no infinities), `N` for a special NaN encoding, and `UZ` for unsigned zero.
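
To make the encoding described in the comment above concrete, here is a minimal sketch of how a single E8M0 byte maps to a value, assuming only what that comment states (8 exponent bits, bias 127, all ones is NaN, no zero, no infinities, no denormals). `decodeE8M0` is a hypothetical standalone helper for illustration, not part of APFloat or of this PR.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not an APFloat API): decode one E8M0 byte to double.
// Assumes the layout from the quoted comment: no sign bit, no mantissa bits,
// exponent bias 127, and 0xFF reserved as the only NaN encoding.
double decodeE8M0(std::uint8_t bits) {
  if (bits == 0xFF)
    return std::numeric_limits<double>::quiet_NaN();
  // Every other pattern is a power of two: 2^(bits - 127).
  // The range 2^-127 .. 2^127 is exactly representable in double.
  return std::ldexp(1.0, static_cast<int>(bits) - 127);
}
```

Under these assumptions the all-zero pattern decodes to 2^-127 rather than to zero, and 0xFE decodes to the largest finite value 2^127, which is what the `F` (finite) and `N` (NaN encoding) suffixes are meant to convey.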

https://github.com/llvm/llvm-project/pull/107127

