[llvm] [APFloat] Add APFloat support for E8M0 type (PR #107127)

Thu Sep 5 11:36:53 PDT 2024

================
@@ -195,6 +195,12 @@ struct APFloatBase {
     // improved range compared to half (16-bit) formats, at (potentially)
     // greater throughput than single precision (32-bit) formats.
     S_FloatTF32,
+    // 8-bit floating point number with (all the) 8 bits for the exponent
+    // like in FP32. There are no zeroes, no infinities, and no denormal values.
+    // NaN is represented with all bits set to 1. Bias is 127.
+    // This represents the scale data type in the MX specification from
+    // https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
+    S_Float8E8M0FN,
----------------
dcaballe wrote:

Excuse my ignorance here, but doesn't the `N` suffix mean that there is no NaN representation, or is my understanding incorrect? Should this be `S_Float8E8M0F` instead?
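
For reference, the encoding described in the hunk comment (8 exponent bits, bias 127, all-ones pattern reserved for NaN) can be sketched as below. This is only an illustration and not code from the PR: `decodeE8M0` is a hypothetical helper name, and the sketch assumes the value of a non-NaN bit pattern `b` is exactly `2^(b - 127)`.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of APFloat): decode one E8M0 byte to a double.
// Assumes value = 2^(bits - 127), with 0xFF reserved for NaN.
double decodeE8M0(std::uint8_t bits) {
  // All bits set is the single NaN encoding.
  if (bits == 0xFF)
    return std::numeric_limits<double>::quiet_NaN();
  // There is no sign bit and no mantissa, so every other pattern is a
  // positive power of two. The range 2^-127 .. 2^127 fits in a double.
  return std::ldexp(1.0, static_cast<int>(bits) - 127);
}
```

Under these assumptions every non-NaN pattern decodes to a positive power of two, which is why the comment says the format has no zeroes, no infinities, and no denormal values.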

https://github.com/llvm/llvm-project/pull/107127