[Mlir-commits] [mlir] Add arith expansion of f8E8M0 type for extf/trunc ops (PR #140332)

Tue May 20 12:33:28 PDT 2025

umangyadav wrote:

> As discussed on the IREE issue, I believe that there is a longstanding tradition that if a conversion returns a finite value then it should be the nearest representable value, that is corroborated by table 3 in the OCP spec discussing the "overflow or saturate" semantics, that is broken by taking the absolute value.

Table 3 applies to the FP8 types other than F8E8M0.

F8E8M0 is used for the shared block scale, which is in fact calculated by extracting the exponent bits of `fabs(value.f32)`. So it is not really a "conversion" or "cast" in the conventional sense.

The OCP spec gives this definition for F8E8M0:
"E8M0 is an unsigned representation of a conventional biased Float32 exponent"
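
To illustrate what "taking the exponent bits" means, here is a minimal Python sketch. It is not the code from this PR: the helper names `f32_to_e8m0_bits` and `e8m0_bits_to_float` are hypothetical, and the sketch simply drops the mantissa without any rounding. It only shows that the sign is discarded and that the result is the biased Float32 exponent.

```python
import math
import struct


def f32_to_e8m0_bits(value: float) -> int:
    """Hypothetical helper: biased exponent byte of value's float32 encoding.

    The sign bit (bit 31) is masked away, which has the same effect as
    taking fabs(value) first. The 23 mantissa bits are dropped, not rounded.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return (bits >> 23) & 0xFF


def e8m0_bits_to_float(e: int) -> float:
    """Hypothetical helper: decode an E8M0 byte as 2**(e - 127).

    0xFF is the NaN encoding in the OCP MX spec; E8M0 has no sign bit,
    no zero, and no infinities.
    """
    if e == 0xFF:
        return math.nan
    return 2.0 ** (e - 127)
```

With this sketch, `6.0` and `-6.0` both map to the byte `129`, which decodes to `4.0`. That is the power-of-two scale of the input, not the nearest representable value to `-6.0`, which is why the usual nearest-value expectation for casts does not carry over.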

Here is one reference implementation:
https://github.com/amd/Quark/blob/60cd6e46d20a5553a7b1a754c0459737f3c31fde/quark/onnx/operators/custom_ops/src/mx/cuda/mx_kernel.cu#L63

https://github.com/llvm/llvm-project/pull/140332