[Mlir-commits] [mlir] Add arith expansion of f8E8M0 type for extf/trunc ops (PR #140332)

Tue May 20 14:30:07 PDT 2025

krzysz00 wrote:

As a side note, `llvm::APFloat` crashes on negative `f8E8M0`s.

Re `arith.truncf`, it's worth noting that most operations that operate on an "f8E8M0-in-f32" (AMD's got a bunch of these) ignore the sign bit (that is, they implicitly take a fabs()).

From the sources I'm seeing, the "natural" conversion function from f32 to f8E8M0 isn't really a floating-point truncate, but
```llvm
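; LLVM IR has no f8E8M0 type, so the result is returned as its raw i8 encoding.
define i8 @get_exponent(float %source) {
  %source.bits = bitcast float %source to i32
  %no.mantissa = lshr i32 %source.bits, 23
  ; The trunc drops the sign bit and keeps the 8 exponent bits.
  %f8e8m0 = trunc i32 %no.mantissa to i8
  ; `uno` is true only when %source is NaN.
  %was.nan = fcmp uno float %source, %source
  %result = select i1 %was.nan, i8 -1, i8 %f8e8m0
  ret i8 %result
}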
```

I _suspect_ that we may want an operation that isn't named `truncf` here - something like `arith.extract_exponent`, but then that leaves us with `truncf` unimplemented for f8E8M0 ... which ... maybe that's true.

(That is to say, in practice, when we're writing out a buffer of scales after a post-matmul quantization, we'll want to do the exponent extraction, not a more principled truncf)
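
To make the difference from a rounding `truncf` concrete, here's a rough C++ sketch of the same exponent extraction, plus a decoder. The names `extractExponentE8M0` and `decodeE8M0` are made up for illustration and aren't anything in LLVM or MLIR; the sketch assumes the usual E8M0 encoding (bias 127, `0xFF` meaning NaN).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not an LLVM/MLIR API): returns the raw E8M0 byte for
// an f32 by taking its biased exponent field. Sign and mantissa are dropped.
inline std::uint8_t extractExponentE8M0(float source) {
  if (std::isnan(source))
    return 0xFF; // Assumed E8M0 convention: 0xFF encodes NaN.
  std::uint32_t bits;
  std::memcpy(&bits, &source, sizeof(bits)); // Equivalent of the bitcast.
  // After the shift, bit 8 holds the sign; narrowing to 8 bits discards it.
  return static_cast<std::uint8_t>(bits >> 23);
}

// Hypothetical helper: decodes an E8M0 byte as 2^(e - 127).
inline float decodeE8M0(std::uint8_t e) {
  if (e == 0xFF)
    return std::numeric_limits<float>::quiet_NaN();
  return std::ldexp(1.0f, static_cast<int>(e) - 127);
}
```

With this, `3.0f` and `-3.0f` both come back as the byte for `2.0f`: the sign is ignored and the value is floored to a power of two, where a rounding `truncf` might pick `4.0f` instead, depending on the rounding mode. Infinity also lands on `0xFF`, since its exponent field is all ones.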

https://github.com/llvm/llvm-project/pull/140332