[Mlir-commits] [mlir] Add arith expansion of f8E8M0 type for extf/trunc ops (PR #140332)
Krzysztof Drewniak
llvmlistbot at llvm.org
Tue May 20 14:30:07 PDT 2025
krzysz00 wrote:
As a side note, `llvm::APFloat` crashes on negative `f8E8M0`s.
Re `arith.truncf`, it's worth noting that most operations that operate on a "f8E8M0-in-f32" (AMD's got a bunch of these) ignore the sign bit (that is, implicitly take a fabs()).
From the sources I'm seeing, the "natural" conversion function from f32 to f8E8M0 isn't really a floating-point truncate, but
```llvm
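; Pseudo-IR: f8E8M0 is not a real LLVM IR type; the i8 carries its bits.
define f8E8M0 @get_exponent(float %source) {
  %source.bits = bitcast float %source to i32
  ; Drop the 23 mantissa bits; the trunc below also discards the sign bit.
  %no.mantissa = lshr i32 %source.bits, 23
  %f8e8m0 = trunc i32 %no.mantissa to i8
  ; uno is true only when an operand is NaN.
  %was.nan = fcmp uno float %source, %source
  ; NaN maps to 0xFF (written as i8 -1).
  %result = select i1 %was.nan, i8 -1, i8 %f8e8m0
  ret f8E8M0 %result
}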
```
I _suspect_ that we may want an operation that isn't named `truncf` here - something like `arith.extract_exponent`, but then that leaves us with `truncf` unimplemented for f8E8M0 ... which ... maybe that's true.
(That is to say, in practice, when we're writing out a buffer of scales after a post-matmul quantization, we'll want to do the exponent extraction, not a more principled `truncf`.)
https://github.com/llvm/llvm-project/pull/140332