[Mlir-commits] [mlir] Fix TOSA FP16->INT16 CAST lowering (PR #79299)

Wed Jan 24 13:36:27 PST 2024

================
@@ -480,23 +480,53 @@ createLinalgBodyCalculationForElementwiseOp(Operation *op, ValueRange args,
     }
 
     if (arith::FPToSIOp::areCastCompatible(srcTy, dstTy)) {
-      auto intMin = rewriter.create<arith::ConstantOp>(
+      auto intMinFP = rewriter.create<arith::ConstantOp>(
           loc, rewriter.getFloatAttr(
                    getElementTypeOrSelf(srcTy),
                    APInt::getSignedMinValue(dstTy.getIntOrFloatBitWidth())
                        .getSExtValue()));
 
-      auto intMax = rewriter.create<arith::ConstantOp>(
+      auto rounded = rewriter.create<math::RoundEvenOp>(loc, args[0]);
+
+      // The input floating-point type has enough mantissa bits to represent
+      // the max int value so just clamp the input in the floating-point
+      // domain and convert to int. Note: the min value can be represented
+      // because it consists of a mantissa with only the lsb set.
+      if (cast<FloatType>(srcTy).getFPMantissaWidth() >=
+          dstTy.getIntOrFloatBitWidth() - 1) {
----------------
RoboTux wrote:

> Can you expand on the INT_MIN case? I'd think that the minimum 16-bit int would be -2^15 and if the FP format has a dedicated sign bit (and in turn +0 and -0), the mantissa needs to have 16 bits as well to represent 2^15 (i.e., sign bit set, msb mantissa bit set).

-2¹⁵ is -(1 << 15) [1] so we only need 1 bit of mantissa but you make me realize that I didn't consider whether the exponent can be encoded. For FP16 the bias is 15 and the exponent has 5 bits which allow us a max exponent of 2⁵-1-1-15 = 30-15 = 15 so all good. But clearly FP8->I16 would be a problem and likewise FP16->I32 though in both cases we don't need to clamp since the integer result can encode all the integer values that the floating-point can encode. So I think we need more cases:
- no need to clamp (e.g. FP8 -> I16 or FP16 -> I32)
- clamp needed but int min cannot be encoded in FP
- clamp needed and both min and max can be encoded in FP

[1] My understanding is that the real (as in real numbers) value of a non subnormal floating-point value is (-1) * sign * (mantissa << (exponent - bias)).
> 
> I haven't thought about this too much so maybe I'm missing something, but shouldn't the condition be
> 
> ```
> if (MantissaWidth() >= IntBitWidth())
> ```

For a n-bit signed int, the max int value is 0 followed by (n-1) 1 bits. So we need a mantissa width at least as much as IntBitWidth() - 1. Note that the Mantissa includes any implicit leading bit so we don't have to worry about that.

https://github.com/llvm/llvm-project/pull/79299