[llvm] [X86][BF16] Improve vectorization of BF16 (PR #88486)
Krzysztof Drewniak via llvm-commits
llvm-commits at lists.llvm.org
Thu May 2 00:22:17 PDT 2024
================
@@ -56517,17 +56501,40 @@ static SDValue combineFP16_TO_FP(SDNode *N, SelectionDAG &DAG,
static SDValue combineFP_EXTEND(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
+ EVT VT = N->getValueType(0);
+ bool IsStrict = N->isStrictFPOpcode();
+ SDValue Src = N->getOperand(IsStrict ? 1 : 0);
+ EVT SrcVT = Src.getValueType();
+
+ SDLoc dl(N);
+ if (SrcVT.getScalarType() == MVT::bf16) {
+ if (!IsStrict && Src.getOpcode() == ISD::FP_ROUND &&
+ Src.getOperand(0).getValueType() == VT)
----------------
krzysz00 wrote:
Discussion noted. However, as the programmer, I want to explicitly perform that truncate-extend behavior in one spot in my input (because I'm testing a bfloat function whose result has been `fpext`ed against a floating-point version that's had its lower bits masked off).
This rewrite has caused per-element result errors around 1e-2 (if I remember right, 16.25 vs. 16.3125 or the like).
I understand that this intermediate elimination improves performance and is numerically useful a lot of the time, so, given that ... what mechanism would you recommend for forcing this optimization to not fire for a particular pair of round and extend operations?
(I don't want to make things strict at the function level if possible - I want to protect a particular fptrunc / fpext pair from this optimization. Would `canonicalize` do it, or should I stick in a more interesting no-op?)
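
To make the discrepancy concrete, here is a minimal sketch in plain C++ (not LLVM code) of what the fptrunc / fpext pair does to a value when it is kept, next to the masked float reference. The helper names `floatBits`, `bitsToFloat`, `roundTripThroughBF16`, and `maskLow16` are made up for illustration, the rounding is assumed to be round-to-nearest-even, and NaN handling is left out.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers: reinterpret a float as its 32-bit pattern and back.
static std::uint32_t floatBits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

static float bitsToFloat(std::uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Models float -> bf16 -> float, assuming round-to-nearest-even.
// bf16 keeps the upper 16 bits of the float encoding, so the round trip
// rounds away the lower 16 bits. NaN inputs are not handled in this sketch.
static float roundTripThroughBF16(float f) {
  std::uint32_t u = floatBits(f);
  // Adding 0x7FFF plus the lowest kept bit sends exact ties to the even value.
  std::uint32_t roundingBias = 0x7FFFu + ((u >> 16) & 1u);
  u += roundingBias;
  return bitsToFloat(u & 0xFFFF0000u);
}

// Models the float reference whose lower bits are masked off (truncation).
static float maskLow16(float f) {
  return bitsToFloat(floatBits(f) & 0xFFFF0000u);
}
```

Under these assumptions, `roundTripThroughBF16(16.3125f)` and `maskLow16(16.3125f)` both give 16.25, because 16.3125 needs eight fraction bits and bf16 stores only seven. If the round/extend pair is folded away, the value stays 16.3125 and no longer matches the masked reference, which is the kind of mismatch described above.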
https://github.com/llvm/llvm-project/pull/88486