[llvm] [X86][BF16] Improve vectorization of BF16 (PR #88486)
Krzysztof Drewniak via llvm-commits
llvm-commits at lists.llvm.org
Wed May 1 16:26:28 PDT 2024
================
@@ -56517,17 +56501,40 @@ static SDValue combineFP16_TO_FP(SDNode *N, SelectionDAG &DAG,
static SDValue combineFP_EXTEND(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {
+ EVT VT = N->getValueType(0);
+ bool IsStrict = N->isStrictFPOpcode();
+ SDValue Src = N->getOperand(IsStrict ? 1 : 0);
+ EVT SrcVT = Src.getValueType();
+
+ SDLoc dl(N);
+ if (SrcVT.getScalarType() == MVT::bf16) {
+ if (!IsStrict && Src.getOpcode() == ISD::FP_ROUND &&
+ Src.getOperand(0).getValueType() == VT)
----------------
krzysz00 wrote:
krzysz00 wrote:
Hi, several months late, but I don't think this optimization is correct, even in the absence of any sort of strictness mode.
That is, `extend(round(x))` isn't meaningfully equal to `x` to any reasonable level of precision - the code that triggered this for me is explicitly using
```llvm
; This is probably vectorized by the time it hits you
%vRef = load float, ptr %p.iter ; in a loop, etc.
%vTrunc = fptrunc float %vRef to bfloat
%vReExt = fpext bfloat %vTrunc to float
store float %vReExt, ptr %p.iter
```
in order to "lose" the extra precision from an f32 computation being used as a reference for a bfloat one.
The high-level structure goes like this:
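To make the concern concrete, here is a minimal sketch (in Python, simulating the bfloat16 round trip at the bit level with round-to-nearest-even; the helper names are mine, and NaN handling is omitted for brevity) showing that `fpext (fptrunc float x to bfloat) to float` is not the identity:

```python
import struct

def f32_bits(x):
    """Reinterpret a Python float as its float32 bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def fptrunc_fpext_bf16(x):
    """Simulate `fpext (fptrunc float x to bfloat) to float`:
    round the float32 to the nearest bfloat16 (round-to-nearest-even
    on the discarded low 16 mantissa bits), then widen back by
    zero-filling. NaNs are not handled correctly here."""
    b = f32_bits(x)
    rounded = (b + 0x7FFF + ((b >> 16) & 1)) & 0xFFFF0000
    return bits_f32(rounded)

x = 1.0 + 2.0**-10          # representable in float32, not in bfloat16
y = fptrunc_fpext_bf16(x)
print(x == y)               # the round trip loses the low mantissa bits
```

Folding the truncate/extend pair away would silently skip exactly this precision loss, which the quoted IR depends on.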
```
bfloat[N] cTest;
gpu_run(kernel, ..., cTest);
float[N] cTestExt = bfloat_to_float(cTest);
float[N] cRef;
gpu_run(refKernel, ..., cRef);
cRef = float_to_bfloat(bfloat_to_float(cRef));
test_accuracy(cTestExt, cRef, N, ...);
```
I'll also note that this transformation isn't, from what I can tell, present for `half`.
Please either revert this conditional specifically or justify it, thanks!
https://github.com/llvm/llvm-project/pull/88486