[PATCH] D150913: [Clang][BFloat16] Upgrade __bf16 to arithmetic type, change mangling, and extend excess precision support.

Thu May 25 19:16:26 PDT 2023

pengfei added inline comments.

================
Comment at: clang/docs/LanguageExtensions.rst:852
 ``double`` when passed to ``printf``, so the programmer must explicitly cast it to
 ``double`` before using it with an ``%f`` or similar specifier.

----------------
rjmccall wrote:
> codemzs wrote:
> > pengfei wrote:
> > > rjmccall wrote:
> > > > pengfei wrote:
> > > > > rjmccall wrote:
> > > > > > Suggested rework:
> > > > > > 
> > > > > > ```
> > > > > > Clang supports three half-precision (16-bit) floating point types: ``__fp16``,
> > > > > > ``_Float16`` and ``__bf16``.  These types are supported in all language
> > > > > > modes, but not on all targets:
> > > > > > 
> > > > > > - ``__fp16`` is supported on every target.
> > > > > > 
> > > > > > - ``_Float16`` is currently supported on the following targets:
> > > > > >   * 32-bit ARM (natively on some architecture versions)
> > > > > >   * 64-bit ARM (AArch64) (natively on ARMv8.2a and above)
> > > > > >   * AMDGPU (natively)
> > > > > >   * SPIR (natively)
> > > > > >   * X86 (if SSE2 is available; natively if AVX512-FP16 is also available)
> > > > > > 
> > > > > > - ``__bf16`` is currently supported on the following targets:
> > > > > >   * 32-bit ARM
> > > > > >   * 64-bit ARM (AArch64)
> > > > > >   * X86 (when SSE2 is available)
> > > > > > 
> > > > > > (For X86, SSE2 is available on 64-bit and all recent 32-bit processors.)
> > > > > > 
> > > > > > ``__fp16`` and ``_Float16`` both use the binary16 format from IEEE
> > > > > > 754-2008, which provides a 5-bit exponent and an 11-bit significand
> > > > > > (counting the implicit leading 1).  ``__bf16`` uses the `bfloat16
> > > > > > <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ format,
> > > > > > which provides an 8-bit exponent and an 8-bit significand; this is the same
> > > > > > > exponent range as ``float``, just with greatly reduced precision.
> > > > > > 
> > > > > > ``_Float16`` and ``__bf16`` follow the usual rules for arithmetic
> > > > > > floating-point types.  Most importantly, this means that arithmetic operations
> > > > > > on operands of these types are formally performed in the type and produce
> > > > > > values of the type.  ``__fp16`` does not follow those rules: most operations
> > > > > > immediately promote operands of type ``__fp16`` to ``float``, and so
> > > > > > arithmetic operations are defined to be performed in ``float`` and so result in
> > > > > > a value of type ``float`` (unless further promoted because of other operands).
> > > > > > See below for more information on the exact specifications of these types.
> > > > > > 
> > > > > > Only some of the supported processors for ``__fp16`` and ``__bf16`` offer
> > > > > > native hardware support for arithmetic in their corresponding formats.
> > > > > > The exact conditions are described in the lists above.  When compiling for a
> > > > > > processor without native support, Clang will perform the arithmetic in
> > > > > > ``float``, inserting extensions and truncations as necessary.  This can be
> > > > > > done in a way that exactly emulates the behavior of hardware support for
> > > > > > arithmetic, but it can require many extra operations.  By default, Clang takes
> > > > > > advantage of the C standard's allowances for excess precision in intermediate
> > > > > > operands in order to eliminate intermediate truncations within statements.
> > > > > > This is generally much faster but can generate different results from strict
> > > > > > operation-by-operation emulation.
> > > > > > 
> > > > > > The use of excess precision can be independently controlled for these two
> > > > > > types with the ``-ffloat16-excess-precision=`` and
> > > > > > > ``-fbfloat16-excess-precision=`` options.  Valid values include:
> > > > > > > 
> > > > > > - ``none`` (meaning to perform strict operation-by-operation emulation)
> > > > > > - ``standard`` (meaning that excess precision is permitted under the rules
> > > > > >   described in the standard, i.e. never across explicit casts or statements)
> > > > > > - ``fast`` (meaning that excess precision is permitted whenever the
> > > > > >   optimizer sees an opportunity to avoid truncations; currently this has no
> > > > > >   effect beyond ``standard``)
> > > > > > 
> > > > > > The ``_Float16`` type is an interchange floating type specified in
> > > > > > > ISO/IEC TS 18661-3:2015 ("Floating-point extensions for C").  It will
> > > > > > be supported on more targets as they define ABIs for it.
> > > > > > 
> > > > > > The ``__bf16`` type is a non-standard extension, but it generally follows
> > > > > > the rules for arithmetic interchange floating types from ISO/IEC TS
> > > > > > 18661-3:2015.  In previous versions of Clang, it was a storage-only type
> > > > > > that forbade arithmetic operations.  It will be supported on more targets
> > > > > > as they define ABIs for it.
> > > > > > 
> > > > > > The ``__fp16`` type was originally an ARM extension and is specified
> > > > > > by the `ARM C Language Extensions <https://github.com/ARM-software/acle/releases>`_.
> > > > > > Clang uses the ``binary16`` format from IEEE 754-2008 for ``__fp16``,
> > > > > > not the ARM alternative format.  Operators that expect arithmetic operands
> > > > > > immediately promote ``__fp16`` operands to ``float``.
> > > > > > 
> > > > > > It is recommended that portable code use ``_Float16`` instead of ``__fp16``,
> > > > > > as it has been defined by the C standards committee and has behavior that is
> > > > > > more familiar to most programmers.
> > > > > > 
> > > > > > Because ``__fp16`` operands are always immediately promoted to ``float``, the
> > > > > > common real type of ``__fp16`` and ``_Float16`` for the purposes of the usual
> > > > > > arithmetic conversions is ``float``.
> > > > > > 
> > > > > > A literal can be given ``_Float16`` type using the suffix ``f16``. For example,
> > > > > > ``3.14f16``.
> > > > > > 
> > > > > > Because default argument promotion only applies to the standard floating-point
> > > > > > types, ``_Float16`` values are not promoted to ``double`` when passed as variadic
> > > > > > or untyped arguments.  As a consequence, some caution must be taken when using
> > > > > > certain library facilities with ``_Float16``; for example, there is no ``printf`` format
> > > > > > specifier for ``_Float16``, and (unlike ``float``) it will not be implicitly promoted to
> > > > > > ``double`` when passed to ``printf``, so the programmer must explicitly cast it to
> > > > > > ``double`` before using it with an ``%f`` or similar specifier.
> > > > > > ```
> > > > > ```
> > > > > Only some of the supported processors for ``__fp16`` and ``__bf16`` offer
> > > > > native hardware support for arithmetic in their corresponding formats.
> > > > > ```
> > > > > 
> > > > > Do you mean ``_Float16``?
> > > > > 
> > > > > ```
> > > > > The exact conditions are described in the lists above.  When compiling for a
> > > > > processor without native support, Clang will perform the arithmetic in
> > > > > ``float``, inserting extensions and truncations as necessary.
> > > > > ```
> > > > > 
> > > > > This conflicts a bit with `These types are supported in all language modes, but not on all targets`.
> > > > > Why do we need emulation for a type that isn't necessarily supported on all targets?
> > > > > 
> > > > > My understanding is that inserting extensions and truncations serves two purposes:
> > > > > 1. Supporting a type that is designed to work on all targets. For now, that is only `__fp16`.
> > > > > 2. Supporting excess-precision=`standard`. This applies to both `_Float16` and `__bf16`.
> > > > > 
> > > > > Do you mean `_Float16`?
> > > > 
> > > > Yes, thank you.  I knew I'd screw that up somewhere.
> > > > 
> > > > > Why do we need emulation for a type that isn't necessarily supported on all targets?
> > > > 
> > > > Would this be clearer?
> > > > 
> > > > ```
> > > > Arithmetic on ``_Float16`` and ``__bf16`` is enabled on some targets that don't
> > > > provide native architectural support for arithmetic on these formats.  These
> > > > targets are noted in the lists of supported targets above.  On these targets,
> > > > Clang will perform the arithmetic in ``float``, inserting extensions and truncations
> > > > as necessary.
> > > > ```
> > > > 
> > > > > My understanding is that inserting extensions and truncations serves two purposes:
> > > > 
> > > > No, I believe we always insert extensions and truncations.  The cases you're describing are places we insert extensions and truncations in the *frontend*, so that the backend doesn't see operations on `half` / `bfloat` at all.  But when these operations do make it to the backend, and there's no direct architectural support for them on the target, the backend still just inserts extensions and truncations so it can do the arithmetic in `float`.  This is clearest in the ARM codegen (https://godbolt.org/z/q9KoGEYqb) because the conversions are just instructions, but you can also see it in the X86 codegen (https://godbolt.org/z/ejdd4P65W): all the runtime functions are just extensions/truncations, and the actual arithmetic is done with `mulss` and `addss`.  This frontend/backend distinction is not something that matters to users, so the documentation glosses over the difference.
> > > > 
> > > > I haven't done an exhaustive investigation, so it's possible that there are types and targets where we emit a compiler-rt call to do each operation instead, but those compiler-rt functions almost certainly just do an extension to float in the same way, so I don't think the documentation as written would be misleading for those targets, either.
> > > Thanks for the explanation! Sorry, I failed to make the distinction between "support" and "natively support"; I guess users may be confused at first too.
> > > 
> > > I agree the documentation should explain the compiler's overall behavior to users. I think there are 3 aspects we want to tell users:
> > > 
> > > 1. Whether a type is an arithmetic type, and whether it is (natively) supported by all targets or just a few;
> > > 2. Results for a type may not be consistent across different targets and/or excess-precision values;
> > > 3. The excess-precision control doesn't take effect if a type is natively supported by the target;
> > > 
> > > It would be clearer if we gave such a summary before the detailed explanation.
> > Does adding the text below to the top of the description make it clearer?
> > 
> > Half-Precision Floating Point
> > =============================
> > 
> > Clang supports three half-precision (16-bit) floating point types: ``__fp16``, ``_Float16`` and ``__bf16``. These types are supported in all language modes, but their support differs across targets. Here, it's important to understand the difference between "support" and "natively support":
> > 
> > - A type is "supported" if the compiler can handle code using that type, which might involve translating operations into equivalent code that the target hardware understands.
> > - A type is "natively supported" if the hardware itself understands the type and can perform operations on it directly. This typically yields better performance and more accurate results.
> > 
> > Another crucial aspect is whether results for a type are consistent across different targets and excess-precision values. Different hardware (targets) might produce slightly different results due to the level of precision they support and how they handle excess precision. This means the same code can yield different results when compiled for different hardware.
> > 
> > Finally, note that the control of excess-precision does not take effect if a type is natively supported by targets. If the hardware supports the type directly, the compiler does not need to (and cannot) use excess precision to potentially speed up the operations.
> > 
> > Given these points, here is the detailed support for each type:
> > 
> > - ``__fp16`` is supported on every target.
> > 
> > - ``_Float16`` is currently supported on the following targets:
> >   * 32-bit ARM (natively on some architecture versions)
> >   * 64-bit ARM (AArch64) (natively on ARMv8.2a and above)
> >   * AMDGPU (natively)
> >   * SPIR (natively)
> >   * X86 (if SSE2 is available; natively if AVX512-FP16 is also available)
> > 
> > - ``__bf16`` is currently supported on the following targets:
> >   * 32-bit ARM
> >   * 64-bit ARM (AArch64)
> >   * X86 (when SSE2 is available)
> > 
> > ...
> > ...
> I think that's a good basic idea, but it's okay to leave some of the detail for later.  How about this:
> 
> ```
> Clang supports three half-precision (16-bit) floating point types: ``__fp16``, ``_Float16`` and ``__bf16``. These types are supported in all language modes, but their support differs between targets.  A target is said to have "native support" for a type if the target processor offers instructions for directly performing basic arithmetic on that type.  In the absence of native support, a type can still be supported if the compiler can emulate arithmetic on the type by promoting to ``float``; see below for more information on this emulation.
> 
> * ``__fp16`` is supported on all targets.  The special semantics of this type mean that no arithmetic is ever performed directly on ``__fp16`` values; see below.
> 
> * ``_Float16`` is supported on the following targets: (...)
> 
> * ``__bf16`` is supported on the following targets (currently never natively): (...)
> ```
> 
> And then below we can adjust the paragraph about emulation:
> 
> ```
> When compiling arithmetic on ``_Float16`` and ``__bf16`` for a target without
> native support, Clang will perform the arithmetic in ``float``, inserting extensions
> and truncations as necessary.  This can be done in a way that exactly matches the
> operation-by-operation behavior of native support, but that can require many
> extra truncations and extensions.  By default, when emulating ``_Float16`` and
> ``__bf16`` arithmetic using ``float``, Clang does not truncate intermediate operands
> back to their true type unless the operand is the result of an explicit cast or
> assignment.  This is generally much faster but can generate different results from
> strict operation-by-operation emulation.  (Usually the results are more precise.)
> This is permitted by the C and C++ standards under the rules for excess precision
> in intermediate operands; see the discussion of evaluation formats in the C
> standard and [expr.pre] in the C++ standard.
> ```
This revision looks better. The contents are rather clear to me. Thanks!
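
For anyone following along, the difference between strict operation-by-operation emulation and the default excess-precision behavior can be sketched in portable C. This is only an illustration under stated assumptions, not how Clang implements it: `bf16_round`, `mul_add_strict`, and `mul_add_excess` are hypothetical helpers, `float` is assumed to be IEEE 754 binary32, and a bfloat16 value is modeled as a `float` whose low 16 bits are zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit IEEE 754 binary32 value. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: round a float to the nearest bfloat16 value
 * (ties to even) and return the result widened back to float.
 * NaN inputs are not handled in this sketch, so they are rejected. */
static float bf16_round(float x) {
    assert(x == x); /* NaN is the only value that compares unequal to itself */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t lsb = (bits >> 16) & 1u; /* lowest bit that survives truncation */
    bits += 0x7FFFu + lsb;            /* round to nearest, ties to even */
    bits &= 0xFFFF0000u;              /* drop the low 16 bits of the significand */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Strict emulation: truncate back to bfloat16 after every operation. */
static float mul_add_strict(float a, float b, float c) {
    float prod = bf16_round(a * b);
    return bf16_round(prod + c);
}

/* Excess precision: keep the intermediate product in float and
 * truncate only once, at the end of the expression. */
static float mul_add_excess(float a, float b, float c) {
    return bf16_round(a * b + c);
}
```

With `a = b = 1.0078125f` and `c = -1.015625f`, `mul_add_strict` returns `0.0f` because the product is rounded to 1.015625 before the addition, while `mul_add_excess` returns 2^-14 (`0.00006103515625f`), which is the exact result. This matches the "usually the results are more precise" remark in the suggested paragraph.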

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150913/new/

https://reviews.llvm.org/D150913