[PATCH] D150913: [Clang][Bfloat16] Upgrade __bf16 to arithmetic type, change mangling, and extend excess precision support.

Wed May 24 11:07:25 PDT 2023

rjmccall added inline comments.

================
Comment at: clang/docs/LanguageExtensions.rst:852
 ``double`` when passed to ``printf``, so the programmer must explicitly cast it to
 ``double`` before using it with an ``%f`` or similar specifier.

----------------
pengfei wrote:
> rjmccall wrote:
> > Suggested rework:
> > 
> > ```
> > Clang supports three half-precision (16-bit) floating point types: ``__fp16``,
> > ``_Float16`` and ``__bf16``.  These types are supported in all language
> > modes, but not on all targets:
> > 
> > - ``__fp16`` is supported on every target.
> > 
> > - ``_Float16`` is currently supported on the following targets:
> >   * 32-bit ARM (natively on some architecture versions)
> >   * 64-bit ARM (AArch64) (natively on ARMv8.2a and above)
> >   * AMDGPU (natively)
> >   * SPIR (natively)
> >   * X86 (if SSE2 is available; natively if AVX512-FP16 is also available)
> > 
> > - ``__bf16`` is currently supported on the following targets:
> >   * 32-bit ARM
> >   * 64-bit ARM (AArch64)
> >   * X86 (when SSE2 is available)
> > 
> > (For X86, SSE2 is available on 64-bit and all recent 32-bit processors.)
> > 
> > ``__fp16`` and ``_Float16`` both use the binary16 format from IEEE
> > 754-2008, which provides a 5-bit exponent and an 11-bit significand
> > (counting the implicit leading 1).  ``__bf16`` uses the `bfloat16
> > <https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>`_ format,
> > which provides an 8-bit exponent and an 8-bit significand; this is the same
> > exponent range as ``float``, just with greatly reduced precision.
> > 
> > ``_Float16`` and ``__bf16`` follow the usual rules for arithmetic
> > floating-point types.  Most importantly, this means that arithmetic operations
> > on operands of these types are formally performed in the type and produce
> > values of the type.  ``__fp16`` does not follow those rules: most operations
> > immediately promote operands of type ``__fp16`` to ``float``, and so
> > arithmetic operations are defined to be performed in ``float`` and so result in
> > a value of type ``float`` (unless further promoted because of other operands).
> > See below for more information on the exact specifications of these types.
> > 
> > Only some of the supported processors for ``__fp16`` and ``__bf16`` offer
> > native hardware support for arithmetic in their corresponding formats.
> > The exact conditions are described in the lists above.  When compiling for a
> > processor without native support, Clang will perform the arithmetic in
> > ``float``, inserting extensions and truncations as necessary.  This can be
> > done in a way that exactly emulates the behavior of hardware support for
> > arithmetic, but it can require many extra operations.  By default, Clang takes
> > advantage of the C standard's allowances for excess precision in intermediate
> > operands in order to eliminate intermediate truncations within statements.
> > This is generally much faster but can generate different results from strict
> > operation-by-operation emulation.
> > 
> > The use of excess precision can be independently controlled for these two
> > types with the ``-ffloat16-excess-precision=`` and
> > ``-fbfloat16-excess-precision=`` options.  Valid values include:
> > - ``none`` (meaning to perform strict operation-by-operation emulation)
> > - ``standard`` (meaning that excess precision is permitted under the rules
> >   described in the standard, i.e. never across explicit casts or statements)
> > - ``fast`` (meaning that excess precision is permitted whenever the
> >   optimizer sees an opportunity to avoid truncations; currently this has no
> >   effect beyond ``standard``)
> > 
> > The ``_Float16`` type is an interchange floating type specified in
> > ISO/IEC TS 18661-3:2015 ("Floating-point extensions for C").  It will
> > be supported on more targets as they define ABIs for it.
> > 
> > The ``__bf16`` type is a non-standard extension, but it generally follows
> > the rules for arithmetic interchange floating types from ISO/IEC TS
> > 18661-3:2015.  In previous versions of Clang, it was a storage-only type
> > that forbade arithmetic operations.  It will be supported on more targets
> > as they define ABIs for it.
> > 
> > The ``__fp16`` type was originally an ARM extension and is specified
> > by the `ARM C Language Extensions <https://github.com/ARM-software/acle/releases>`_.
> > Clang uses the ``binary16`` format from IEEE 754-2008 for ``__fp16``,
> > not the ARM alternative format.  Operators that expect arithmetic operands
> > immediately promote ``__fp16`` operands to ``float``.
> > 
> > It is recommended that portable code use ``_Float16`` instead of ``__fp16``,
> > as it has been defined by the C standards committee and has behavior that is
> > more familiar to most programmers.
> > 
> > Because ``__fp16`` operands are always immediately promoted to ``float``, the
> > common real type of ``__fp16`` and ``_Float16`` for the purposes of the usual
> > arithmetic conversions is ``float``.
> > 
> > A literal can be given ``_Float16`` type using the suffix ``f16``. For example,
> > ``3.14f16``.
> > 
> > Because default argument promotion only applies to the standard floating-point
> > types, ``_Float16`` values are not promoted to ``double`` when passed as variadic
> > or untyped arguments.  As a consequence, some caution must be taken when using
> > certain library facilities with ``_Float16``; for example, there is no ``printf`` format
> > specifier for ``_Float16``, and (unlike ``float``) it will not be implicitly promoted to
> > ``double`` when passed to ``printf``, so the programmer must explicitly cast it to
> > ``double`` before using it with an ``%f`` or similar specifier.
> > ```
> ```
> Only some of the supported processors for ``__fp16`` and ``__bf16`` offer
> native hardware support for arithmetic in their corresponding formats.
> ```
> 
> Do you mean ``_Float16``?
> 
> ```
> The exact conditions are described in the lists above.  When compiling for a
> processor without native support, Clang will perform the arithmetic in
> ``float``, inserting extensions and truncations as necessary.
> ```
> 
> This conflicts somewhat with `These types are supported in all language modes, but not on all targets`.
> Why do we need to emulate a type that isn't necessarily supported on all targets?
> 
> My understanding is that inserting extensions and truncations serves two purposes:
> 1. Supporting a type that is designed to be available on all targets. For now, that is only `__fp16`.
> 2. Supporting excess-precision=`standard`. This applies to both `_Float16` and `__bf16`.
> 
> Do you mean `_Float16`?

Yes, thank you.  I knew I'd screw that up somewhere.

> Why do we need to emulate a type that isn't necessarily supported on all targets?

Would this be clearer?

```
Arithmetic on ``_Float16`` and ``__bf16`` is enabled on some targets that don't
provide native architectural support for arithmetic on these formats.  These
targets are noted in the lists of supported targets above.  On these targets,
Clang will perform the arithmetic in ``float``, inserting extensions and truncations
as necessary.
```

> My understanding is that inserting extensions and truncations serves two purposes:

No, I believe we always insert extensions and truncations.  The cases you're describing are places we insert extensions and truncations in the *frontend*, so that the backend doesn't see operations on `half` / `bfloat` at all.  But when these operations do make it to the backend, and there's no direct architectural support for them on the target, the backend still just inserts extensions and truncations so it can do the arithmetic in `float`.  This is clearest in the ARM codegen (https://godbolt.org/z/q9KoGEYqb) because the conversions are just instructions, but you can also see it in the X86 codegen (https://godbolt.org/z/ejdd4P65W): all the runtime functions are just extensions/truncations, and the actual arithmetic is done with `mulss` and `addss`.  This frontend/backend distinction is not something that matters to users, so the documentation glosses over the difference.

I haven't done an exhaustive investigation, so it's possible that there are types and targets where we emit a compiler-rt call to do each operation instead, but those compiler-rt functions almost certainly just do an extension to float in the same way, so I don't think the documentation as written would be misleading for those targets, either.
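
To make the extend / compute in `float` / truncate pattern concrete, here is a minimal C sketch that emulates bfloat16 arithmetic in software. It is only an illustration under stated assumptions: the helper names (`bf16_extend`, `bf16_truncate`, `muladd_strict`, `muladd_excess`) are hypothetical and are not Clang or compiler-rt functions, truncation is assumed to round to nearest even, and NaN inputs are not handled. It also shows why keeping the intermediate product in `float` (as the excess-precision modes allow) can give a different answer from truncating after every operation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type for this sketch: the raw 16 bits of a bfloat16. */
typedef uint16_t bf16_bits;

/* Extension: bfloat16 -> float is exact. The 16 bits become the high half
   of an IEEE binary32 value and the low half is zero. */
float bf16_extend(bf16_bits h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Truncation: float -> bfloat16, rounding to nearest even (an assumption of
   this sketch). NaN inputs are not handled, so they are rejected up front. */
bf16_bits bf16_truncate(float f) {
    uint32_t u;
    assert(f == f && "NaN inputs are not handled in this sketch");
    memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u); /* round half to even on the dropped bits */
    return (bf16_bits)(u >> 16);
}

/* One emulated operation: extend both operands, compute in float, truncate. */
bf16_bits bf16_mul(bf16_bits a, bf16_bits b) {
    return bf16_truncate(bf16_extend(a) * bf16_extend(b));
}

bf16_bits bf16_add(bf16_bits a, bf16_bits b) {
    return bf16_truncate(bf16_extend(a) + bf16_extend(b));
}

/* a * b + c with a truncation after every operation
   (strict operation-by-operation emulation). */
bf16_bits muladd_strict(bf16_bits a, bf16_bits b, bf16_bits c) {
    return bf16_add(bf16_mul(a, b), c);
}

/* a * b + c with the intermediate product kept in float and a single
   truncation at the end (excess precision within the expression). */
bf16_bits muladd_excess(bf16_bits a, bf16_bits b, bf16_bits c) {
    return bf16_truncate(bf16_extend(a) * bf16_extend(b) + bf16_extend(c));
}
```

For example, with `a = b = 0x3F81` (1 + 2^-7) and `c = 0xBF82` (-(1 + 2^-6)), `muladd_strict` returns `0x0000` because the product is rounded to 1 + 2^-6 before the addition, while `muladd_excess` returns `0x3880` (2^-14) because the low bit of the product survives until the final truncation.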

https://reviews.llvm.org/D150913