[llvm] [docs] Add guide for Undefined Behavior (PR #119220)

Tue Dec 10 01:12:08 PST 2024

================
@@ -0,0 +1,351 @@
+======================================
+LLVM IR Undefined Behavior (UB) Manual
+======================================
+
+.. contents::
+   :local:
+   :depth: 2
+
+Abstract
+========
+This document describes the undefined behavior (UB) in LLVM's IR, including
+undef and poison values, as well as the ``freeze`` instruction.
+We also provide guidelines on when to use each form of UB.
+
+
+Introduction
+============
+Undefined behavior is used to specify the behavior of corner cases for which we
+don't wish to specify the concrete results. UB is also used to provide
+additional constraints to the optimizers (e.g., assumptions that the frontend
+guarantees through the language type system or the runtime).
+For example, we could specify the result of division by zero as zero, but
+since we are not really interested in the result, we say it is UB.
+
+There are two forms of UB in LLVM: immediate UB and deferred UB (undef and
+poison values).
+The lattice of values in LLVM is:
+immediate UB > poison > undef > freeze > concrete value.
+
+
+Immediate UB
+============
+Immediate UB is the most severe form of UB. It should be avoided whenever
+possible.
+Immediate UB should be used only for operations that trap in most CPUs supported
+by LLVM.
+Examples include division by zero, dereferencing a null pointer, etc.
+
+The reason that immediate UB should be avoided is that it makes optimizations
+such as hoisting a lot harder.
+Consider the following example:
+
+.. code-block:: llvm
+
+    define i32 @f(i1 %c, i32 %v) {
+      br i1 %c, label %then, label %else
+
+    then:
+      %div = udiv i32 3, %v
+      br label %ret
+
+    else:
+      br label %ret
+
+    ret:
+      %r = phi i32 [ %div, %then ], [ 0, %else ]
+      ret i32 %r
+    }
+
+We might be tempted to simplify this function by removing the branching and
+executing the division speculatively because ``%c`` is true most of the time.
+We would obtain the following IR:
+
+.. code-block:: llvm
+
+    define i32 @f(i1 %c, i32 %v) {
+      %div = udiv i32 3, %v
+      %r = select i1 %c, i32 %div, i32 0
+      ret i32 %r
+    }
+
+However, this transformation is not correct! Since division triggers UB
+when the divisor is zero, we can only execute speculatively if we are sure we
+don't hit that condition.
+For the function above, when called like ``f(false, 0)``, before the optimization
+it would return 0, and after the optimization it now triggers UB.
+
+This example highlights why we minimize the cases that trigger immediate UB
+as much as possible.
+As a rule of thumb, use immediate UB only for the cases that trap the CPU for
+most of the supported architectures.
+
+
+Deferred UB
+===========
+Deferred UB is a lighter form of UB. It enables instructions to be executed
+speculatively while marking some corner cases as having erroneous values.
+Deferred UB should be used for cases where the semantics offered by common
+CPUs differ, but the CPU does not trap.
+
+As an example, consider the shift instructions. The x86 and ARM architectures
+offer different semantics when the shift amount is equal to or greater than
+the bitwidth.
+We could solve this tension in one of two ways: 1) pick one of the x86/ARM
+semantics for LLVM, which would make the code emitted for the other architecture
+slower; 2) define that case as yielding ``poison``.
+LLVM chose the latter option. Frontends for languages like C or C++
+(e.g., clang) can map shifts in the source program directly to a shift in
+LLVM IR, since the semantics of C and C++ define such shifts as UB.
+Frontends for languages that offer strong semantics must use the value of the
+shift conditionally, e.g.:
+
+.. code-block:: llvm
+
+    define i32 @x86_shift(i32 %a, i32 %b) {
+      %mask = and i32 %b, 31
+      %shift = shl i32 %a, %mask
+      ret i32 %shift
+    }
+
+
+There are two deferred UB values in LLVM: ``undef`` and ``poison``, which we
+describe next.
+
+
+Undef Values
+------------
+.. warning::
+   Undef values are deprecated and should be used only when strictly necessary.
+   No new uses should be added unless justified.
+
+An undef value represents any value of a given type. Moreover, each use of
+an instruction that depends on undef can observe a different value.
+For example:
+
+.. code-block:: llvm
+
+    define i32 @fn() {
+      %add = add i32 undef, 0
+      %ret = add i32 %add, %add
+      ret i32 %ret
+    }
+
+Unsurprisingly, the first addition yields ``undef``.
+However, the result of the second addition is more subtle. We might be tempted
+to think that it yields an even number. But it might not be!
+Since each (transitive) use of ``undef`` can observe a different value,
+the second addition is equivalent to ``add i32 undef, undef``, which is
+equivalent to ``undef``.
+Hence, the function above is equivalent to:
+
+.. code-block:: llvm
+
+    define i32 @fn() {
+      ret i32 undef
+    }
+
+Each call to this function may observe a different value, namely any 32-bit
+number (even and odd).
+
+Because each use of undef can observe a different value, some optimizations
+are wrong if we are not sure a value is not undef.
+Consider a function that multiplies a number by 2:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %v) {
+      %mul2 = mul i32 %v, 2
+      ret i32 %mul2
+    }
+
+This function is guaranteed to return an even number, even if ``%v`` is
+undef.
+However, as we've seen above, the following function does not:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %v) {
+      %mul2 = add i32 %v, %v
+      ret i32 %mul2
+    }
+
+This optimization is wrong just because undef values exist, even if they are
+not used in this part of the program, as LLVM has no way to tell if ``%v`` is
+undef or not.
+
+.. note::
+   Uses of undef values should be restricted to representing loads of
+   uninitialized memory. This is the only part of the IR semantics that cannot
+   be replaced with alternatives yet (work is ongoing).
+
+Looking at the value lattice, ``undef`` values can only be replaced with either
+a ``freeze`` instruction or a concrete value.
+A consequence is that giving undef as an operand to an instruction that triggers
+UB for some values of that operand makes the program UB. For example,
+``udiv %x, undef`` is UB since we can replace undef with 0 (``udiv %x, 0``),
+at which point it is obvious that it is UB.
+
+
+Poison Values
+-------------
+Poison values are a stronger form of deferred UB than undef. They still
+allow instructions to be executed speculatively, but they taint the whole
+expression DAG (with some exceptions), akin to floating point NaN values.
+
+Example:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %a, i32 %b, i32 %c) {
+      %add = add nsw i32 %a, %b
+      %ret = add nsw i32 %add, %c
+      ret i32 %ret
+    }
+
+The ``nsw`` flag in the additions indicates that the operation yields
+poison if there is a signed overflow.
+If the first addition overflows, ``%add`` is poison and thus ``%ret`` is also
+poison since it taints the whole expression DAG.
+
+Poison values can be replaced with any value of type (undef, concrete values,
----------------
antoniofrighetto wrote:

Poison can be replaced with undef, but we have been implying that this use is restricted, since ideally we would like to move from poison to freeze. Don't we necessarily have to pass through freeze before replacing poison with a concrete value?
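
For reference, a minimal sketch of the freeze step the question refers to. It is only an illustration: the function and value names are hypothetical and do not come from the patch.

```llvm
; Hypothetical example, not part of the patch under review.
define i32 @freeze_example(i32 %a, i32 %b) {
  ; %add is poison if the signed addition overflows.
  %add = add nsw i32 %a, %b
  ; If %add is poison, freeze yields an arbitrary but fixed i32;
  ; otherwise it yields %add unchanged.
  %fr = freeze i32 %add
  ; Every use of %fr observes the same value, so %r is always 0.
  %r = sub i32 %fr, %fr
  ret i32 %r
}
```

Without the `freeze`, `%r` would itself be poison whenever `%add` is poison.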

https://github.com/llvm/llvm-project/pull/119220