[llvm] [docs] Add guide for Undefined Behavior (PR #119220)

Tue Dec 10 01:12:07 PST 2024

================
@@ -0,0 +1,351 @@
+======================================
+LLVM IR Undefined Behavior (UB) Manual
+======================================
+
+.. contents::
+   :local:
+   :depth: 2
+
+Abstract
+========
+This document describes the undefined behavior (UB) in LLVM's IR, including
+undef and poison values, as well as the ``freeze`` instruction.
+We also provide guidelines on when to use each form of UB.
+
+
+Introduction
+============
+Undefined behavior is used to specify the behavior of corner cases for which we
+don't wish to specify the concrete results. UB is also used to provide
+additional constraints to the optimizers (e.g., assumptions that the frontend
+guarantees through the language type system or the runtime).
+For example, we could specify the result of division by zero as zero, but
+since we are not really interested in the result, we say it is UB.
+
+There are two forms of UB in LLVM: immediate UB and deferred UB (undef and
+poison values).
+The lattice of values in LLVM is:
+immediate UB > poison > undef > freeze > concrete value.
+
+
+Immediate UB
+============
+Immediate UB is the most severe form of UB. It should be avoided whenever
+possible.
+Immediate UB should be used only for operations that trap in most CPUs supported
+by LLVM.
+Examples include division by zero, dereferencing a null pointer, etc.
+
+The reason that immediate UB should be avoided is that it makes optimizations
+such as hoisting a lot harder.
+Consider the following example:
+
+.. code-block:: llvm
+
+    define i32 @f(i1 %c, i32 %v) {
+      br i1 %c, label %then, label %else
+
+    then:
+      %div = udiv i32 3, %v
+      br label %ret
+
+    else:
+      br label %ret
+
+    ret:
+      %r = phi i32 [ %div, %then ], [ 0, %else ]
+      ret i32 %r
+    }
+
+We might be tempted to simplify this function by removing the branching and
+executing the division speculatively because ``%c`` is true most of times.
+We would obtain the following IR:
+
+.. code-block:: llvm
+
+    define i32 @f(i1 %c, i32 %v) {
+      %div = udiv i32 3, %v
+      %r = select i1 %c, i32 %div, i32 0
+      ret i32 %r
+    }
+
+However, this transformation is not correct! Since division triggers UB
+when the divisor is zero, we can only execute speculatively if we are sure we
+don't hit that condition.
+For the function above, when called like ``f(false, 0)``, before the optimization
+it would return 0, and after the optimization it now triggers UB.
+
+This example highlights why we minimize the cases that trigger immediate UB
+as much as possible.
+As a rule of thumb, use immediate UB only for the cases that trap the CPU for
+most of the supported architectures.
+
+
+Deferred UB
+===========
+Deferred UB is a lighter form of UB. It enables instructions to be executed
+speculatively while marking some corner cases as having erroneous values.
+Deferred UB should be used for cases where the semantics offered by common
+CPUs differ, but the CPU does not trap.
+
+As an example, consider the shift instructions. The x86 and ARM architectures
+offer different semantics when the shift amount is equal to or greater than
+the bitwidth.
+We could solve this tension in one of two ways: 1) pick one of the x86/ARM
+semantics for LLVM, which would make the code emitted for the other architecture
+slower; 2) define that case as yielding ``poison``.
+LLVM chose the latter option. For frontends for languages like C or C++
+(e.g., clang), they can map shifts in the source program directly to a shift in
+LLVM IR, since the semantics of C and C++ define such shifts as UB.
+For languages that offer strong semantics, they must use the value of the shift
+conditionally, e.g.:
+
+.. code-block:: llvm
+
+    define i32 @x86_shift(i32 %a, i32 %b) {
+      %mask = and i32 %b, 31
+      %shift = shl i32 %a, %mask
+      ret i32 %shift
+    }
+
+
+There are two deferred UB values in LLVM: ``undef`` and ``poison``, which we
+describe next.
+
+
+Undef Values
+------------
+.. warning::
+   Undef values are deprecated and should be used only when strictly necessary.
+   No new uses should be added unless justified.
+
+An undef value represents any value of a given type. Moreover, each use of
+an instruction that depends on undef can observe a different value.
+For example:
+
+.. code-block:: llvm
+
+    define i32 @fn() {
+      %add = add i32 undef, 0
+      %ret = add i32 %add, %add
+      ret i32 %ret
+    }
+
+Unsurprisingly, the first addition yields ``undef``.
+However, the result of the second addition is more subtle. We might be tempted
+to think that it yields an even number. But it might not be!
+Since each (transitive) use of ``undef`` can observe a different value,
+the second addition is equivalent to ``add i32 undef, undef``, which is
+equivalent to ``undef``.
+Hence, the function above is equivalent to:
+
+.. code-block:: llvm
+
+    define i32 @fn() {
+      ret i32 undef
+    }
+
+Each call to this function may observe a different value, namely any 32-bit
+number (even and odd).
+
+Because each use of undef can observe a different value, some optimizations
+are wrong if we are not sure a value is not undef.
+Consider a function that multiplies a number by 2:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %v) {
+      %mul2 = mul i32 %v, 2
+      ret i32 %mul2
+    }
+
+This function is guaranteed to return an even number, even if ``%v`` is
+undef.
+However, as we've seen above, the following function does not:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %v) {
+      %mul2 = add i32 %v, %v
+      ret i32 %mul2
+    }
+
+This optimization is wrong just because undef values exist, even if they are
+not used in this part of the program as LLVM has no way to tell if ``%v`` is
+undef or not.
+
+.. note::
+   Uses of undef values should be restricted to representing loads of
+   uninitialized memory. This is the only part of the IR semantics that cannot
+   be replaced with alternatives yet (work in ongoing).
+
+Looking at the value lattice, ``undef`` values can only be replaced with either
+a ``freeze`` instruction or a concrete value.
+A consequence is that giving undef as an operand to an instruction that triggers
+UB for some values of that operand makes the program UB. For example,
+``udiv %x, undef`` is UB since we replace undef with 0 (``udiv %x, 0``),
+becoming obvious that it is UB.
+
+
+Poison Values
+-------------
+Poison values are a stronger from of deferred UB than undef. They still
----------------
antoniofrighetto wrote:

```suggestion
Poison values are a stronger form of deferred UB than undef. They still
```

https://github.com/llvm/llvm-project/pull/119220