[llvm] 0100c63 - [docs] Add guide about Undefined Behavior (#119220)

Wed Dec 11 04:23:55 PST 2024

Author: Nuno Lopes
Date: 2024-12-11T12:23:51Z
New Revision: 0100c631f85480ecdf1b35f2aedbfc0200a81174

URL: https://github.com/llvm/llvm-project/commit/0100c631f85480ecdf1b35f2aedbfc0200a81174
DIFF: https://github.com/llvm/llvm-project/commit/0100c631f85480ecdf1b35f2aedbfc0200a81174.diff

LOG: [docs] Add guide about Undefined Behavior (#119220)


Thanks Antonio Frighetto, John Regehr, and Nikita Popov for reviewing this!

Added: 
    llvm/docs/UndefinedBehavior.rst

Modified: 
    llvm/docs/Reference.rst

Removed: 
    


################################################################################
diff  --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst
index df61628b06c7db..2cae9186d7f9bd 100644

--- a/llvm/docs/Reference.rst
+++ b/llvm/docs/Reference.rst
@@ -52,6 +52,7 @@ LLVM and API reference documentation.
    TestingGuide
    TransformMetadata
    TypeMetadata
+   UndefinedBehavior
    XRay
    XRayExample
    XRayFDRFormat
@@ -120,6 +121,9 @@ LLVM IR
   Defines the LLVM intermediate representation and the assembly form of the
   
diff erent nodes.
 
+:doc:`Undefined Behavior (UB) <UndefinedBehavior>`
+  A guide on what UB/undef/poison are and when to use each one.
+
 :doc:`InAlloca`
   Description of the ``inalloca`` argument attribute.
 

diff  --git a/llvm/docs/UndefinedBehavior.rst b/llvm/docs/UndefinedBehavior.rst
new file mode 100644
index 00000000000000..f68bbbd505330a
--- /dev/null
+++ b/llvm/docs/UndefinedBehavior.rst
@@ -0,0 +1,393 @@
+======================================
+LLVM IR Undefined Behavior (UB) Manual
+======================================
+
+.. contents::
+   :local:
+   :depth: 2
+
+Abstract
+========
+This document describes the undefined behavior (UB) in LLVM's IR, including
+undef and poison values, as well as the ``freeze`` instruction.
+We also provide guidelines on when to use each form of UB.
+
+
+Introduction
+============
+Undefined behavior (UB) is used to specify the behavior of corner cases for
+which we don't wish to specify the concrete results. UB is also used to provide
+additional constraints to the optimizers (e.g., assumptions that the frontend
+guarantees through the language type system or the runtime).
+For example, we could specify the result of division by zero as zero, but
+since we are not really interested in the result, we say it is UB.
+
+There exist two forms of undefined behavior in LLVM: immediate UB and deferred
+UB. The latter comes in two flavors: undef and poison values.
+There is also a ``freeze`` instruction to tame the propagation of deferred UB.
+The lattice of values in LLVM is:
+immediate UB > poison > undef > freeze(poison) > concrete value.
+
+We explain each of the concepts in detail below.
+
+
+Immediate UB
+============
+Immediate UB is the most severe form of UB. It should be avoided whenever
+possible.
+Immediate UB should be used only for operations that trap in most CPUs supported
+by LLVM.
+Examples include division by zero, dereferencing a null pointer, etc.
+
+The reason that immediate UB should be avoided is that it makes optimizations
+such as hoisting a lot harder.
+Consider the following example:
+
+.. code-block:: llvm
+
+    define i32 @f(i1 %c, i32 %v) {
+      br i1 %c, label %then, label %else
+
+    then:
+      %div = udiv i32 3, %v
+      br label %ret
+
+    else:
+      br label %ret
+
+    ret:
+      %r = phi i32 [ %div, %then ], [ 0, %else ]
+      ret i32 %r
+    }
+
+We might be tempted to simplify this function by removing the branching and
+executing the division speculatively because ``%c`` is true most of times.
+We would obtain the following IR:
+
+.. code-block:: llvm
+
+    define i32 @f(i1 %c, i32 %v) {
+      %div = udiv i32 3, %v
+      %r = select i1 %c, i32 %div, i32 0
+      ret i32 %r
+    }
+
+However, this transformation is not correct! Since division triggers UB
+when the divisor is zero, we can only execute speculatively if we are sure we
+don't hit that condition.
+The function above, when called as ``f(false, 0)``, would return 0 before the
+optimization, and triggers UB after being optimized.
+
+This example highlights why we minimize the cases that trigger immediate UB
+as much as possible.
+As a rule of thumb, use immediate UB only for the cases that trap the CPU for
+most of the supported architectures.
+
+
+Time Travel
+-----------
+Immediate UB in LLVM IR allows the so-called time travelling. What this means
+is that if a program triggers UB, then we are not required to preserve any of
+its observable behavior, including I/O.
+For example, the following function triggers UB after calling ``printf``:
+
+.. code-block:: llvm
+
+    define void @fn() {
+      call void @printf(...) willreturn
+      unreachable
+    }
+
+Since we know that ``printf`` will always return, and because LLVM's UB can
+time-travel, it is legal to remove the call to ``printf`` altogether and
+optimize the function to simply:
+
+.. code-block:: llvm
+
+    define void @fn() {
+      unreachable
+    }
+
+
+Deferred UB
+===========
+Deferred UB is a lighter form of UB. It enables instructions to be executed
+speculatively while marking some corner cases as having erroneous values.
+Deferred UB should be used for cases where the semantics offered by common
+CPUs 
diff er, but the CPU does not trap.
+
+As an example, consider the shift instructions. The x86 and ARM architectures
+offer 
diff erent semantics when the shift amount is equal to or greater than
+the bitwidth.
+We could solve this tension in one of two ways: 1) pick one of the x86/ARM
+semantics for LLVM, which would make the code emitted for the other architecture
+slower; 2) define that case as yielding ``poison``.
+LLVM chose the latter option. For frontends for languages like C or C++
+(e.g., clang), they can map shifts in the source program directly to a shift in
+LLVM IR, since the semantics of C and C++ define such shifts as UB.
+For languages that offer strong semantics, they must use the value of the shift
+conditionally, e.g.:
+
+.. code-block:: llvm
+
+    define i32 @x86_shift(i32 %a, i32 %b) {
+      %mask = and i32 %b, 31
+      %shift = shl i32 %a, %mask
+      ret i32 %shift
+    }
+
+
+There are two deferred UB values in LLVM: ``undef`` and ``poison``, which we
+describe next.
+
+
+Undef Values
+------------
+.. warning::
+   Undef values are deprecated and should be used only when strictly necessary.
+   Uses of undef values should be restricted to representing loads of
+   uninitialized memory. This is the only part of the IR semantics that cannot
+   be replaced with alternatives yet (work in ongoing).
+
+An undef value represents any value of a given type. Moreover, each use of
+an instruction that depends on undef can observe a 
diff erent value.
+For example:
+
+.. code-block:: llvm
+
+    define i32 @fn() {
+      %add = add i32 undef, 0
+      %ret = add i32 %add, %add
+      ret i32 %ret
+    }
+
+Unsurprisingly, the first addition yields ``undef``.
+However, the result of the second addition is more subtle. We might be tempted
+to think that it yields an even number. But it might not be!
+Since each (transitive) use of ``undef`` can observe a 
diff erent value,
+the second addition is equivalent to ``add i32 undef, undef``, which is
+equivalent to ``undef``.
+Hence, the function above is equivalent to:
+
+.. code-block:: llvm
+
+    define i32 @fn() {
+      ret i32 undef
+    }
+
+Each call to this function may observe a 
diff erent value, namely any 32-bit
+number (even and odd).
+
+Because each use of undef can observe a 
diff erent value, some optimizations
+are wrong if we are not sure a value is not undef.
+Consider a function that multiplies a number by 2:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %v) {
+      %mul2 = mul i32 %v, 2
+      ret i32 %mul2
+    }
+
+This function is guaranteed to return an even number, even if ``%v`` is
+undef.
+However, as we've seen above, the following function does not:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %v) {
+      %mul2 = add i32 %v, %v
+      ret i32 %mul2
+    }
+
+This optimization is wrong just because undef values exist, even if they are
+not used in this part of the program as LLVM has no way to tell if ``%v`` is
+undef or not.
+
+Looking at the value lattice, ``undef`` values can only be replaced with either
+a ``freeze`` instruction or a concrete value.
+A consequence is that giving undef as an operand to an instruction that triggers
+UB for some values of that operand makes the program UB. For example,
+``udiv %x, undef`` is UB since we replace undef with 0 (``udiv %x, 0``),
+becoming obvious that it is UB.
+
+
+Poison Values
+-------------
+Poison values are a stronger form of deferred UB than undef. They still
+allow instructions to be executed speculatively, but they taint the whole
+expression DAG (with some exceptions), akin to floating point NaN values.
+
+Example:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %a, i32 %b, i32 %c) {
+      %add = add nsw i32 %a, %b
+      %ret = add nsw i32 %add, %c
+      ret i32 %ret
+    }
+
+The ``nsw`` attribute in the additions indicates that the operation yields
+poison if there is a signed overflow.
+If the first addition overflows, ``%add`` is poison and thus ``%ret`` is also
+poison since it taints the whole expression DAG.
+
+Poison values can be replaced with any value of type (undef, concrete values,
+or a ``freeze`` instruction).
+
+
+The Freeze Instruction
+======================
+Both undef and poison values sometimes propagate too much down an expression
+DAG. Undef values because each transitive use can observe a 
diff erent value,
+and poison values because they make the whole DAG poison.
+There are some cases where it is important to stop such propagation.
+This is where the ``freeze`` instruction comes in.
+
+Take the following example function:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %n, i1 %c) {
+    entry:
+      br label %loop
+
+   loop:
+      %i = phi i32 [ 0, %entry ], [ %i2, %loop.end ]
+      %cond = icmp ule i32 %i, %n
+      br i1 %cond, label %loop.cont, label %exit
+
+   loop.cont:
+      br i1 %c, label %then, label %else
+
+    then:
+      ...
+      br label %loop.end
+
+    else:
+      ...
+      br label %loop.end
+
+    loop.end:
+      %i2 = add i32 %i, 1
+      br label %loop
+
+    exit:
+      ...
+    }
+
+Imagine we want to perform loop unswitching on the loop above since the branch
+condition inside the loop is loop invariant.
+We would obtain the following IR:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %n, i1 %c) {
+    entry:
+      br i1 %c, label %then, label %else
+
+   then:
+      %i = phi i32 [ 0, %entry ], [ %i2, %then.cont ]
+      %cond = icmp ule i32 %i, %n
+      br i1 %cond, label %then.cont, label %exit
+
+   then.cont:
+      ...
+      %i2 = add i32 %i, 1
+      br label %then
+
+   else:
+      %i3 = phi i32 [ 0, %entry ], [ %i4, %else.cont ]
+      %cond = icmp ule i32 %i3, %n
+      br i1 %cond, label %else.cont, label %exit
+
+   else.cont:
+      ...
+      %i4 = add i32 %i3, 1
+      br label %else
+
+    exit:
+      ...
+    }
+
+There is a subtle catch: when the function is called with ``%n`` being zero,
+the original function did not branch on ``%c``, while the optimized one does.
+Branching on a deferred UB value is immediate UB, hence the transformation is
+wrong in general because ``%c`` may be undef or poison.
+
+Cases like this need a way to tame deferred UB values. This is exactly what the
+``freeze`` instruction is for!
+When given a concrete value as argument, ``freeze`` is a no-op, returning the
+argument as-is. When given an undef or poison value, ``freeze`` returns a
+non-deterministic value of the type.
+This is not the same as undef: the value returned by ``freeze`` is the same
+for all users.
+
+Branching on a value returned by ``freeze`` is always safe since it either
+evaluates to true or false consistently.
+We can make the loop unswitching optimization above correct as follows:
+
+.. code-block:: llvm
+
+    define i32 @fn(i32 %n, i1 %c) {
+    entry:
+      %c2 = freeze i1 %c
+      br i1 %c2, label %then, label %else
+
+
+Writing Tests Without Undefined Behavior
+========================================
+
+When writing tests, it is important to ensure that they don't trigger UB
+unnecessarily. Some automated test reduces sometimes use undef or poison
+values as dummy values, but this is considered a bad practice if this leads
+to triggering UB.
+
+For example, imagine that we want to write a test and we don't care about the
+particular divisor value because our optimization kicks in regardless:
+
+.. code-block:: llvm
+
+    define i32 @fn(i8 %a) {
+      %div = udiv i8 %a, poison
+      ...
+   }
+
+The issue with this test is that it triggers immediate UB. This prevents
+verification tools like Alive from validating the correctness of the
+optimization. Hence, it is considered a bad practice to have tests with
+unnecessary immediate UB (unless that is exactly what the test is for).
+The test above should use a dummy function argument instead of using poison:
+
+.. code-block:: llvm
+
+    define i32 @fn(i8 %a, i8 %dummy) {
+      %div = udiv i8 %a, %dummy
+      ...
+   }
+
+Common sources of immediate UB in tests include branching on undef/poison
+conditions and dereferencing undef/poison/null pointers.
+
+.. note::
+   If you need a placeholder value to pass as an argument to an instruction
+   that may trigger UB, add a new argument to the function rather than using
+   undef or poison.
+
+
+Summary
+=======
+Undefined behavior (UB) in LLVM IR consists of two well-defined concepts:
+immediate and deferred UB (undef and poison values).
+Passing deferred UB values to certain operations leads to immediate UB.
+This can be avoided in some cases through the use of the ``freeze``
+instruction.
+
+The lattice of values in LLVM is:
+immediate UB > poison > undef > freeze(poison) > concrete value.
+It is only valid to transform values from the left to the right (e.g., a poison
+value can be replaced with a concrete value, but not the other way around).
+
+Undef is now deprecated and should be used only to represent loads of
+uninitialized memory.