[llvm-dev] [RFC] Introducing a byte type to LLVM

Sun Jun 20 08:55:34 PDT 2021

On 13 Jun 2021, at 16:22, Ralf Jung via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> "Union" of provenance is currently not an operation that is required to model LLVM IR, so your proposal would necessitate adding such a concept. It'll be interesting to figure out how "getelementptr inbounds" behaves on multi-provenance pointers...

Union provenance is required if you want an XOR linked list to be valid.  These are pretty rare, but there are some idioms (including the calculation of per-CPU storage in the FreeBSD kernel) that depend on multi-provenance semantics.
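
For readers who have not met the idiom, here is a minimal sketch of an XOR linked list in portable C. It is not taken from the FreeBSD code; the names `xnode`, `xnode_step`, `xlist_link3` and `xlist_sum` are made up for illustration, and it assumes a mainstream flat-address platform where a null pointer converts to the integer 0 and back. The point is that the pointer returned by `xnode_step` is rebuilt from integers that came from two different objects, so no single object can be named as its provenance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Node of an XOR linked list: `link` stores addr(prev) ^ addr(next). */
struct xnode {
    int value;
    uintptr_t link;
};

/* Given the node we came from and the current node, recover the node on
 * the other side.  The result is computed from cur->link (which mixes the
 * addresses of two neighbours) and the address of `from`. */
static struct xnode *xnode_step(const struct xnode *from,
                                const struct xnode *cur)
{
    assert(cur != NULL);
    return (struct xnode *)(cur->link ^ (uintptr_t)from);
}

/* Link three separately allocated nodes as a <-> b <-> c. */
static void xlist_link3(struct xnode *a, struct xnode *b, struct xnode *c)
{
    a->link = (uintptr_t)NULL ^ (uintptr_t)b;
    b->link = (uintptr_t)a ^ (uintptr_t)c;
    c->link = (uintptr_t)b ^ (uintptr_t)NULL;
}

/* Walk the list from either end and add up the values. */
static int xlist_sum(struct xnode *end)
{
    struct xnode *prev = NULL;
    struct xnode *cur = end;
    int sum = 0;

    while (cur != NULL) {
        struct xnode *next = xnode_step(prev, cur);
        sum += cur->value;
        prev = cur;
        cur = next;
    }
    return sum;
}
```

Stepping from `a` to `b` to `c` yields a pointer to `c` whose bits depend on both `a` and the link stored in `b`, which is exactly the case a single-provenance model cannot describe.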

CHERI systems, such as the Morello boards that Arm is shipping early next year, provide a hardware-enforced single-provenance semantics, which might provide some inspiration for this discussion:

In 64-bit CHERI implementations, memory capabilities are a 128-bit type protected by a tag bit (in memory and registers) that signifies that it has been derived from one of the capabilities provided in a register on hardware reset.  Any operation that would violate the monotonicity of rights (e.g. overwriting a single byte in a valid capability in memory) clears the tag bit, destroying its provenance and causing a trap if you try to use it as the base for a load or store instruction.  When compiling from C-family languages, the memory capability is the hardware type to which all pointer types are lowered.

Our clang port to target these architectures defines a new built-in type, `__intcap_t`, which is used to represent `intptr_t`.  When we emit LLVM IR, we lower this to an LLVM IR pointer type, not an integer type.  All C operations on `__intcap_t` are defined to take provenance from the left operand, with a warning if we can statically show that this is probably wrong.
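
To make the left-operand rule concrete, here is a toy model in plain C. It is only a sketch: `struct toy_intcap` and the `toy_*` functions are hypothetical, the field layout has nothing to do with the real capability encoding, and a real `__intcap_t` needs a CHERI toolchain. It shows why the operand order matters: `p + 4` and `4 + p` produce the same address, but only the first keeps the bounds and tag of `p`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a capability-carrying integer (NOT the real layout). */
struct toy_intcap {
    uint64_t address; /* the integer value that C arithmetic sees */
    uint64_t base;    /* lower bound of the object it was derived from */
    uint64_t length;  /* size of that object in bytes */
    bool tag;         /* true only if derived from a valid capability */
};

/* A plain integer converted to the toy type carries no provenance. */
static struct toy_intcap toy_from_int(uint64_t v)
{
    struct toy_intcap c = { v, 0, 0, false };
    return c;
}

/* Addition: the address is the sum, everything else is copied from the
 * left operand, mirroring the "provenance from the left operand" rule. */
static struct toy_intcap toy_add(struct toy_intcap lhs, struct toy_intcap rhs)
{
    struct toy_intcap r = lhs;
    r.address = lhs.address + rhs.address;
    return r;
}

/* Would a one-byte access through this value be allowed in the toy model? */
static bool toy_can_access(struct toy_intcap c)
{
    return c.tag && c.address >= c.base && c.address - c.base < c.length;
}
```

In this model, adding 4 to a valid 16-byte capability gives a usable result, while adding the capability to 4 gives the same address with no tag, which is the kind of case the warning is meant to catch.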

In our model, at the IR level, `ptrtoint` is fine, but the integer does not carry provenance.  `inttoptr` is not permitted[1].  If code wants to extract an address from a pointer for comparison or for hashing (for example), that’s fine, but it can’t turn the integer back into a pointer directly.  If pointers flow around as integers in the C sources, the may-be-a-pointer-in-C types are lowered to pointer types in the IR.  This would be easier if C arithmetic operations were defined on pointer types.  We currently have to use an IR intrinsic to get the address, then do the arithmetic, then reapply the result to the pointer type, and try to fold that again in the back end.
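
The get-address / do-arithmetic / reapply pattern has a portable C analogue, sketched below. The helper names `addr_of`, `with_address` and `align_up` are hypothetical (they are not the intrinsics we use), and the code assumes a flat address space where `uintptr_t` holds the byte address. The integer is used only to compute the new address; the result is always derived from the original pointer by pointer arithmetic, so no integer is ever cast back to a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integer view of a pointer's address, analogous to `ptrtoint`.
 * The returned integer carries no provenance. */
static uintptr_t addr_of(const void *p)
{
    return (uintptr_t)p;
}

/* Return a pointer derived from `base` whose address is `new_addr`.
 * Uses pointer arithmetic on `base` rather than an integer-to-pointer
 * cast; `new_addr` must lie within (or one past) the same object. */
static void *with_address(void *base, uintptr_t new_addr)
{
    ptrdiff_t delta = (ptrdiff_t)(new_addr - addr_of(base));
    return (char *)base + delta;
}

/* Round `p` up to a multiple of `align`, which must be a power of two. */
static void *align_up(void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = addr_of(p);                            /* get address */
    uintptr_t aligned = (addr + align - 1) & ~(align - 1);  /* arithmetic  */
    return with_address(p, aligned);                        /* reapply     */
}
```

Code written this way keeps a clear derivation chain from the original pointer, which is what lets the result stay a valid capability on our targets.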

We have found that large codebases require a very small amount of porting to support this mode, but they *do* require some.  This mode is not 100% compatible with existing codebases, and so a single-provenance model for LLVM IR (at least, as the only option) would not be acceptable.

David

[1] Well, kind of.  It gives you a capability relative to the default data capability, but in the ABI where all pointers are capabilities, the default data capability is likely to be invalid.  Optimisations may not introduce `inttoptr`.