[cfe-dev] [RFC] Introducing a byte type to LLVM

Fri Jun 4 09:03:36 PDT 2021

Hi George,

On 6/4/21 10:24 AM, George Mitenkov via cfe-dev wrote:
> Hi all,
>
> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
> type to LLVM to fix miscompilations due to load type punning. Please see
> the proposal below. It would be great to hear the
> feedback/comments/suggestions!
>
>
> Motivation
> ==========
>
> char and unsigned char are considered to be universal holders in C. They
> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
> counterpart but it does not have such semantics, which is also not
> desirable as it would disable many optimizations.
>
> We therefore propose to introduce a new byte type that would have a raw-memory
> access semantics and can be differentiated from i8. This can help to fix
> unsound optimizations and make the lowering of char, unsigned char or
> std::byte correct. Currently, we denote the byte type as b<N>, where N is
> the number of bits.
>
> In our semantics, byte type carries provenance and copying bytes does not
> escape pointers, thereby benefiting alias analysis. If the memory contains
> poison bits, then the result of loading a byte from it is bitwise poison.

Could you elaborate on the motivation a bit. I'm not sure I follow.
Maybe examples would be good. The only one I found is in
https://bugs.llvm.org/show_bug.cgi?id=37469. For that one I'm
wondering how much we would in reality loose if we make the required
ptr2int explicit when people cast a pointer to an integer. In general,
explicit encoding seems to me preferable and overall less work (new
types kinda scare me TBH).

~ Johannes

>
>
> Benefits
> ========
>
> The small calls to memcpy and memmove that are transformed by Instcombine
> into integer load/store pairs would be correctly transformed into the
> loads/stores of the new byte type no matter what underlying value is being
> copied (integral value or a pointer value).
>
>     1.
>
>     The new byte type solves the problem of copying the padded data, with no
>     contamination of the loaded value due to poison bits of the padding.
>     2.
>
>     When copying pointers as bytes implicit pointer-to-integer casts are
>     avoided.The current setting of performing a memcpy using i8s leads to
>     miscompilations (example: bug report 37469
>     <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for alias
>     analysis.
>
>
>
> Lowering of char, unsigned char and std::byte
>
> ======================================
>
> For any function that takes char, unsigned char or std::byte as arguments,
> we lower these to the b8 instead of i8. The motivation comes from the fact
> that any pointer can be explicitly copied in bytes, and each byte can be
> passed as an argument to the copying function. Since a copy like that can
> escape a pointer, we need to generate the byte type to avoid the issue.
>
> Example
>
> void foo(unsigned char arg1, char arg2) {...}
>
>              =>
>
> void @foo(zeroext b8 %arg1, signext b8 %arg2) {...}
>
>
>
> std::byte is defined as enum class byte : unsigned char {} and therefore
> all bitwise operations over it can be lowered equivalently to operations on
> unsigned char (operation is performed over zero-extended to i32 operands,
> and the result is truncated back).
>
>
> ==============================
>
> Byte Type Semantics (Version 1)
>
> ==============================
>
>
> [allowed] alloca/load/store
>
> =====================
>
> The byte type is allowed to be allocated, stored and loaded. No semantic
> changes needed.
>
> [allowed] zext/sext/trunc
>
> ====================
>
> In order to easily adapt the current replacement of i8 with b8, we want to
> extend zext/sext/trunc instructions’ semantics to allow the first operand
> as a byte type as well. This will reduce the number of instructions needed
> for the cast.
>
> Example
>
> trunc i32 %int to b8
>
> zext i8 %char to b32
>
> [modified] bitcast
>
> ==============
>
> We modify the semantics of the bitcast to allow casts from and to the byte
> types. If the byte has a value of the same kind (integral or a pointer) as
> the other type, then the bitcast is a noop. Otherwise, the result is poison.
>
> Example
>
> bitcast i<N> %val to b<N>
>
> bitcast i<N>* %val to b64
>
> bitcast b<N> %byte to i<N>     [byte has an integral value => noop]
>
>
>
> [new] bytecast
>
> ============
>
> During IR generation, we cannot deduce whether the byte value contains a
> pointer or an integral value, and therefore we cannot create a bitcast.
>
> Example
>
> int @cast(char c) { return (int) c; }
>
> Current
>
> Proposed
>
> i32 @cast(i8 signext %c) {
>
>    %1 = alloca i8, align 1
>
>    store i8 %c, i8* %1, align 1
>
>    %2 = load i8, i8* %1, align 1
>
>    %3 = sext i8 %2 to i32
>
>    ret i32 %3
>
> }
>
> i32 @cast(b8 %c) {
>
>    %1 = alloca b8
>
>    store b8 %c, b8* %11
>
>    %2 = load b8, b8* %1
>
>    %3 =    ?    b8 %2 to i32
>
>    ret i32 %3
>
> }
>
> In this example, the operation  ?  can be sext (b8 is i8) or can be
> ptrtoint if the byte has a pointer value.
>
> We therefore introduce a new bytecast instruction. The frontend will always
> produce a bytecast, which can then be optimized into a more specific cast
> if necessary. Bytecast operands must have the same bitwidth (or size).
>
> The semantics of bytecast instruction is:
>
> bytecast b<N> %byte to T
>
> The kind of the value of the byte
>
> T
>
> Semantics
>
> Pointer
>
> Integral
>
> ptrtoint
>
> Pointer
>
> Pointer
>
> bitcast
>
> Integral
>
> Integral
>
> integral casts (zext, sext, etc.)
>
> Integral
>
> Pointer
>
> inttoptr
>
>
>
> Essentially, this instruction is a selection of the right casts. Once the
> compiler has been able to prove the kind of the byte’s value, the bytecast
> can be replaced with appropriate cast or become a noop.
>
> Example
>
> bytecast b64 %bytes to i32*
>
> bytecast b8 %byte to i8
>
>
>
> [disabled] and/or/xor
>
> =================
>
> We do not allow bitwise operations over byte types because it’s not trivial
> to define all cases, particularly the ones involving pointers. If these
> operations are useful for optimizations and/or some frontends, semantics
> for these can be considered separately.
>
> If the operation is desired on the frontend level, then the default
> generated code can always cast the byte type to an integer to operate on
> integer values.
>
>
>
> [disabled] arithmetic
>
> =================
>
> Performing arithmetic operations over the byte type is similarly not
> allowed (A valid question is what does it mean to add to bytes of raw
> memory?). If we want to perform arithmetic, we need to cast a byte to an
> integer (via sext/zext/trunc explicitly).
>
>
>
> [modified] comparison
>
> ==================
>
> We allow performing comparisons, as we may potentially want to compare the
> ordering of the memory instances, check for null, etc. Comparison is also
> needed since char types are commonly compared. We define the following
> semantic of the byte type comparison.
>
> Case 1: The values of the bytes have the same kinds: compare the values.
>
> Case 2: The values of the bytes have the different kinds: do a bytecast of
> the non-integral type to the other type and compare the integral values.
>
>
> Example
>
> =======
>
> unsigned char sum(unsigned char a, unsigned char b) { return a + b; }
>
> Current
>
> Proposed
>
> zeroext i8 @sum(i8 zeroext %a, i8 zeroext %b) {
>
>    %1 = alloca i8
>
>    %2 = alloca i8
>
>    store i8 %a, i8* %1
>
>    store i8 %b, i8* %2
>
>    %3 = load i8, i8* %1
>
>    %4 = zext i8 %3 to i32
>
>    %5 = load i8, i8* %2
>
>    %6 = zext i8 %5 to i32
>
>    %7 = add nsw i32 %4, %6
>
>    %8 = trunc i32 %7 to i8
>
>    ret i8 %8
>
> }
>
> zeroext b8 @sum(b8 zeroext %a, b8 zeroext %b) {
>
>    %1 = alloca b8
>
>    %2 = alloca b8
>
>    store b8 %a, b8* %1
>
>    store b8 %b, b8* %2
>
>    %3 = load b8, b8* %1
>
>    %4 = bytecast b8 %3 to i8
>
>    %5 = zext i8 %4 to i32
>
>    %6 = load b8, b8* %2
>
>    %7 = bytecast b8 %6 to i8
>
>    %8 = zext i8 %7 to i32
>
>    %9 = add nsw i32 %5, %8
>
>    %10 = trunc i32 %9 to b8
>
>    ret b8 %10
>
> }
>
>
>
>
>
> Thanks,
>
> George
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev