[llvm-dev] [RFC] Introducing a byte type to LLVM

Fri Jun 4 08:24:11 PDT 2021

Hi all,

Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
type to LLVM to fix miscompilations due to load type punning. Please see
the proposal below. It would be great to hear the
feedback/comments/suggestions!

Motivation
==========

char and unsigned char are considered to be universal holders in C. They
can access raw memory and are used to implement memcpy. i8 is the LLVM’s
counterpart but it does not have such semantics, which is also not
desirable as it would disable many optimizations.

We therefore propose to introduce a new byte type that would have a raw-memory
access semantics and can be differentiated from i8. This can help to fix
unsound optimizations and make the lowering of char, unsigned char or
std::byte correct. Currently, we denote the byte type as b<N>, where N is
the number of bits.

In our semantics, byte type carries provenance and copying bytes does not
escape pointers, thereby benefiting alias analysis. If the memory contains
poison bits, then the result of loading a byte from it is bitwise poison.

Benefits
========

The small calls to memcpy and memmove that are transformed by Instcombine
into integer load/store pairs would be correctly transformed into the
loads/stores of the new byte type no matter what underlying value is being
copied (integral value or a pointer value).

   1.

   The new byte type solves the problem of copying the padded data, with no
   contamination of the loaded value due to poison bits of the padding.
   2.

   When copying pointers as bytes implicit pointer-to-integer casts are
   avoided.The current setting of performing a memcpy using i8s leads to
   miscompilations (example: bug report 37469
   <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for alias
   analysis.

Lowering of char, unsigned char and std::byte

======================================

For any function that takes char, unsigned char or std::byte as arguments,
we lower these to the b8 instead of i8. The motivation comes from the fact
that any pointer can be explicitly copied in bytes, and each byte can be
passed as an argument to the copying function. Since a copy like that can
escape a pointer, we need to generate the byte type to avoid the issue.

Example

void foo(unsigned char arg1, char arg2) {...}

            =>

void @foo(zeroext b8 %arg1, signext b8 %arg2) {...}

std::byte is defined as enum class byte : unsigned char {} and therefore
all bitwise operations over it can be lowered equivalently to operations on
unsigned char (operation is performed over zero-extended to i32 operands,
and the result is truncated back).

==============================

Byte Type Semantics (Version 1)

==============================

[allowed] alloca/load/store

=====================

The byte type is allowed to be allocated, stored and loaded. No semantic
changes needed.

[allowed] zext/sext/trunc

====================

In order to easily adapt the current replacement of i8 with b8, we want to
extend zext/sext/trunc instructions’ semantics to allow the first operand
as a byte type as well. This will reduce the number of instructions needed
for the cast.

Example

trunc i32 %int to b8

zext i8 %char to b32

[modified] bitcast

==============

We modify the semantics of the bitcast to allow casts from and to the byte
types. If the byte has a value of the same kind (integral or a pointer) as
the other type, then the bitcast is a noop. Otherwise, the result is poison.

Example

bitcast i<N> %val to b<N>

bitcast i<N>* %val to b64

bitcast b<N> %byte to i<N>     [byte has an integral value => noop]

[new] bytecast

============

During IR generation, we cannot deduce whether the byte value contains a
pointer or an integral value, and therefore we cannot create a bitcast.

Example

int @cast(char c) { return (int) c; }

Current

Proposed

i32 @cast(i8 signext %c) {

  %1 = alloca i8, align 1

  store i8 %c, i8* %1, align 1

  %2 = load i8, i8* %1, align 1

  %3 = sext i8 %2 to i32

  ret i32 %3

}

i32 @cast(b8 %c) {

  %1 = alloca b8

  store b8 %c, b8* %11

  %2 = load b8, b8* %1

  %3 =    ?    b8 %2 to i32

  ret i32 %3

}

In this example, the operation  ?  can be sext (b8 is i8) or can be
ptrtoint if the byte has a pointer value.

We therefore introduce a new bytecast instruction. The frontend will always
produce a bytecast, which can then be optimized into a more specific cast
if necessary. Bytecast operands must have the same bitwidth (or size).

The semantics of bytecast instruction is:

bytecast b<N> %byte to T

The kind of the value of the byte

T

Semantics

Pointer

Integral

ptrtoint

Pointer

Pointer

bitcast

Integral

Integral

integral casts (zext, sext, etc.)

Integral

Pointer

inttoptr

Essentially, this instruction is a selection of the right casts. Once the
compiler has been able to prove the kind of the byte’s value, the bytecast
can be replaced with appropriate cast or become a noop.

Example

bytecast b64 %bytes to i32*

bytecast b8 %byte to i8

[disabled] and/or/xor

=================

We do not allow bitwise operations over byte types because it’s not trivial
to define all cases, particularly the ones involving pointers. If these
operations are useful for optimizations and/or some frontends, semantics
for these can be considered separately.

If the operation is desired on the frontend level, then the default
generated code can always cast the byte type to an integer to operate on
integer values.

[disabled] arithmetic

=================

Performing arithmetic operations over the byte type is similarly not
allowed (A valid question is what does it mean to add to bytes of raw
memory?). If we want to perform arithmetic, we need to cast a byte to an
integer (via sext/zext/trunc explicitly).

[modified] comparison

==================

We allow performing comparisons, as we may potentially want to compare the
ordering of the memory instances, check for null, etc. Comparison is also
needed since char types are commonly compared. We define the following
semantic of the byte type comparison.

Case 1: The values of the bytes have the same kinds: compare the values.

Case 2: The values of the bytes have the different kinds: do a bytecast of
the non-integral type to the other type and compare the integral values.

Example

=======

unsigned char sum(unsigned char a, unsigned char b) { return a + b; }

Current

Proposed

zeroext i8 @sum(i8 zeroext %a, i8 zeroext %b) {

  %1 = alloca i8

  %2 = alloca i8

  store i8 %a, i8* %1

  store i8 %b, i8* %2

  %3 = load i8, i8* %1

  %4 = zext i8 %3 to i32

  %5 = load i8, i8* %2

  %6 = zext i8 %5 to i32

  %7 = add nsw i32 %4, %6

  %8 = trunc i32 %7 to i8

  ret i8 %8

}

zeroext b8 @sum(b8 zeroext %a, b8 zeroext %b) {

  %1 = alloca b8

  %2 = alloca b8

  store b8 %a, b8* %1

  store b8 %b, b8* %2

  %3 = load b8, b8* %1

  %4 = bytecast b8 %3 to i8

  %5 = zext i8 %4 to i32

  %6 = load b8, b8* %2

  %7 = bytecast b8 %6 to i8

  %8 = zext i8 %7 to i32

  %9 = add nsw i32 %5, %8

  %10 = trunc i32 %9 to b8

  ret b8 %10

}

Thanks,

George
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210604/606aff9b/attachment.html>