[llvm-dev] RFC: On non 8-bit bytes and the target for it

Wed Oct 30 08:18:56 PDT 2019

On 30/10/2019 10:07, Jeroen Dobbelaere via llvm-dev wrote:
> We (Synopsys ASIP Designer team) and our customers tend to disagree: our customers do create plenty of cpu architectures
> with non-8-bit characters (and non-8-bit addressable memories). We are able to provide them with a working c/c++ compiler solution.
> Maybe some support libraries are not supported out of the box, but for these kind of architectures that is acceptable.
> (Besides that, llvm is also more than just c/c++)

My main concern in this discussion is that we're conflating several 
concepts of a 'byte':

  - The smallest unit that can be loaded / stored at a time.

  - The smallest unit that can be addressed with a raw pointer in a 
specific address space.

  - The largest unit whose encoding is opaque to anything above the ISA.

  - The type used to represent `char` in C.

  - The type that has a size that all other types are a multiple of.

In POSIX C (which imposes some extra constraints not found in ISO C), 
when lowered to LLVM IR, all of these are the same type:

  - Loads and stores of values smaller than i8 or not a multiple of i8 
may be widened to a multiple of i8.  Bitfield fields that are smaller 
than i8 must use i8 or wider operations and masking.

  - GEP indexes are not well defined for anything that is not a multiple 
of i8.

  - There is no defined bit order of i8 (or bit order for larger types, 
only an assumption that, for example, i32 is 4 i8s in a specific order 
specified by the data layout).

  - char is lowered to i8.

  - All ABI-visible types have a size that is a multiple of 8 bits.
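
As a rough illustration of the bitfield point above, here is a minimal C 
sketch of what 'i8 or wider operations and masking' amounts to.  The 
layout (a 3-bit field in bits 2..4 of one 8-bit storage unit) and the 
names FIELD_SHIFT, FIELD_MASK, read_field and write_field are made up 
for the example; real layouts are chosen by the target ABI.

```c
#include <stdint.h>

/* Hypothetical layout: a 3-bit field stored in bits 2..4 of an 8-bit unit. */
enum { FIELD_SHIFT = 2, FIELD_MASK = 0x7 };

/* Read: load the whole 8-bit unit, shift the field down, mask off the rest. */
static uint8_t read_field(const uint8_t *unit)
{
    return (uint8_t)((*unit >> FIELD_SHIFT) & FIELD_MASK);
}

/* Write: load the unit, clear the field's bits, merge the new value, store. */
static void write_field(uint8_t *unit, uint8_t value)
{
    uint8_t cleared = (uint8_t)(*unit & ~(FIELD_MASK << FIELD_SHIFT));
    *unit = (uint8_t)(cleared | ((value & FIELD_MASK) << FIELD_SHIFT));
}
```

The memory access itself is always a full 8-bit load or store; only the 
shift and mask make the field narrower.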

It's not clear to me whether saying 'a byte is 257 bits' means changing 
all of these to 257 or changing only some of them to 257 (which?).  For 
example, when compiling C for 16-bit-addressable historic 
architectures, typically:

  - char is 8 bits.

  - char* and void* are represented as a pointer plus a 1-bit offset 
(sometimes encoded in the low bit, so the load / store sequence is a 
right shift by one, a load, and then a mask or mask and shift depending 
on the low bit).

  - Other pointer types are 16-bit aligned.
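
A minimal C sketch of that char load sequence, assuming the 1-bit 
offset lives in the low bit of the pointer and that offset 0 selects 
the low half of the 16-bit word (both are assumptions; real targets 
differ).  The names memory, char_ptr and load_char are hypothetical.

```c
#include <stdint.h>

/* Hypothetical word-addressed memory: each index names one 16-bit word. */
static uint16_t memory[2] = { 0x4241, 0x4443 };

/* Hypothetical char pointer: word address in the upper bits,
   byte-within-word selector in bit 0. */
typedef uint16_t char_ptr;

static uint8_t load_char(char_ptr p)
{
    uint16_t word = memory[p >> 1];    /* right shift by one, then load */
    if (p & 1)
        return (uint8_t)(word >> 8);   /* low bit set: shift, then truncate */
    return (uint8_t)(word & 0xFF);     /* low bit clear: mask only */
}
```

With the sample contents, load_char(0) returns 0x41 and load_char(1) 
returns 0x42, both taken from the same 16-bit word.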

IBM's 36-bit word machines use a broadly similar strategy, though with 
some important differences, and I would imagine that most Synopsys cores 
are going to use some variation on this approach.

This is probably quite a different design from a model with 257-bit 
registers, but most of these concerns don't arise if you don't have 
memory that can store byte arrays, so that case involves very different 
design decisions.

TL;DR: A proposal for supporting non-8-bit bytes needs to explain what 
its expected lowerings are and what it means by a byte.

David