[llvm-dev] RFC: On non 8-bit bytes and the target for it
David Chisnall via llvm-dev
llvm-dev at lists.llvm.org
Wed Oct 30 08:18:56 PDT 2019
On 30/10/2019 10:07, Jeroen Dobbelaere via llvm-dev wrote:
> We (Synopsys ASIP Designer team) and our customers tend to disagree: our customers do create plenty of cpu architectures
> with non-8-bit characters (and non-8-bit addressable memories). We are able to provide them with a working c/c++ compiler solution.
> Maybe some support libraries are not supported out of the box, but for these kind of architectures that is acceptable.
> (Besides that, llvm is also more than just c/c++)
My main concern in this discussion is that we're conflating several
concepts of a 'byte':
- The smallest unit that can be loaded / stored at a time.
- The smallest unit that can be addressed with a raw pointer in a
specific address space.
- The largest unit whose encoding is opaque to anything above the ISA.
- The type used to represent `char` in C.
- The type that has a size that all other types are a multiple of.
In POSIX C (which imposes some extra constraints not found in ISO C),
when lowered to LLVM IR, all of these are the same type:
- Loads and stores of values smaller than i8 or not a multiple of i8
may be widened to a multiple of i8. Bitfield fields that are smaller
than i8 must use i8 or wider operations and masking.
- GEP indexes are not well defined for anything that is not a multiple
of i8.
- There is no defined bit order of i8 (or bit order for larger types,
only an assumption that, for example, i32 is 4 i8s in a specific order
specified by the data layout).
- char is lowered to i8.
- All ABI-visible types have a size that is a multiple of 8 bits.
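As a rough illustration of the bitfield point above, here is a minimal C
sketch of reading and writing a field narrower than 8 bits using only
byte-wide accesses and masking. The field position (3 bits at bits 2..4)
and the names FIELD_SHIFT, FIELD_MASK, read_field and write_field are
made up for the example; this is not what any particular front end
emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a 3-bit field stored in bits 2..4 of one byte. */
#define FIELD_SHIFT 2u
#define FIELD_MASK  0x7u

/* Read the field: byte-wide load, then shift and mask. */
static unsigned read_field(const uint8_t *p) {
    return ((unsigned)*p >> FIELD_SHIFT) & FIELD_MASK;
}

/* Write the field: byte-wide load, clear the field bits, insert the
   new value, byte-wide store. Neighbouring bits are preserved. */
static void write_field(uint8_t *p, unsigned v) {
    assert(v <= FIELD_MASK);                /* value must fit in 3 bits */
    unsigned old = *p;
    old &= ~(FIELD_MASK << FIELD_SHIFT);    /* clear bits 2..4 */
    old |= (v & FIELD_MASK) << FIELD_SHIFT; /* insert new value */
    *p = (uint8_t)old;
}
```

Starting from a storage byte of 0xFF, write_field(&b, 5) leaves 0xF7 in
the byte and read_field(&b) then returns 5.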
It's not clear to me whether saying 'a byte is 257 bits' means changing
all of these to 257 or changing only some of them to 257 (which?). For
example, when compiling C for 16-bit-addressable historic architectures:
- char is 8 bits.
- char* and void* are represented as a pointer plus a 1-bit offset
(sometimes encoded in the low bit, so the load / store sequence is a
right shift by one, a load, and then a mask or a mask and shift,
depending on the low bit).
- Other pointer types are 16-bit aligned.
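The char* lowering described above can be sketched in C roughly as
follows, simulating a memory whose addresses each name one 16-bit word.
Everything here is an assumption made for illustration: the byte_ptr
type, the mem array, the helper names, and in particular the choice that
a low bit of 0 selects the low-order half of the word. A real target
would fix these in its ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: word address in the upper bits, byte selector
   in the low bit. */
typedef uint32_t byte_ptr;

/* Simulated word-addressed memory: each index names one 16-bit word. */
#define MEM_WORDS 16u
static uint16_t mem[MEM_WORDS];

static byte_ptr make_byte_ptr(uint32_t word_addr, unsigned offset) {
    assert(word_addr < MEM_WORDS && offset <= 1u);
    return (word_addr << 1) | offset;
}

/* Load: right shift by one, load the word, then mask or shift+mask. */
static uint8_t load_byte(byte_ptr p) {
    uint16_t w = mem[p >> 1];
    return (p & 1u) ? (uint8_t)(w >> 8) : (uint8_t)(w & 0xFFu);
}

/* Store: read-modify-write of the containing word. */
static void store_byte(byte_ptr p, uint8_t v) {
    uint16_t w = mem[p >> 1];
    if (p & 1u)
        w = (uint16_t)((w & 0x00FFu) | ((unsigned)v << 8));
    else
        w = (uint16_t)((w & 0xFF00u) | v);
    mem[p >> 1] = w;
}
```

With this encoding, adding 1 to a byte_ptr steps to the next 8-bit char,
crossing into the next word when the low bit wraps, while pointers to
wider types can remain plain word addresses.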
IBM's 36-bit word machines use a broadly similar strategy, though with
some important differences, and I would imagine that most Synopsys cores
are going to use some variation on this approach.
This probably involves quite a different design from a model with
257-bit registers, but most of these concerns don't exist if you don't
have memory that can store byte arrays, so that case involves very
different design decisions.
TL;DR: A proposal for supporting non-8-bit bytes needs to explain what
their expected lowerings are and what they mean by a byte.