[llvm-dev] RFC: On non 8-bit bytes and the target for it
Robinson, Paul via llvm-dev
llvm-dev at lists.llvm.org
Fri Nov 1 08:43:04 PDT 2019
Did somebody say "PDP-10"? 😊
> David Chisnall raised a question about what to count as a byte
> (which defines the scope of the changes) and we suggest to use
> all 5 criteria he granted:
> > - The smallest unit that can be loaded / stored at a time.
> > - The smallest unit that can be addressed with a raw pointer
> >   in a specific address space.
> > - The largest unit whose encoding is opaque to anything above
> >   the ISA.
> > - The type used to represent `char` in C.
> > - The type that has a size that all other types are a multiple
> >   of.
> But if DSPs are less restrictive about byte, some of the criteria
> could be removed.
> 2. Use an iconic target. PDP10 was suggested as a candidate. This
> opinion found support from Tim Northover, Joerg Sonnenberger, Mehdi
> AMINI, Philip Reames. It's not clear, though, whether this opinion
> opposes upstreaming non-8-bit bytes without tests, or just the dummy
> and TVM target options.
Note that for the PDP-10, not all 5 criteria are the same thing.
It is a word-addressed machine (36-bit words) but the ISA has
instructions to handle 18-bit halfwords, and also defines a
"byte pointer" to allow load/store of arbitrary-size bytes within
a word. Byte pointers allow any size byte that fits in a word
(from 1 bit to 36 bits). So what we have is:
> - The smallest unit that can be loaded / stored at a time.
This is 1 bit, from the ISA's perspective, using byte pointers.
Obviously caches and such would be word-based, but that's not
the point of this criterion.
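
To make the byte-pointer idea concrete, here is a minimal sketch in Python that models memory as a list of 36-bit words and addresses a byte by (word address, position, size), where position counts the bits to the right of the byte. The function names `ldb` and `dpb` and the list-based memory are illustrative assumptions, and rejecting bytes that do not fit in one word is a simplification, not a description of what the hardware does.

```python
WORD_BITS = 36
WORD_MASK = (1 << WORD_BITS) - 1


def _check(pos, size):
    # Simplification: only accept bytes that lie entirely inside one word.
    if size < 1 or pos < 0 or pos + size > WORD_BITS:
        raise ValueError("byte does not fit in a 36-bit word")


def ldb(memory, addr, pos, size):
    """Load the size-bit byte that has pos bits to its right in memory[addr]."""
    _check(pos, size)
    return (memory[addr] >> pos) & ((1 << size) - 1)


def dpb(memory, addr, pos, size, value):
    """Deposit the low size bits of value into that byte; other bits are kept."""
    _check(pos, size)
    mask = ((1 << size) - 1) << pos
    memory[addr] = ((memory[addr] & ~mask) | ((value << pos) & mask)) & WORD_MASK


memory = [0]
dpb(memory, 0, 35, 1, 1)          # a 1-bit byte: the high-order bit of the word
dpb(memory, 0, 0, 18, 0o777777)   # an 18-bit byte: the right halfword
```

After these two deposits `memory[0]` is `0o400000777777`, and `ldb(memory, 0, 35, 1)` reads the single bit back as `1`. The same pair of operations handles every byte size from 1 to 36 bits.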
> - The smallest unit that can be addressed with a raw pointer
>   in a specific address space.
On PDP-10, the naïve interpretation of "raw pointer" would be
a simple memory address, so this is a 36-bit word. (Halfword
access uses different instructions to move the upper or lower
halfwords; it's not encoded in the address.)
Note that `char *` is not a "raw pointer" in this sense; it is
a byte pointer.
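
As a rough illustration of the halfword point, the sketch below uses one word address for both halves and lets the choice of operation select the half. The function names are invented for this example and are not PDP-10 mnemonics.

```python
HALF_BITS = 18
HALF_MASK = (1 << HALF_BITS) - 1


def load_left_half(memory, addr):
    """Return the upper 18 bits of the 36-bit word at addr."""
    return (memory[addr] >> HALF_BITS) & HALF_MASK


def load_right_half(memory, addr):
    """Return the lower 18 bits of the 36-bit word at addr."""
    return memory[addr] & HALF_MASK


memory = [0o123456654321]
left = load_left_half(memory, 0)    # 0o123456
right = load_right_half(memory, 0)  # 0o654321
```

Both calls take the same address, `0`. Nothing in the address says which half is wanted, which is why the halfword does not qualify as the unit addressed by a raw pointer.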
> - The largest unit whose encoding is opaque to anything above
>   the ISA.
I am not clear what this actually means. I could interpret it
as a double-word floating point, but I doubt that was what was
intended.
> - The type used to represent `char` in C.
tl;dr: 7-bit byte.
C is hard to map to PDP-10. DEC did not provide a compiler,
although I was aware of a third-party C compiler; it used 7-bit
ASCII for `char` which was the most typical character size on
that machine. (Sixbit was also used frequently, if you didn't
need lowercase or many special characters, e.g. for filenames.
8-bit ASCII was uncommon, unless you were forced into doing
data transfers to those newfangled PDP-11 and VAX things.)
This means that `char *` and `int *` had different formats, the
former being a byte pointer and the latter being an address;
casting was not free.
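
A minimal sketch of why the cast costs something, assuming 7-bit characters packed left to right, five per 36-bit word, with the low-order bit unused. Here an `int *` is modelled as a bare word address and a `char *` as a tuple (address, position, size). All names are hypothetical, and the pointer is taken to point directly at its byte, which is a simplification of how a real compiler might set it up.

```python
WORD_BITS = 36
CHAR_BITS = 7


def word_to_char_ptr(addr):
    """'Cast' int* to char*: point at the leftmost 7-bit byte of the word."""
    return (addr, WORD_BITS - CHAR_BITS, CHAR_BITS)


def char_to_word_ptr(ptr):
    """'Cast' char* to int*: keep the word address, drop position and size."""
    addr, _pos, _size = ptr
    return addr


def advance(ptr):
    """Step to the next byte, moving to the next word when this one is full."""
    addr, pos, size = ptr
    if pos >= size:
        return (addr, pos - size, size)
    return (addr + 1, WORD_BITS - size, size)


def store_string(memory, ptr, text):
    """Pack 7-bit characters into zeroed memory starting at ptr."""
    for ch in text:
        addr, pos, size = ptr
        memory[addr] |= (ord(ch) & 0x7F) << pos
        ptr = advance(ptr)
    return ptr


def load_char(memory, ptr):
    addr, pos, size = ptr
    return chr((memory[addr] >> pos) & ((1 << size) - 1))


memory = [0, 0]
start = word_to_char_ptr(0)
end = store_string(memory, start, "HELLO!")
```

Here `end` is `(1, 22, 7)`: the sixth character spilled into the second word. Converting it back with `char_to_word_ptr(end)` gives `1` and loses the position within the word, so a round trip between the two pointer types is neither free nor lossless.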
> - The type that has a size that all other types are a multiple
>   of.
Discounting 'char' and strings, I'd have to say this would be
the 36-bit word, i.e. 'int'.
So, in summary, on the PDP-10 a "byte" might be any of:
- one bit
- seven bits
- 18 bits
- 36 bits
depending on what you mean.
Here endeth the lesson. 😊 Let me know if you need any other
information.
DEC employee from 1982-1992