[llvm-dev] RFC: On non 8-bit bytes and the target for it

Robinson, Paul via llvm-dev llvm-dev at lists.llvm.org
Fri Nov 1 08:43:04 PDT 2019


Did somebody say "PDP-10"?  😊

> David Chisnall raised a question about what to count as a byte
> (which defines the scope of the changes) and we suggest to use
> all 5 criteria he granted:
> > - The smallest unit that can be loaded / stored at a time.
> > - The smallest unit that can be addressed with a raw pointer
> >   in a specific address space.
> > - The largest unit whose encoding is opaque to anything above
> >   the ISA.
> > - The type used to represent `char` in C.
> > - The type that has a size that all other types are a multiple
> >   of.
> But if DSPs are less restrictive about byte, some of the criteria
> could be removed.
>
> 2. Use an iconic target. PDP10 was suggested as a candidate. This
> opinion found support from Tim Northover, Joerg Sonenberger, Mehdi
> AMINI, Philip Reames. It's not clear though does this opinion
> oppose upstreaming non-8-bits byte without tests or just a dummy
> and TVM targets options.

Note that for the PDP-10, not all 5 criteria are the same thing.
It is a word-addressed machine (36-bit words) but the ISA has
instructions to handle 18-bit halfwords, and also defines a 
"byte pointer" to allow load/store of arbitrary-size bytes within 
a word.  Byte pointers allow any size byte that fits in a word 
(from 1 bit to 36 bits).  So what we have is:

> - The smallest unit that can be loaded / stored at a time.

This is 1 bit, from the ISA's perspective, using byte pointers.
Obviously caches and such would be word-based, but that's not
the point of this criterion.
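
A minimal sketch in C of what a byte-pointer load and store amount to,
for readers who have not met the idea. It is a simplified model, not
the hardware encoding: the `byte_ptr` struct and the function names
are hypothetical, a 36-bit word is held in the low bits of a
`uint64_t`, and the indexing and indirection that a real byte pointer
also supports are left out.

```c
#include <stdint.h>

/* A 36-bit word kept in the low 36 bits of a 64-bit integer. */
typedef uint64_t word36;

#define WORD36_MASK ((UINT64_C(1) << 36) - 1)

/* Hypothetical unpacked byte pointer (the hardware packs these
   fields into a single word). */
struct byte_ptr {
    unsigned addr; /* word address */
    unsigned pos;  /* number of bits to the right of the byte, 0..35 */
    unsigned size; /* byte width in bits, 1..36 */
};

/* Mask with the low `size` bits set. */
static uint64_t field_mask(unsigned size)
{
    return (UINT64_C(1) << size) - 1;
}

/* Load: fetch the addressed byte, right-justified. */
static word36 load_byte(const word36 *mem, struct byte_ptr bp)
{
    return (mem[bp.addr] >> bp.pos) & field_mask(bp.size);
}

/* Store: replace only the addressed byte, keeping the rest of the word. */
static void deposit_byte(word36 *mem, struct byte_ptr bp, word36 value)
{
    uint64_t mask = field_mask(bp.size) << bp.pos;
    mem[bp.addr] = ((mem[bp.addr] & ~mask) | ((value << bp.pos) & mask))
                   & WORD36_MASK;
}
```

With `size` set to 1 this touches a single bit of the word, which is
the sense in which the smallest loadable unit is 1 bit.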

> - The smallest unit that can be addressed with a raw pointer
>   in a specific address space.

On PDP-10, the naïve interpretation of "raw pointer" would be
a simple memory address, so this is a 36-bit word.  (Halfword 
access uses different instructions to move the upper or lower 
halfwords; it's not encoded in the address.)
Note that `char *` is not a "raw pointer" in this sense; it is
a byte pointer.

> - The largest unit whose encoding is opaque to anything above
>   the ISA.

I am not clear what this actually means.  I could interpret it
as a double-word floating point, but I doubt that was what was
intended.

> - The type used to represent `char` in C.

tl;dr: 7-bit byte.

C is hard to map to the PDP-10. DEC did not provide a compiler,
although I was aware of a third-party C compiler; it used 7-bit
ASCII for `char`, 7 bits being the most typical character size on
that machine.  (Sixbit was also used frequently if you didn't
need lowercase or many special characters, e.g. for filenames.
8-bit ASCII was uncommon, unless you were forced into doing
data transfers to those newfangled PDP-11 and VAX things.)
This means that `char *` and `int *` had different formats, the
former being a byte pointer and the latter being an address;
casting was not free.
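
To make the cost of that cast concrete, here is a rough sketch in C.
Everything in it is an assumption for illustration: the `char_ptr`
struct and function names are hypothetical, and the layout (five
7-bit characters packed from the left of each 36-bit word, with the
rightmost bit unused) is assumed rather than taken from that
compiler.

```c
#include <stdint.h>

typedef uint64_t word36;          /* 36-bit word in the low bits */

enum { WORD_BITS = 36, CHAR_BITS = 7 };

/* Hypothetical `char *`: word address plus bit position in the word. */
struct char_ptr {
    unsigned addr;
    unsigned pos;   /* bits to the right of the character */
};

/* (char *)int_ptr: must synthesize the position of the leftmost char. */
static struct char_ptr char_ptr_from_word(unsigned addr)
{
    struct char_ptr p = { addr, WORD_BITS - CHAR_BITS };
    return p;
}

/* (int *)char_ptr: the position within the word is discarded. */
static unsigned word_from_char_ptr(struct char_ptr p)
{
    return p.addr;
}

/* p + 1: step right within the word, or move on to the next word. */
static struct char_ptr char_ptr_next(struct char_ptr p)
{
    if (p.pos >= CHAR_BITS) {
        p.pos -= CHAR_BITS;
    } else {
        p.addr += 1;
        p.pos = WORD_BITS - CHAR_BITS;
    }
    return p;
}

/* *p as an rvalue. */
static unsigned load_char(const word36 *mem, struct char_ptr p)
{
    return (unsigned)((mem[p.addr] >> p.pos) & 0x7F);
}

/* *p = c. */
static void store_char(word36 *mem, struct char_ptr p, unsigned c)
{
    word36 mask = (word36)0x7F << p.pos;
    mem[p.addr] = (mem[p.addr] & ~mask) | ((word36)(c & 0x7F) << p.pos);
}
```

Under these assumptions pointer arithmetic on `char *` is a
compare-and-adjust rather than a plain add, and converting back to
`int *` loses the position within the word.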

> - The type that has a size that all other types are a multiple
>   of.

Discounting `char` and strings, I'd have to say this would be
the 36-bit word, i.e. `int`.

So, in summary, on the PDP-10 a "byte" might be any of:
- one bit
- seven bits
- 18 bits
- 36 bits
depending on what you mean.

Here endeth the lesson. 😊 Let me know if you need any other
historical trivia.

--paulr
DEC employee from 1982-1992


