[llvm-dev] RFC: On non 8-bit bytes and the target for it

Sat Nov 2 00:23:15 PDT 2019

On Fri, Nov 1, 2019 at 8:43 AM Robinson, Paul via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Did somebody say "PDP-10"?  😊
>
> > David Chisnall raised a question about what to count as a byte
> > (which defines the scope of the changes) and we suggest to use
> > all 5 criteria he granted:
> > > - The smallest unit that can be loaded / stored at a time.
> > > - The smallest unit that can be addressed with a raw pointer
> > >   in a specific address space.
> > > - The largest unit whose encoding is opaque to anything above
> > >   the ISA.
> > > - The type used to represent `char` in C.
> > > - The type that has a size that all other types are a multiple
> > >   of.
> > But if DSPs are less restrictive about byte, some of the criteria
> > could be removed.
> >
> > 2. Use an iconic target. PDP10 was suggested as a candidate. This
> > opinion found support from Tim Northover, Joerg Sonnenberger, Mehdi
> > AMINI, and Philip Reames. It's not clear, though, whether this
> > opinion opposes upstreaming non-8-bit bytes without tests, or just
> > a dummy and TVM target options.
>
> Note that for the PDP-10, not all 5 criteria are the same thing.
> It is a word-addressed machine (36-bit words) but the ISA has
> instructions to handle 18-bit halfwords, and also defines a
> "byte pointer" to allow load/store of arbitrary-size bytes within
> a word.  Byte pointers allow any size byte that fits in a word
> (from 1 bit to 36 bits).  So what we have is:
>
> > - The smallest unit that can be loaded / stored at a time.
>
> This is 1 bit, from the ISA's perspective, using byte pointers.
> Obviously caches and such would be word-based, but that's not
> the point of this criterion.
>
> > - The smallest unit that can be addressed with a raw pointer
> >   in a specific address space.
>
> On PDP-10, the naïve interpretation of "raw pointer" would be
> a simple memory address, so this is a 36-bit word.  (Halfword
> access uses different instructions to move the upper or lower
> halfwords; it's not encoded in the address.)
> Note that `char *` is not a "raw pointer" in this sense; it is
> a byte pointer.
>
> > - The largest unit whose encoding is opaque to anything above
> >   the ISA.
>
> I am not clear what this actually means.  I could interpret it
> as a double-word floating point, but I doubt that was what was
> intended.
>
> > - The type used to represent `char` in C.
>
> tl;dr: 7-bit byte.
>
> C is hard to map to PDP-10. DEC did not provide a compiler,
> although I was aware of a third-party C compiler; it used 7-bit
> ASCII for `char` which was the most typical character size on
> that machine.  (Sixbit was also used frequently, if you didn't
> need lowercase or many special characters, e.g. for filenames.
> 8-bit ASCII was uncommon, unless you were forced into doing
> data transfers to those newfangled PDP-11 and VAX things.)
> This means that `char *` and `int *` had different formats, the
> former being a byte pointer and the latter being an address;
> casting was not free.
>
> > - The type that has a size that all other types are a multiple
> >   of.
>
> Discounting 'char' and strings, I'd have to say this would be
> the 36-bit word, i.e. 'int'.
>

Fascinating.

So, a 36-bit word could contain 6 Sixbits, 5 7-bit ASCII characters, or 4
8-bit ASCII characters for communicating with later DEC machines?
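
To make that arithmetic concrete, here is a rough Python sketch. The
helper name `chars_per_word` is made up for illustration and does not
come from any PDP-10 toolchain; it just divides the 36-bit word by the
character width and reports what is left over.

```python
WORD_BITS = 36  # PDP-10 word size, as described in the quoted message


def chars_per_word(char_bits, word_bits=WORD_BITS):
    """Return (characters that fit in one word, unused bits left over).

    Hypothetical helper for illustration only.
    """
    return divmod(word_bits, char_bits)


for name, bits in [("Sixbit", 6), ("7-bit ASCII", 7), ("8-bit ASCII", 8)]:
    count, spare = chars_per_word(bits)
    print(f"{name}: {count} per word, {spare} bit(s) unused")
```

Only Sixbit fills the word exactly; 7-bit packing leaves 1 bit unused
and 8-bit packing leaves 4.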

I was going to ask what the compiler does when it sees "Hello World"... but
since DEC didn't provide a compiler, I suppose there can't be an answer to
that...

I would say that it's critical for memcpy to work well enough that it
copies all the bits, which to me means that the size of a "word" has to be
a multiple of whatever 'char' is.  That rules out both 8-bit chars and
7-bit chars.  I would say your only choices are:

1 bit
6 bits
9 bits
36 bits

2, 3, 4, 12, and 18 also evenly divide 36, but I don't see any compelling
reason to want them.
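
The divisibility argument is easy to check mechanically. A minimal
Python sketch, assuming the only requirement is that the char width
divides the 36-bit word evenly (the function name is hypothetical):

```python
WORD_BITS = 36  # PDP-10 word size


def candidate_char_widths(word_bits=WORD_BITS):
    """List every char width (in bits) that evenly divides the word.

    Assumption: a width is acceptable exactly when a word is a whole
    number of chars, so a char-by-char memcpy copies every bit.
    """
    return [w for w in range(1, word_bits + 1) if word_bits % w == 0]


widths = candidate_char_widths()
print(widths)
print(7 in widths, 8 in widths)  # neither 7 nor 8 divides 36
```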

A 9-bit char would have some use if 8-bit characters were only packed
4-to-a-word.

A 1-bit char would be awesome because then you might end up with the only
architecture in the world where vector<bool> wasn't an abomination.  I
guess then that "Hello World" might have a size of 77 chars (84 counting
the NUL), assuming that the compiler treated 7-bit as the preferred
encoding.
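
For what it's worth, the count works out like this in a quick Python
sketch. The function name and the `include_nul` flag are mine, purely
for illustration, and it assumes every character (and the NUL) takes
7 one-bit chars.

```python
def size_in_one_bit_chars(text, encoding_bits=7, include_nul=False):
    """Size of a string literal measured in 1-bit chars.

    Assumes each character, and the terminating NUL if counted,
    occupies `encoding_bits` one-bit chars.
    """
    n_chars = len(text) + (1 if include_nul else 0)
    return n_chars * encoding_bits


print(size_in_one_bit_chars("Hello World"))
print(size_in_one_bit_chars("Hello World", include_nul=True))
```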

...

Thanks for the lesson.  I have a very dim recollection of programming a PDP
in college... apparently blissfully unaware of word sizes... which makes me
think it was probably an 11/70.

-- Jorg