[LLVMdev] Verifying Toolchain Semantics

Ian Grant ian.a.n.grant at googlemail.com
Thu Oct 2 07:54:20 PDT 2014


Dear programming language types,

I wrote this to try once again to explain what is the nature of the
problem that one would have in verifying the integrity of _any_
software toolchain, whether it is aimed ultimately at the production
of other software, or of hardware.

    http://livelogic.blogspot.com/2014/10/the-foundation-part-i.html

This three page text is ostensively about verifying the integrity of a
communications link operating over an un-trusted transport layer, but
a compiler is really a type of communications channel.

I am sure everyone still reading this has wondered about the
possibilities of using universal programming languages (universal in
the Church-Turing sense) as communications protocols. For example, one
could establish a point-to-point connection by writing a program which,
when run, output two more programs: one which, when run, outputs a
decoding pad for the next mesage one would transmit over that channel,
and the other the decoder which prints the message text together with
another program, which was the encoder for returning an
acknowledgement. Both endpoints would do this, and so the
programs/messages would be exchanged, with each one encoding the text
of the other.  Then these programs could include decisions not only
about the encoding of data, choice of one-time pads, etc. but perhaps
also the routes of messages, sending different parts via different
trusted routes over similar channels etc. etc. The variations are
endless, and limited only by the ingenuity of the programmers
communicating over those channels.

And really, I sorely pity anyone charged with organising any kind of
surveillance of a group of people who enjoy that sort of
game. Cracking "the code" would be practically impossible, because
there need never be any fixed concrete representation whatsoever of
the fundamental encoding as it is transmitted over the transport
medium: _all_ of the knowledge about the current and future encodings
can be sent over the previous "incarnations" of that and/or another
channel, and the encoding of channels thereby made non-deterministic:
this means that there could never be _in principle,_ any mechanical
process _whatsoever_ which could decode more than a few parts of any
of those messages. After this brief success, the poor would-be spy
would be right back at square one.

What I try to explain here is the essential distinction between what I
call _actual knowledge,_ as opposed to mere _represented knowledge,_
such as a password, or an SSL certificate, or the documentation for
some file format appearing on a web page. The distinction is that only
in the case of actual knowledge does one know _how_ and _why_ one
knows.

The motivation is the idea that by using actual rather than
represented knowledge, it is possible to construct such a trustworthy
system in practice. But there's a catch! The catch is that this will
only work for an organisation whose motives and governance are
completely open and transparent. This is because the technique relies
upon mutual trust, which is something that cannot exist without
openness and transparency. Bad guys just won't get it! To understand
why (in case it is not immediately obvious to you, that is) you will
need to read (or at least think carefully about) about how
error-detection would work in such a system.

The text consists of a title page with the abstract, and two full
pages. So it should be an easy read. I earlier sent out a nine page
document entitled GNU Thunder, in which I attempted to describe what I
consider to be essentially the same idea, but with the emphasis on
compilers and interpreters, rather than communications. The Thunder
text was a concrete suggestion for an implementation. This text
however is more abstract. But these two documents could be
considered to be complementary in two different senses.

I hope everyone enjoys this, and that it stimulates some interesting
thoughts, and subsequent discussion, and that those thoughts are all
good, and the discussion open and transparent. That way we could save
ourselves an awful lot of _really hairy_ metaprogramming!

Please feel free to copy either text and/or this message and pass it
around. Neither of these two texts are copyright, and the more people
that see them the better. Bad guys in particular need to know about
this much more than the good ones do.

Ian



More information about the llvm-dev mailing list