[cfe-dev] Typed memory (split from byte type)

Mon Jun 14 02:03:40 PDT 2021

TLDR: we should tag load/store to indicate whether provenance is tracked
or disregarded, not add a new integer type.

Treading carefully here as this is running close to a religious issue -
I've seen morality invoked as an argument against reinterpret_cast, though
thankfully not on this mailing list.

A comment from the "introducing a byte type" thread:

> Last week Alive2 caught a miscompilation in the Linux kernel, in the
> network stack. The optimization got the pointer arithmetic wrong. Pretty
> scary, and the bug may have security implications.

Typed memory seems a reasonable consequence of the pointer provenance model
C++ is pursuing (hard to judge whether WG14 is going the same way). It is
probably a reasonable model for clang++ (maybe clang) to work with on those
grounds.

Some of the things that typed memory would rule out are mmap data
structures from disk (e.g. ELF, hash tables) and ptrtoint trickery (NaN
boxing, pointer tagging). Losing mmap of an ELF makes toolchains slower.
Treating NaN boxing or pointer tagging as UB means dynamic languages need
to find a new host. Network code is infamous for reading raw bytes off the
wire. Implementing parts of libm involves reading the
mantissa/exponent/sign bit of IEEE floats. Pragmatism will therefore
motivate an escape hatch, like memcpy was on untyped memory.
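
As a rough sketch of the memcpy escape hatch for the libm case: the code
below pulls the sign, exponent and mantissa out of a float by copying its
bytes into an integer instead of casting pointers. It assumes float is a
32-bit IEEE-754 binary32 value, and the names FloatParts, float_bits,
decompose and compose are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of an IEEE-754 binary32 value.
struct FloatParts {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

// Copy the object representation into an integer. memcpy is the sanctioned
// way to reinterpret bytes; no pointer of the "wrong" type is dereferenced.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Split the raw bits into sign / exponent / mantissa.
inline FloatParts decompose(float f) {
    const std::uint32_t bits = float_bits(f);
    return FloatParts{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Inverse direction: reassemble the bits and copy them back into a float.
inline float compose(const FloatParts& p) {
    assert(p.sign <= 1u && p.exponent <= 0xFFu && p.mantissa <= 0x7FFFFFu);
    const std::uint32_t bits =
        (p.sign << 31) | (p.exponent << 23) | p.mantissa;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, decompose(1.0f) gives sign 0, exponent 127 and mantissa 0.
Compilers typically lower these fixed-size memcpy calls to a plain register
move, so the escape hatch need not cost anything at runtime.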

Assume for a moment that LLVM manages to represent typed memory and untyped
memory, in some fashion that remains internally consistent. Clang can then
mostly emit IR using typed memory while the escape hatches (memcpy,
bitcast; I haven't kept up with the implicit object creation proposals)
emit IR using untyped memory.

Languages that want to be machine-like can emit untyped memory IR and ones
that want to be highlevel-like can emit typed memory IR, with ones that try
to do both emitting a mixture.

This sounds like a different load/store instruction to me, not a different
type. It's still an 'i32' once it's in an SSA variable. More likely a tag
on the load/store to indicate whether pointer provenance tracks through it
or not, since it seems likely we can relax the 'typed' version to 'untyped'
before hitting codegen.

There is some prior art here on atomic. From the machine perspective,
'atomic' is obviously a property of the instruction. From the C++
perspective, 'atomic' is defined as a property of the type. We seem to
handle that difference in perspective well enough so we can probably handle
untyped/typed memory at the boundary to memory, which is mostly load/store.
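
To illustrate the two perspectives on 'atomic', here is a small sketch. The
first function uses std::atomic, where atomicity comes from the type. The
second uses the __atomic builtins provided by GCC and Clang (a compiler
extension, not standard C++), where a plain int is accessed atomically one
operation at a time. The function names are made up for illustration.

```cpp
#include <atomic>
#include <cassert>

// Type-level view: the object is declared atomic, so every access is atomic.
inline int bump_typed(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_seq_cst) + 1;
}

// Operation-level view: the object is a plain int and atomicity is chosen
// per access. Assumes a GCC- or Clang-compatible compiler.
inline int bump_per_access(int& counter) {
    return __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST) + 1;
}

// Both views give the same observable result on the same starting value.
inline void check_both_views() {
    std::atomic<int> typed{41};
    int plain = 41;
    assert(bump_typed(typed) == bump_per_access(plain));
}
```

The analogy to typed/untyped memory is that the per-access form puts the
property on the operation, which is the shape a tagged load/store would
take, while the frontend is still free to present it as a property of the
type.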

Thanks all,

Jon

p.s. I'd much rather we throw out pointer provenance and go back to the
good old days where no-strict-aliasing was just how things work, because I
seem to write code in domains that collide with that edge a lot, mostly in
toolchains.