[llvm-dev] [RFC] Introducing a byte type to LLVM

Tue Jun 22 12:40:26 PDT 2021

Hi John,

>     Note that "provenance" as we use it in this discussion is an *explicit
>     operational artifact* -- it exists as a concrete piece of state in the
>     Abstract Machine. That is very different from something that might just be
>     used internally in some kind of analysis.
> 
>     There is no problem with "resetting" that provenance on a "inttoptr", and
>     basically forgetting about where the int comes from. Note that this is a
>     statement about an operation in the Abstract Machine, not merely a statement
>     about some analysis: this is not "forgetting" as in "safely
>     overapproximating the real set of possible behaviors", it is "forgetting" by
>     *forcing* the provenance to be some kind of wildcard/default provenance. All
>     analyses then have to correctly account for that.
> 
> But it’s not a truly wildcard provenance. At the very least, it’s
> restricted to the set of provenances that have been exposed, and
> my understanding is that Juneyoung’s non-determinism rule attempts
> to readmit a use-dependency analysis and turn |ptrtoint| back into
> a simple scalar operation.

So, at least in the paper Juneyoung and me and a few other people wrote (the one 
referenced a couple times in this thread already: 
https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf), it is a wildcard provenance.
"exposed" is not a thing in the operational semantics, it is a theorem we can prove.

> 
>         For example, you have compellingly argued that it’s problematic to
>         do the reduction |a == b ? a : b| to |b| for pointer types. Suppose
>         I instead do this optimization on integers where |a = ptrtoint A|.
>         The result is now simply |b|. If I |inttoptr| that result and access
>         the memory, there will be no record that that access may validly
>         be to |A|. It does not help that the access may be represented
>         as |inttoptr (ptrtoint B)| for some |B| rather than just directly
>         to |B|, because there is no use-dependence on |A|. All there is
>         is an apparently unrelated and unused |ptrtoint A|.
> 
>     So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : ptrtoint B" being
>     replaced by "ptrtoint B"? I don't see any problem with that. Do you have a
>     concrete example?
> 
> I think you can take any example where pointer-type restrictions
> would be necessary to protect against miscompilation and turn it
> into an example here by just inserting |inttoptr| and |ptrtoint|
> appropriately. Quick example:
> 
> |int A = 0x1; int B = 0x2; long a = (long) (A+1); long b = (long) B; long result 
> = (a == b ? a : b); if (a == b) *(((int*) result) - 1) = 0x4; else *((int*) 
> result) = 0x8; printf(“%d %d\n”, A, B); |

(Sorry about the bad quote, not sure what Thunderbird does here.)
I assume you mean "&A" and "&B" in lines 3 and 4?

> I submit that this program has unspecified but not undefined behavior,
> with printing “1 8” and “4 2” being the only two valid outcomes.

Okay, I can follow that.

> But I think an optimizer which changes the fifth line to
> |long result = b;| without leaving any trace behind

so then we are at

int A = 0x1;
int B = 0x2;
long a = (long) (&A+1);
long b = (long) &B;
long result = b;
if (a == b)
   *(((int*) result) - 1) = 0x4;
else
   *((int*) result) = 0x8;
printf(“%d %d\n”, A, B);

> could easily
> compile this to print “1 2” because there would be nothing to
> prevent the initialization of |A| from being forwarded to the
> final load.

I don't think that's right, since "a" is still used in the "if", so a bit of 
information about the address is leaked and you can't assume the address was not 
guessed. The furthest you can go is

int A = 0x1;
int B = 0x2;
long a = (long) (&A+1);
long b = (long) &B;
if (a == b)
   *(((int*) (long) &B) - 1) = 0x4;
else
   *((int*) (long) &B) = 0x8;
printf(“%d %d\n”, A, B);

No need for an explicit 'exposed', I think.

That said, things get more tricky once you want to account for C 'restrict' / 
LLVM 'noalias', and then it might be necessary to have explicit 'exposed'-like 
operations. I haven't seen a properly worked-out model for this, but the 
candidate models I saw would need something like this. So maybe it'd not be a 
bad idea to have such an operation anyway... I just don't think it should have 
any effect for the kind of programs we have been discussing here so far.

Kind regards,
Ralf

> 
> You can prevent this by noting that the provenance of |A| has been
> xposed and allowing the |inttoptr| of |result| to alias |A|, but
> that seems inconsistent with treating |ptrtoint| as a simple scalar
> operation and allowing a use-analysis of a |ptrtoint| to restrict
> which |inttoptr| casts are allowed to recreate provenance for the
> |ptrtoint| operand.
> 
> If you want to keep treating |ptrtoint| as a scalar operation and
> doing use-analyses on it, I think the most palatable option is to
> recognize whenever you’re cutting a use-dependency and
> conservatively record in IR that the original value has now been
> exposed. So if you start with this:
> 
> |%eq = icmp eq i32 %a, %b %result = select i1 %eq, i32 %a, i32 %b |
> 
> You have to transform it like this:
> 
> |%result = %b call void @llvm.expose.i32(i32 %a) |
> 
> You should be able to remove these exposure events in a lot of
> situations, but conservatively they’ll have to be treated as
> escapes.
> 
> Most optimizations never cut use-dependencies on opaque values
> like this and so won’t be affected.
> 
> John.
> 

-- 
Website: https://people.mpi-sws.org/~jung/