[PATCH] D64128: [CodeGen] Generate llvm.ptrmask instead of inttoptr(and(ptrtoint, C)) if possible.

Wed Jul 3 18:37:43 PDT 2019

hfinkel added a comment.

In D64128#1569590 <https://reviews.llvm.org/D64128#1569590>, @efriedma wrote:

> > If they're all syntactically together like this, maybe that's safe?
>
> Having them together syntactically doesn't really help, I think; it might be guarded by some code that does the same conversion (and if you repeat the conversion, it has to produce the same result).

Indeed. That's correct (and also why the hasOneUse check at the IR level would have been ineffective). However...

In D64128#1569578 <https://reviews.llvm.org/D64128#1569578>, @rjmccall wrote:

> I agree with Eli that this isn't obviously a legal transformation.  `llvm.ptrmask` appears to make semantic guarantees about e.g. the pointer after the mask referring to the same underlying object, which means we can only safely emit it when something about the source program makes that guarantee.  It's not at all clear that C does so for an expression like `(T*) ((intptr_t) x & N)`.

I think that this is the key point. First, at the IR level we have a problem because we have no way to robustly track pointer provenance information. If we have `if (a == b) { f(a); }` the optimizer can transform this code into `if (a == b) { f(b); }` and we've lost track of whether the parameter to f is based on a or b. At the source level we don't have this problem (because we have the unaltered expressions provided by the user, and can therefore use whatever provenance information that source implies).
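
To make the provenance problem concrete, here is a minimal C++ sketch of the pattern described above. The names `f` and `call_if_equal` are hypothetical and only for illustration; nothing here is taken from the patch itself.

```cpp
// Hypothetical callee: reads through the pointer it is given.
int f(int *p) { return *p; }

// Inside the guarded block the optimizer knows a == b, so it may
// rewrite f(a) as f(b). The address passed is the same either way,
// but the IR no longer records which pointer the argument was
// based on in the source.
int call_if_equal(int *a, int *b) {
    if (a == b) {
        return f(a);  // may become f(b) after value propagation
    }
    return -1;        // pointers differ: f is never called
}
```

At the source level the argument is visibly `a`; after the substitution, an analysis working only on the IR cannot tell which one the programmer wrote.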

Thus, as John says, the question is whether, at the source level, `(T*) ((intptr_t) x & N)` always has, and only has, the same underlying objects as x whenever executing the expression is well defined. In C++, I think that this is clearly true for implementations with "strict pointer safety" (6.6.5.4.3), as the rules for safely-derived pointer values state that, while you can get safely-derived pointer values using integer casts and bitwise operators, the result must be one that could have been safely derived from the original object using well-defined pointer arithmetic, and that's only true for pointers into some array pointed into by x (or one past the end). For implementations with "relaxed pointer safety", it's all implementation defined, so I don't see why we couldn't choose our implementation-defined semantics to define this problem away (although we certainly need to be careful that we don't unintentionally make any significant body of preexisting code incompatible with Clang by doing so).
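
For reference, the source pattern in question typically looks like the following sketch: rounding a pointer down to an alignment boundary by masking its integer representation. The helper name `align_down` is hypothetical, the sketch uses `uintptr_t` rather than `intptr_t` to avoid signed bit operations, and it assumes the usual flat mapping between pointers and integers (which is implementation defined) and a power-of-two alignment.

```cpp
#include <cstdint>

// Hypothetical helper: clears the low bits of p so the result is a
// multiple of `alignment`. Assumes `alignment` is a power of two and
// that pointer/integer conversion preserves the address bits.
inline char *align_down(char *p, std::uintptr_t alignment) {
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t mask = ~(alignment - 1);  // plays the role of N
    return reinterpret_cast<char *>(bits & mask);
}
```

When `p` points into a suitably aligned buffer, the result still points into that same buffer, which is the case where treating the result as based only on `p` is uncontroversial. The open question in this thread is whether the language guarantees that for every well-defined execution of the expression.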

For C, we also need to be concerned with the definition of "based on" (6.7.3.1). In some philosophical sense, this seems trickier (i.e., what if modifying the value of x at some sequence point prior to the expression makes the expression dead? Are we required, as part of the standardized thought experiment, to also modify the other variables to keep the expression alive when performing the "based on" analysis, and do those modifications count for the purposes of determining the "based on" property?). Regardless, given that the intent is to enable optimizations, it seems reasonable to say that `(T*) ((intptr_t) x & N)` is only based on x. For C, 6.3.2.3 makes the conversion validity itself implementation defined.

@rsmith , thoughts on this?

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D64128