[PATCH] An analysis to find external function pointers and trace their data flow

Thu Mar 13 13:29:37 PDT 2014

================
Comment at: lib/Analysis/ExternalFunctionAnalysis.cpp:236
@@ +235,3 @@
+                       << " efa-maybe-external pointer gets stored into another"
+                       << " pointer.\n");
+        } else if (S->getPointerOperand() == V) {
----------------
JF Bastien wrote:
> I'm not sure I understand what the bad implication is here, could you improve the warning to explain? Is it a kind of taint tracking thing, where it hard to know which maybe-external the pointer now refers to?
Yes, it's a matter of the precision of the analysis: this means that a maybe-external has been stored into a pointer that we're not tracking. It's just a warning, though, and is only a DEBUG warning, since imprecision of this algorithm is almost guaranteed in any complex program. I've updated the warning to try to make this more clear.

================
Comment at: lib/Analysis/ExternalFunctionAnalysis.cpp:260
@@ +259,3 @@
+    const Value *V = cast<User>(OpVal)->getOperand(0);
+    assert(V && "Couldn't get operand 0 of operand 1 of a User");
+
----------------
JF Bastien wrote:
> The assert is redundant when using cast<>.
The assert is for getOperand rather than the cast. But maybe getOperand can't return null. It will certainly assert if its argument is out of range.

================
Comment at: lib/Analysis/ExternalFunctionAnalysis.cpp:251
@@ +250,3 @@
+
+  size_t annotation_len = array_lengthof(efa_annotation);
+  for (const auto &Us : PtrAnnotation->users()) {
----------------
JF Bastien wrote:
> This should be array_lengthof(efa_annotation) - 1? The array contains the null terminator. The same applies a few places below.
It turns out that the initializer length in the IR also contains the null terminator, so both strings end up with the same length this way.

================
Comment at: lib/Analysis/ExternalFunctionAnalysis.cpp:131
@@ +130,3 @@
+  // This can return NULL if CalledFun is a VarArg function, and the
+  // argument we want is in the "..." part.
+  const Value *Arg = getParameterForArgument(ArgNo, CalledFun);
----------------
JF Bastien wrote:
> Doesn't the algorithm become imprecise if you don't follow vaarg arguments? AFAICT the call may just fail? Realistically it's probably just printf, but it would be good to document what happens with:
> ```
> void ugly(size_t n, ...) {
>   typedef void (*F)();
>   va_list list;
>   va_start(list, n);
>   for (size_t i = 0; i != n; ++n) {
>     F f = va_arg(list, F);
>     (*f)();
>   }
>   va_end(list);
> }
> void callee() {
>   printf("Callee\n");
> }
> int main() {
>   ugly(1, &callee);
> }
> ```
> Maybe add a test if it's expected to work?
Yes, the algorithm becomes (more) imprecise by not following var args. The algorithm does not guarantee that it finds all the dataflow for external function pointers, only that it finds some. In fact, it's relatively easy to write code that the analysis misses even without considering var args: just do pointer arithmetic in C. The analysis doesn't trace dataflow through arithmetic operations, so it will lose track of that external pointer.

In this case, not following the var args means that the analysis will miss places that external functions might be called, and they will become false positives when this analysis is used for Control-flow integrity. The CFI failure function will determine in that case what happens to the call. It needn't necessarily fail.

Note that the case in the code above will not cause a call to fail or be a false positive, since callee is defined in the module. The only time this matters is when code is passing an external function pointer (e.g., from dlsym) to a var args call that then calls this function pointer.

So, I think it's probably not worth the extra complexity to make the analysis more precise on an unusual corner case.

http://llvm-reviews.chandlerc.com/D2873