[PATCH] D147975: [StackProtector] don't check stack protector before calling nounwind functions

Wed Apr 12 09:50:48 PDT 2023

nickdesaulniers added a comment.

In D147975#4259794 <https://reviews.llvm.org/D147975#4259794>, @LuoYuanke wrote:

> For this case
>
>   void caller () {
>     int arr [42] = {};
>     int (*callee)() = get_callback();
>     vuln();
>     callee(arr);
>   }
>
> If callee is not noreturn function, caller has a chance to check the canary when it returns.

The issue is that if `callee` is not `noreturn`, `vuln` may have overwritten the value of `callee` (if it was on the stack) and the stack canary is not checked BEFORE transferring control flow to an arbitrary location.  But if `callee` was `noreturn` we would? Echoing the sentiment of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58245#c6, that lack of symmetry seems odd.

Consider another example:

  void caller () {
    int data [42] = {};
    int (*callback)() = get_callback();
    vuln();
    callee(callback, data);
  }

So the stack gets corrupted within `caller`'s frame, possibly modifying `callback` if it was on the stack.  `caller` gets a stack canary, but the check is after the return of `callee` to `caller`.  In the meantime, `callee` assumes the stack data is not corrupted and invokes `callback`, which may be arbitrary attacker code now due to `vuln`, and might not actually ever return where any such stack canary would be checked.  Oops!

In these two examples, nothing is `noreturn`, and yet you can still hijack control flow, so why treat `noreturn` functions specially?  This is my interpretation of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58245#c6 and I agree with the sentiment.

Once a caller has had its stack frame corrupted, literally any stack references passed to a callee are dangerous to use to make control flow decisions.  I also don't think that the callee can detect such corruption of the parent's/caller's frame.

Another issue is that a stack canary is weak indicator that the stack has been tampered with.  You can maliciously modify stack memory without overwriting the stack slot containing the canary, or if the canary was leaked it can be rewritten in the correct slot.  So adding more and more and more stack canary checks is nearly meaningless work (IMO).

> As a tradeoff of security, code size and performance, I don't object to land the patch. I am curious about the noreturn function that is caught in your case. Could you show some examples of those noreturn function?

The kernel code is basically:

  void start_kernel () {
    ...
    initialize_the_stack_canary();
    do_rest();
  }
  void do_rest () {
    ...
    while (1);
  }

LLVM:

1. infers that `do_rest` is implicitly `noreturn`.
2. decides that `start_kernel` must now have a stack canary

At runtime:

1. `start_kernel` writes an uninitialized canary value into a stack slot.
2. `initialize_the_stack_canary` changes the value of canary (globally, not in the stack slot)
3. canary reloaded and compared against the value in the stack slot, which was uninitialized garbage
4. `__stack_chk_fail` -> `panic` -> machine fails to boot at startup. ---

In D147975#4260221 <https://reviews.llvm.org/D147975#4260221>, @efriedma wrote:

> The kernel should probably disable stack canaries for the functions in question?

Yes, I plan to do that //in addition to this change//.  But it's important to demonstrate to the kernel community that we're working on this problem from both angles.  Because kernel developers will ask "what is your toolchain smoking?" (rightfully) when they see this.

For the kernel, I will have to:

1. add support for `__attribute__((no_stack_protector))`. This doesn't exist for all supported toolchain versions (https://docs.kernel.org/process/changes.html) particularly GCC only gained support in gcc-11 which is relatively recent.
2. backport this to all supported kernel versions (https://kernel.org/)

It is infeasible (IMO) to move `start_kernel` to its own TU and build that function with `-fno-stack-protector`; the function attributes give us the ideal granularity, but aren't available everywhere we need to support (temporal versions of kernel sources, and temporal versions of supported toolchains).

---

In D147975#4259865 <https://reviews.llvm.org/D147975#4259865>, @efriedma wrote:

>> I think noreturn is a red-herring
>
> noreturn without nounwind serves as a hint that the callee unconditionally throws an exception.  (The only other alternatives are that the callee unconditionally calls abort(), or unconditionally spins in an infinite loop; such functions are relatively rare in most codebases.)  But it's not really a strong heuristic; functions that are noreturn aren't guaranteed to throw, and functions that aren't noreturn frequently throw.

Is that a vote of confidence for or against this patch?

> Longer-term, it might make sense to consider encoding information about the stack canary into DWARF data; if the unwinder can check the canary itself, that catches the relevant cases without a bunch of inline code.

DWARF does encode location lists which IIRC may contain which stack slot certain variables reside in for certain ranges of the program counter.  I could imagine a special new tag to say "here's where the stack canary is."

What happens if code is built with `-g0`? Does that mean such info is not available? Or does C++ always emit such metadata?

---

In D147975#4259978 <https://reviews.llvm.org/D147975#4259978>, @smeenai wrote:

> In D147975#4259522 <https://reviews.llvm.org/D147975#4259522>, @nickdesaulniers wrote:
>
>> Android undid https://reviews.llvm.org/rGd656ae28095726830f9beb8dbd4d69f5144ef821 here due to measured 2% RSS increase: https://android-review.git.corp.google.com/c/platform/build/soong/+/2524336.  I will add that note to the commit message.  Unless there's more feedback about this change, such as something I may be missing, I plan to commit this tomorrow and request a cherry-pick to the release/16.x branch.
>
> The non-internal version of this link is https://android-review.googlesource.com/c/platform/build/soong/+/2524336. You should update it in the Summary as well.

Good catch, will do.

---

In D147975#4260253 <https://reviews.llvm.org/D147975#4260253>, @LuoYuanke wrote:

>> We could consider trying to encode more information into the IR, I guess.  In a lot of cases, we know a "noreturn nounwind" function is abort()-like, and not longjmp()-like; we just don't bother recording it anywhere because nothing cared before this.
>
> In IR if a function call is within try{} block, front-end should generate invoke instruction, so we may just check if the callsite is invoke instruction.

It's a good idea, but I don't think it's correct. Consider the following example:

  void foo ();
  void bar ();

  void baz () {
      foo();
      try {
          bar();
      } catch (int e) {}
  }

`invoke` is only used in `baz` when calling `bar`, but not when calling `foo`.  If `foo` calls `__cxa_throw`, it will do so via `call` not `invoke` (and so would `bar`).  So `foo` (and `bar`) would not check the stack canary before calling `__cxa_throw`, which is what the CHOP attack paper was concerned with.  Is my understanding precise?

---

In D147975#4259522 <https://reviews.llvm.org/D147975#4259522>, @nickdesaulniers wrote:

> Unless there's more feedback about this change, such as something I may be missing, I plan to commit this tomorrow and request a cherry-pick to the release/16.x branch.

Based on the feedback, I will hold this for one more day (will update the broken link in the commit message though).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147975/new/

https://reviews.llvm.org/D147975