[PATCH] D78717: [SystemZ] Implement -fstack-clash-protection

Fri May 1 00:50:10 PDT 2020

jonpa added a comment.

>> gcc seems not to be probing the residual allocation after the loop. However if only two (unrolled) allocations were made, the residual is also probed.
> 
> I'm not seeing this, do you have an example?

  void large_stack() {
    volatile int stack[2000], i;
    for (i = 0; i < sizeof(stack) / sizeof(int); ++i)
      stack[i] = i;
  }

With stack[2000], I see

  aghi    %r15,-4096
  cg      %r0,4088(%r15)
  aghi    %r15,-4072
  cg      %r0,4064(%r15)

With stack[8000], I don't see a probe after the loop...

> 
> 
>> I am not aware of any real reason to not simply do the probing directly in emitPrologue(), but it seems wisest to do like X86 since inlineStackProbe() is called from common-code. Perhaps this relates to implementing shrink-wrapping or other things?
> 
> The comment you're adding says:
> 
>   // stack probing may involve looping, and control flow generations is
>   // disallowed at this point. Rely to later processing through
>   // `inlineStackProbe`.
>    
>    
> 
> That would seem to be the reason why it cannot be done directly inline, right?

Yes, but I can't see anything other than that comment that (at least currently) demands this - the only use for both emitPrologue() and inlineStackProbe() are right after each other in PEI::insertPrologEpilogCode().

>> I took the X86 tests and copied them over as SystemZ tests and noticed that SystemZ gets these test cases built by SelectionDAGBuilder with dynamic_stackalloc nodes, while X86 seem to get these (constant) allocas merged into the stack frame. This is true also without this patch, but I am not sure why. In this case it seems even more preferred to avoid the dynamic_stackalloca nodes whenever possible...
> 
> That's strange.   Not sure why we should be different from X86 here.  Can you investigate?

This was due to SystemZ having StackRealignable set to false, and the allocas of those test cases were aligned to 16. Removing the alignments in the tests fixed it (see FunctionLoweringInfo.cpp:143).

> As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.

I am a little unsure of how performance critical this is... It seems that the handling of inlineStackProbe with a compile-time constant stack size is an optimization of the general case of an unknown size. It seems to be rather important, since the loop is unrolled if only a few iterations are required.

I think we could emit a pseudo from emitProbedAlloca() and handle that in SystemZFrameLowering instead in the same place, if we would allow the case of a constant size be loaded into a register before the loop. The loop itself might also have one extra instruction or so, but the common case would  be to unroll that loop...

Can R0D and R1D be assumed to be available in the prologue at this point? (how about anyreg-callconv?)

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78717/new/

https://reviews.llvm.org/D78717