[PATCH] D78717: [SystemZ] Implement -fstack-clash-protection

Mon May 4 10:09:33 PDT 2020

uweigand added a comment.

In D78717#2014532 <https://reviews.llvm.org/D78717#2014532>, @jonpa wrote:

> >> gcc seems not to be probing the residual allocation after the loop. However if only two (unrolled) allocations were made, the residual is also probed.
> > 
> > I'm not seeing this, do you have an example?
>
>
>
>   void large_stack() {
>     volatile int stack[2000], i;
>     for (i = 0; i < sizeof(stack) / sizeof(int); ++i)
>       stack[i] = i;
>   }
>
>
> With stack[2000], I see
>
>   aghi    %r15,-4096
>   cg      %r0,4088(%r15)
>   aghi    %r15,-4072
>   cg      %r0,4064(%r15)
>   
>
> With stack[8000], I don't see a probe after the loop...

Well, I see it:

          lay     %r1,-28672(%r15)
  .L5:
          lay     %r15,-4096(%r15)
          cg      %r0,4088(%r15)
          clgr    %r15,%r1
          jh      .L5
          lgr     %r15,%r1
          lay     %r15,-3496(%r15)
          cg      %r0,3488(%r15)

There's one cg in the loop, and this last cg for the tail.  That's with GCC 9, not sure if that changed recently.

>>> I am not aware of any real reason to not simply do the probing directly in emitPrologue(), but it seems wisest to do like X86 since inlineStackProbe() is called from common-code. Perhaps this relates to implementing shrink-wrapping or other things?
>> 
>> The comment you're adding says:
>> 
>>   // stack probing may involve looping, and control flow generations is
>>   // disallowed at this point. Rely to later processing through
>>   // `inlineStackProbe`.
>>    
>>    
>> 
>> That would seem to be the reason why it cannot be done directly inline, right?
> 
> Yes, but I can't see anything other than that comment that (at least currently) demands this - the only use for both emitPrologue() and inlineStackProbe() are right after each other in PEI::insertPrologEpilogCode().

Hmm, not sure either.   But either the comment is correct or not; if it is, we need this split, if it isn't, the comment should be removed :-)   Can you ask who added that callback for Intel why they thought this was necessary?

>> As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.
> 
> I am a little unsure of how performance critical this is... It seems that the handling of inlineStackProbe with a compile-time constant stack size is an optimization of the general case of an unknown size. It seems to be rather important, since the loop is unrolled if only a few iterations are required.
> 
> I think we could emit a pseudo from emitProbedAlloca() and handle that in SystemZFrameLowering instead in the same place, if we would allow the case of a constant size be loaded into a register before the loop. The loop itself might also have one extra instruction or so, but the common case would  be to unroll that loop...

Hmm, there's probably enough differences that may make it not worth spending effort here, given that you've already implemented it now.

> Can R0D and R1D be assumed to be available in the prologue at this point? (how about anyreg-callconv?)

Yes.  See SystemZCallingConv.td:

  // "All registers" as used by the AnyReg calling convention.
  // Note that registers 0 and 1 are still defined as intra-call scratch
  // registers that may be clobbered e.g. by PLT stubs.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78717/new/

https://reviews.llvm.org/D78717