[PATCH] D131628: [LangRef] Add description for nocallback attribute

Mon Aug 15 09:32:49 PDT 2022

jdoerfert added a comment.

In D131628#3721285 <https://reviews.llvm.org/D131628#3721285>, @nlopes wrote:

> In D131628#3719212 <https://reviews.llvm.org/D131628#3719212>, @jdoerfert wrote:
>
>> In D131628#3719136 <https://reviews.llvm.org/D131628#3719136>, @nlopes wrote:
>>
>>> Anyway, the semantics issue pointed out above must be resolved or else the patch must be reverted as currently the lowering of clang is wrong and will miscompile programs.
>>
>> Let's just resolve this now and here.
>>
>> I figured us using it implies we need it. However, here is some data: https://tianshilei.me/wp-content/uploads/2022/06/ipdps-2022.pdf
>
> My question is normal and reasonable.

My answers are too :). One thing that is missing from this conversation is an example of a breakage. Or did I miss that?

> I ask it every time someone wants to add something new to the IR.
> The LLVM community must know why it has to deal with extra complexity forever. I will refuse any IR feature that gives 0.00001% speedup for 1 benchmark.

That is great, but not the case here.

>> Note section IV-B2 which talks about inter-procedural reachability, that is only possible with nocallback on intrinsics (and leaf on malloc and free in some cases).
>
> No numbers are provided. My question still stands.

There are numbers in the paper, both for the impact of the entire inter-procedural store-load forwarding as well as the different parts.
All OpenMP GPU benchmarks benefit from this substantially, especially once all paper parts have been upstreamed. 
That is why we did the study (=paper) and prototype. Cleaning it up and upstreaming simply takes time.

> Plus, malloc/free have the `inaccessiblememonly` attribute, which means they don't need the `nocallback` attribute. These attributes are transitive, so it's a given that malloc cannot call any function that doesn't have the `inaccessiblememonly` attribute.

That is easy to say and hard to do. If you set analysis up composable it's hard to embed "cross query" arguments, e.g., no local memory is accesses imply no local code can be executed.
At the end of the day you never know why some user might ask for reachability, or a call graph. The logic you propose doesn't work to deduce norecurse for the caller, for example.
That said, we are about to replace malloc/free on the GPU with custom implementations (=definitions) this is necessary for AMD anyway and the NVIDIA impl is slow, so replacing it makes sense.
As that change dropped, neither malloc nor free will be `inaccessiblememonly` anymore. Most other libc functions we are about to provide would be annotated as well, incl. the ones that go to the host with rpc calls.

> What's the impact of this attribute in your reachability analysis?

It allows to avoid spurious call edges due to intrinsics and annotated functions (see above) instead of treating intrinsics as if they would be implicitly nocallback.

Below, we want to know if `g()` can be reached by `f()`. We can only be answered with "no" if we have `nocallback` on llvm.memset.

  void kernel() { 
   g();
   f();
  }
  static void f() {
    llvm.memset(...) 
  }
  static void g() { ... }

Now you might argue that LLVM already builds a call graph where that can be determined, but that is only because it erroneously assumes `nocallback` (effectively) to any intrinsic.
Let's look at the following example (https://godbolt.org/z/nef9n6qob)

  define dso_local void @test(ptr %A) {
  entry:
    call void @llvm.memset.p0.i64(ptr %A, i8 0, i64 100, i1 false)
    call void @llvm.unknown.new.intrinsic(ptr @bar)
    ret void
  }

  declare void @bar()

  declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg)

  declare void @llvm.unknown.new.intrinsic(ptr)

`@test` is calling two functions and the unknown intrinsic might just as well call `@bar`. However, the call graph doesn't reflect either:

  Call graph node <<null function>><<0x555c5607bb10>>  #uses=0
    CS<None> calls function 'test'
    CS<None> calls function 'bar'
    CS<None> calls function 'llvm.memset.p0.i64'
    CS<None> calls function 'llvm.unknown.new.intrinsic'

  Call graph node for function: 'bar'<<0x555c5607bc50>>  #uses=1
    CS<None> calls external node

  Call graph node for function: 'llvm.memset.p0.i64'<<0x555c5607bcd0>>  #uses=1

  Call graph node for function: 'llvm.unknown.new.intrinsic'<<0x555c5607be00>>  #uses=1

  Call graph node for function: 'test'<<0x555c5607bbd0>>  #uses=1

  Compiler returned: 0

>>> 2. [...]  A simple hack would be to drop all nocallback attributes when linking IR files. That would fix the LTO case.
>>
>> `nocallback` follows, as I mentioned before, the same idea as `inaccesiblememory`. You are not wrong about LTO, but we can apply the same handling to both which is to drop the attribute during a llvm-link step. We only need to do that if we link in the definition. That is why I said this should only be valid on declarations. (inaccessiblememory should also have that restriction).
>
> It's not the same thing as `nocallback` is given by users and `inaccessiblememonly` is not.

True. But that might be subject to change as more people want to expose more attributes to the user. Different discussion though.

> LLVM adds the latter only to libcalls. It's not always wrong to do LTO with it; only if a lib function calls another function and they share the same globals -- not sure it exists in practice.

That is not true and not what llvm-link does. We need to drop either attribute if we link in a definition. Let's run an example:

`a.ll`

  define i32 @main() {
  entry:
    %r1 = call i32 @foo1()
    %r2 = call i32 @foo2()
    ret i32 %r2
  }

  declare i32 @foo1() inaccessiblememonly
  declare i32 @foo2() inaccessiblememonly

`b.ll`

  @g = internal global i32 0

  define i32 @foo1() noinline {
  entry:
    %l = load i32, i32* @g
    %a = add i32 %l, 1
    store i32 %a, i32* @g
    ret i32 %l
  }
  define i32 @foo2() {
  entry:
    %l = load i32, i32* @g
    %a = add i32 %l, 1
    store i32 %a, i32* @g
    ret i32 %l
  }

Now we run `llvm-link a.ll b.ll -o c.ll`, note that no globals are shared, `a.ll` has no globals after all.

`c.ll`

  @g = internal global i32 0

  define i32 @main() {
  entry:
    %r1 = call i32 @foo1()
    %r2 = call i32 @foo2()
    ret i32 %r2
  }

  ; Function Attrs: noinline
  define i32 @foo1() #0 {
  entry:
    %l = load i32, i32* @g
    %a = add i32 %l, 1
    store i32 %a, i32* @g
    ret i32 %l
  }

  define i32 @foo2() {
  entry:
    %l = load i32, i32* @g
    %a = add i32 %l, 1
    store i32 %a, i32* @g
    ret i32 %l
  }

  attributes #0 = { noinline }

`inaccessiblememonly` is gone (which is correct and necessary).

> Nevertheless, we cannot introduce a new IR feature that we know that is buggy and without a plan to fix it. Absolutely no way! The patch must be reverted first. Then you can propose a plan to reintroduce the feature in a sound way if it's justified. We need numbers please.

The feature is not buggy. LTO works fine as the `inaccessiblememonly` is actually not hardcoded but a side-effect.
Let's do the test:
`a.ll`

  define i32 @main() {
  entry:
    %r1 = call i32 @foo1()
    %r2 = call i32 @foo2()
    ret i32 %r2
  }

  declare i32 @foo1() inaccessiblememonly nocallback
  declare i32 @foo2() inaccessiblememonly nocallback

`b.ll`

  @g = internal global i32 0

  define i32 @foo1() noinline {
  entry:
    %l = load i32, i32* @g
    %a = add i32 %l, 1
    store i32 %a, i32* @g
    ret i32 %l
  }
  define i32 @foo2() {
  entry:
    %l = load i32, i32* @g
    %a = add i32 %l, 1
    store i32 %a, i32* @g
    call i32 @foo1()
    ret i32 %l
  }

Now we run LTO:
`llvm-lto a.bc b.bc --save-linked-module -o linked`
and look at the merged module (effectively llvm-linked module)
`opt -S linked.linked.bc`

  @g = internal global i32 0

  define internal i32 @main() {
  entry:
    %r1 = call i32 @foo1()
    %r2 = call i32 @foo2()
    ret i32 %r2
  }

  ; Function Attrs: noinline
  define internal i32 @foo1() #0 {
  entry:
    %l = load i32, ptr @g, align 4
    %a = add i32 %l, 1
    store i32 %a, ptr @g, align 4
    ret i32 %l
  }

  define internal i32 @foo2() {
  entry:
    %l = load i32, ptr @g, align 4
    %a = add i32 %l, 1
    store i32 %a, ptr @g, align 4
    ret i32 %l
  }

  attributes #0 = { noinline }

All attributes we needed to drop are gone.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131628/new/

https://reviews.llvm.org/D131628