[flang-commits] [PATCH] D148910: [flang] Turn on use-desc-for-alloc by default

Jean Perier via Phabricator via flang-commits flang-commits at lists.llvm.org
Fri Apr 21 05:09:00 PDT 2023


jeanPerier added a comment.

In D148910#4286583 <https://reviews.llvm.org/D148910#4286583>, @tblah wrote:

> Could there be cases where LLVM optimization passes take longer to run because they are removing these descriptors we know we don't need? I would guess this is okay for real code but I was wondering if you have tried it?

Good question. The problem is the "we know we don't need". At that point of the lowering code, we do not really know that we do not need a descriptor; in fact, we still create one with this option turned off, because as soon as the pointer/allocatable is passed as a pointer/allocatable to the runtime or to user code, we need a descriptor, and we insert sync operations from/to these values after/before the call. This early optimization is only a win if the pointer/allocatable never escapes.

Here are some dumb numbers to illustrate:

program_1.f90:

  subroutine foo(x)
    real, target :: x(100)
    real, pointer, contiguous :: p(:)
    p => x
    call bar(p(50))
    ! ...
    ! ... repeat call bar(p(50)) 10000 times.
    ! ...
  end subroutine

`--use-desc-for-alloc=false`:

- Number of .ll lines after code generation from FIR to LLVM: 50k.
- time flang-new -O1 (on a beefy X86_64 machine): 2s.

`--use-desc-for-alloc=true`:

- Number of .ll lines after code generation from FIR to LLVM: 180k.
- time flang-new -O1 (on a beefy X86_64 machine): 5s.

-> Same code after -O1, but ~2.5x faster compilation with the previous `--use-desc-for-alloc=false` option.
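For reference, the 10000 repeated calls can be generated with a trivial script rather than pasted by hand. A sketch (the file name and the exact flang-new invocation in the comment are illustrative, not part of the patch):

```shell
#!/bin/sh
# Generate program_1.f90 with `call bar(p(50))` repeated 10000 times,
# matching the structure of the example above.
{
  echo 'subroutine foo(x)'
  echo ' real, target :: x(100)'
  echo ' real, pointer, contiguous :: p(:)'
  echo ' p => x'
  for i in $(seq 1 10000); do
    echo ' call bar(p(50))'
  done
  echo 'end subroutine'
} > program_1.f90

# Then time compilation with each option setting, e.g.:
#   time flang-new -O1 -c program_1.f90
wc -l program_1.f90
```

The same script with the interface block and `call bar_ptr(p)` substituted reproduces program_2.f90.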

program_2.f90:

  subroutine foo(x)
    real, target :: x(100)
    real, pointer, contiguous :: p(:)
    interface
      subroutine bar_ptr(p)
        real, pointer, contiguous :: p(:)
      end subroutine
    end interface

    p => x
    call bar_ptr(p)
    ! ...
    ! ... repeat call bar_ptr(p) 10000 times.
    ! ...
  end subroutine

`--use-desc-for-alloc=false`:

- Number of .ll lines after code generation from FIR to LLVM: 300k.
- time flang-new -O1 (on a beefy X86_64 machine): 25s. //And LLVM did not manage to get rid of all the intermediate descriptor syncs, even at -O3!//

`--use-desc-for-alloc=true`:

- Number of .ll lines after code generation from FIR to LLVM: 10k.
- time flang-new -O1 (on a beefy X86_64 machine): 0.8s.

-> ~30x (!) faster compilation with the new `--use-desc-for-alloc=true` option, and better code with it (the syncs from/to the descriptor are not optimized away even at -O3 with `--use-desc-for-alloc=false`). The reason the sync optimization fails is that LLVM cannot tell that it can get rid of the code that sets up the CFI codes and versions in the descriptor when syncing from the values to the fir.box (it assumes the previous call may have modified those descriptor fields, so it must keep the field updates generated by the fir.embox before the next call).

**Conclusion: the answer to your question depends on the actual workloads... I was able to write you stupid code that compiles 30x faster with this patch, but it is probably not representative** (the difference only shows when the pointer/allocatable is passed many times to a procedure taking a pointer/allocatable). The take-away is that the "sync" operations inserted by the current lowering are expensive and hard to optimize away, while it is easy for LLVM to get rid of descriptors. So this was probably a premature optimization anyway. Overall, on "actual" workloads, I do not really expect we will be able to measure much difference after this patch. But if you do see perf regressions, we can revert it and try to add a FIR pass to "scalarize" fir.box values before we turn them into big LLVM structs and many insert/extract operations. Such a pass would be able to see the cost of scalarization easily, which we cannot at this point of lowering.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148910/new/

https://reviews.llvm.org/D148910
