[flang-commits] [PATCH] D88981: [flang] Rework host runtime folding and enable REAL(2) folding with it.

Wed Oct 14 06:15:19 PDT 2020

jeanPerier added a comment.

> The patch compiles successfully with msvc (with a patch to trunk that I still need to upload a patch for).

Thanks for testing this @Meinersbur !

================
Comment at: flang/include/flang/Evaluate/common.h:11
 #define FORTRAN_EVALUATE_COMMON_H_

 #include "flang/Common/Fortran.h"
----------------
klausler wrote:
> Does removing this member from the folding context make them cheap to construct again?
Yes, a FoldingContext is now 100 times cheaper to construct according to my measurements. This improves fcvs `f18 -fparse-only` time by 12% on average.

**FoldingContext ctor 100x speedup**
With f18 compiled with gcc 8.3 in release mode on an Intel Xeon Gold 6148, I measured 0.05ms per FoldingContext construction before vs 0.00005ms with this patch (average of 10000 ctor calls in one run; I repeated the runs 10 times and got stable results). Measurements were done by instrumenting the code (https://github.com/jeanPerier/llvm-project/commit/f511284b54805aa314c1316f9143d0d0cbaa522d).
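
The averaging scheme described above (time a batch of constructions, divide by the number of calls) can be sketched as follows. This is only an illustration under stated assumptions: `DummyContext` is a hypothetical stand-in and not flang's `FoldingContext`, and `averageCtorNanos` is a name invented here; the actual instrumentation is the one in the linked commit.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an object with a non-trivial constructor.
// It is NOT flang's FoldingContext.
struct DummyContext {
  std::vector<int> table;
  explicit DummyContext(std::size_t n) : table(n, 0) {}
};

// Returns the average wall-clock time, in nanoseconds, of `calls`
// invocations of `construct`. `construct` must return a value convertible
// to std::size_t; the results are accumulated so that the compiler is less
// likely to optimize the constructions away.
template <typename F>
double averageCtorNanos(std::size_t calls, F construct) {
  using Clock = std::chrono::steady_clock;
  std::size_t acc = 0;
  const auto start = Clock::now();
  for (std::size_t i = 0; i < calls; ++i) {
    acc += construct();
  }
  const auto stop = Clock::now();
  volatile std::size_t sink = acc; // keep the accumulated result observable
  (void)sink;
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return calls == 0 ? 0.0 : elapsed.count() / static_cast<double>(calls);
}
```

A call such as `averageCtorNanos(10000, [] { return DummyContext(64).table.size(); })`, repeated ten times, mirrors the "10000 ctor calls, 10 runs" setup; the absolute numbers depend on the machine and the build mode.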

Given that a FoldingContext is constructed for every function call check when an explicit interface is available, this can translate into a 4x speed-up of `time f18 -fparse-only` on carefully designed tests like:

```
 real, parameter :: x = 0.5

 ! Semantic analysis of each following line ends up in 3 FoldingContext ctor calls
 real, parameter :: y1 = acos(x)
 real, parameter :: y2 = acos(x)
 ! ... repeated 9997 times 
 real, parameter :: y10000 = acos(x)
end
```

I measured 2s before vs 0.5s with this patch (`time f18 -fparse-only` real time).
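
Since the test file above elides most of its lines, it is easier to generate than to write by hand. A minimal sketch of a generator is shown here; `makeFoldingStressTest` is a hypothetical helper name and is not part of the patch.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Builds the Fortran source of the stress test: one parameter `x`, then
// `count` parameters y1..y<count>, each initialized with acos(x), then `end`.
std::string makeFoldingStressTest(std::size_t count) {
  std::ostringstream out;
  out << " real, parameter :: x = 0.5\n";
  for (std::size_t i = 1; i <= count; ++i) {
    out << " real, parameter :: y" << i << " = acos(x)\n";
  }
  out << "end\n";
  return out.str();
}
```

Writing the result of `makeFoldingStressTest(10000)` to a file (for example with `std::ofstream`) gives an input with the same shape as the one timed above.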

**Host folding 1.2x slowdown**
However, there is a 20% time penalty per fold with the host runtime with this patch (most likely due to the added encapsulation/decapsulation of Scalar to/from Expr<SomeExpr> in the folder). I measured the time spent in Evaluate/fold-real.cpp `FoldIntrinsicFunction` on the test file above. We spent 1.3usec per fold before vs 1.6usec with this patch (average of the 10000 folds, repeated 10 times). Given that this is at the usec level, it is negligible for scalar folds since we create 3 FoldingContext per expression. For array expressions, it can lead to an overall slowdown in compilation (which will never be bigger than 20%). For instance, I measured a 1% overall slowdown in a program folding `acos(a_10000_element_array)` (93ms before vs 94ms now).

**Conclusion: 12% overall parsing+semantics speed-up on real code**
On fcvs, `time f18 -fparse-only fm*.f` real time went from 4.3s to 3.8s (ten-run average). So this has a visible impact on real code.
Since scalar folding is much more widespread than huge array folding, the patch seems a win to me.
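
The ratios quoted in this comment follow directly from the raw timings. The sketch below re-derives them; the two helper names are hypothetical and only serve the illustration.

```cpp
// How many times larger `before` is than `after` (a speed-up factor when
// `before` is the old time and `after` is the new time).
constexpr double speedupFactor(double before, double after) {
  return before / after;
}

// Time saved as a fraction of `before`.
constexpr double relativeSaving(double before, double after) {
  return (before - after) / before;
}

// Timings taken from the text above.
constexpr double stressTestSpeedup = speedupFactor(2.0, 0.5); // 2s -> 0.5s
constexpr double fcvsSaving = relativeSaving(4.3, 3.8);       // 4.3s -> 3.8s
constexpr double foldSlowdown = speedupFactor(1.6, 1.3);      // 1.3usec -> 1.6usec
```

With these inputs the stress test comes out at 4x, the fcvs saving at a little under 12%, and the per-fold slowdown at roughly 1.2x.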

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88981/new/

https://reviews.llvm.org/D88981