[llvm] [Support] Always call FlushFileBuffers() when unmapping memory on Windows (PR #78597)

Sat Jan 20 06:52:20 PST 2024

mstorsjo wrote:

> > Try `lld-link .. /time` before/after patch? Or `Measure-Command { lld-link .. }` with powershell if you want to be more precise.
> 
> If you can do that, and also measure the performance impact of this change (on top of this one), that'd be great:

I took a shot at measuring this now after all. I don't have any case where I build binaries that big, but in building a statically linked `clang.exe` at 93 MB, I get an 854 MB PDB file, so that's certainly big enough to be measurable.

The difference in speed is indeed notable. Originally (with no flush calls, i.e. running on a new enough version of Windows), linking takes 4.2 seconds. By always calling `FlushFileBuffers`, this rises to 10.6 seconds, and by using `FlushViewOfFile` as suggested by @cjacek, it takes 10.4 seconds. (I'm not sure whether the difference between the latter two is statistically significant; I didn't take a very large number of samples.)

Breaking it down with `lld-link /time`, I get the following profiles:
```
No flush
--------
  Input File Reading:             575 ms ( 14.9%)
  GC:                             262 ms (  6.8%)
  Code Layout:                    213 ms (  5.5%)
  Commit Output File:              15 ms (  0.4%)
  PDB Emission (Cumulative):     2736 ms ( 70.7%)
    Add Objects:                 1705 ms ( 44.1%)
      Global Type Hashing:        240 ms (  6.2%)
      GHash Type Merging:         437 ms ( 11.3%)
      Symbol Merging:            1023 ms ( 26.4%)
    Publics Stream Layout:         27 ms (  0.7%)
    TPI Stream Layout:             24 ms (  0.6%)
    Commit to Disk:               719 ms ( 18.6%)
--------------------------------------------------
Total Linking Time:              3871 ms (100.0%)

FlushFileBuffers
----------------
  Input File Reading:             579 ms (  5.5%)
  GC:                             260 ms (  2.5%)
  Code Layout:                    208 ms (  2.0%)
  Commit Output File:             620 ms (  5.8%)
  PDB Emission (Cumulative):     8881 ms ( 83.7%)
    Add Objects:                 1848 ms ( 17.4%)
      Global Type Hashing:        375 ms (  3.5%)
      GHash Type Merging:         444 ms (  4.2%)
      Symbol Merging:            1023 ms (  9.6%)
    Publics Stream Layout:         27 ms (  0.3%)
    TPI Stream Layout:             24 ms (  0.2%)
    Commit to Disk:              6721 ms ( 63.3%)
--------------------------------------------------
Total Linking Time:             10614 ms (100.0%)

FlushViewOfFile
---------------
  Input File Reading:             582 ms (  5.7%)
  GC:                             261 ms (  2.6%)
  Code Layout:                    207 ms (  2.0%)
  Commit Output File:             574 ms (  5.7%)
  PDB Emission (Cumulative):     8448 ms ( 83.4%)
    Add Objects:                 1708 ms ( 16.9%)
      Global Type Hashing:        235 ms (  2.3%)
      GHash Type Merging:         437 ms (  4.3%)
      Symbol Merging:            1030 ms ( 10.2%)
    Publics Stream Layout:         27 ms (  0.3%)
    TPI Stream Layout:             24 ms (  0.2%)
    Commit to Disk:              6428 ms ( 63.4%)
--------------------------------------------------
Total Linking Time:             10134 ms (100.0%)
```

So this is clearly a performance regression for these cases, while it improves correctness.
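To make it explicit where the extra time goes, here is a small sketch that redoes the arithmetic on the profiles above. The timings are copied from the `/time` output; the struct and function names are made up for illustration only and aren't anything from lld.

```cpp
#include <cassert>
#include <cstdio>

// Timings in milliseconds, copied from the `lld-link /time` profiles above.
struct LinkProfile {
  const char *Name;
  int CommitOutputFile; // "Commit Output File"
  int PdbCommitToDisk;  // "Commit to Disk" under "PDB Emission"
  int Total;            // "Total Linking Time"
};

const LinkProfile NoFlush{"No flush", 15, 719, 3871};
const LinkProfile WithFlushFileBuffers{"FlushFileBuffers", 620, 6721, 10614};
const LinkProfile WithFlushViewOfFile{"FlushViewOfFile", 574, 6428, 10134};

// Time spent in the two phases that commit mapped output files to disk.
int commitTime(const LinkProfile &P) {
  return P.CommitOutputFile + P.PdbCommitToDisk;
}

// Percentage of the slowdown relative to Base that is accounted for by
// the two commit phases.
double commitShareOfSlowdown(const LinkProfile &Base, const LinkProfile &P) {
  int TotalDelta = P.Total - Base.Total;
  assert(TotalDelta > 0 && "expected P to be slower than Base");
  return 100.0 * (commitTime(P) - commitTime(Base)) / TotalDelta;
}

// Print the slowdown and the commit phases' share of it for both variants.
void printSummary() {
  const LinkProfile *Variants[] = {&WithFlushFileBuffers, &WithFlushViewOfFile};
  for (const LinkProfile *P : Variants)
    std::printf("%-16s +%d ms total, +%d ms in commit phases (%.1f%%)\n",
                P->Name, P->Total - NoFlush.Total,
                commitTime(*P) - commitTime(NoFlush),
                commitShareOfSlowdown(NoFlush, *P));
}
```

With these numbers, the two commit phases account for 6607 of the extra 6743 ms with `FlushFileBuffers`, and for 6268 ms against a total increase of 6263 ms with `FlushViewOfFile` (the other phases vary by a few ms between runs). In other words, practically all of the slowdown is in writing the output files to disk, and most of that is the PDB.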

I'll try to do a PoC for hooking up heuristics via the `is_local` function. (I'm not sure if we have any better guess for cases where we'd need to be extra cautious with flushing files?) The heuristic itself will add some runtime cost, but certainly less than the extra overhead in linking large binaries/PDBs.
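As a rough idea of what such a heuristic could look like, here is a minimal sketch of just the decision logic. Everything in it is hypothetical: `Locality`, `LocalityProbe` and `shouldFlushBeforeUnmap` are illustrative names, the probe stands in for whatever the `is_local` check ends up being, and `LegacyFlushRequired` stands in for the existing condition under which older versions of Windows already get a flush. Treating a failed probe as "flush anyway" is an assumption on my part, not something that has been decided.

```cpp
#include <functional>

// Hypothetical result of asking whether the output file lives on a local
// volume. Unknown means the check itself failed.
enum class Locality { Local, Remote, Unknown };

// Hypothetical stand-in for the `is_local` check; passed as a callback so
// the (not free) check only runs when its answer is actually needed.
using LocalityProbe = std::function<Locality()>;

// Decide whether to flush a mapped output file before unmapping it.
//   LegacyFlushRequired: assumed stand-in for the existing case where
//                        older Windows versions always flush.
bool shouldFlushBeforeUnmap(bool LegacyFlushRequired,
                            const LocalityProbe &Probe) {
  // Keep the existing behaviour; no need to pay for the probe here.
  if (LegacyFlushRequired)
    return true;

  switch (Probe()) {
  case Locality::Local:
    return false; // Fast path: local file, skip the expensive flush.
  case Locality::Remote:
    return true;  // Be cautious with files that are not local.
  case Locality::Unknown:
    return true;  // Assumption: if we can't tell, prefer correctness.
  }
  return true; // Not reached; keeps compilers quiet about missing returns.
}
```

The point of this shape is that local output files, which is the common case in the measurements above, keep the fast no-flush path, and only files that aren't known to be local pay for the flush.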

https://github.com/llvm/llvm-project/pull/78597