[PATCH] D94267: [PDB] Defer relocating .debug$S until commit time and parallelize it
Alexandre Ganea via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Jan 12 14:02:59 PST 2021
aganea added a comment.
In D94267#2491787 <https://reviews.llvm.org/D94267#2491787>, @rnk wrote:
> In D94267#2491643 <https://reviews.llvm.org/D94267#2491643>, @aganea wrote:
>
>> This is mostly caused by contention when page faulting on either mmap'ed files or zero-pages. KeYieldProcessorEx spins while waiting for a lock to bring pages into the 'working set'.
>
> Interesting, I've heard similar things about LLD ELF. I wonder how much of this is IO and how much of this is locks around modifying the process page directory.
In my case at least, it's exclusively due to virtual page management. There's no disk IO; everything was already in the cache. The only disk activity is the two spikes at the end, which are the deferred System writes of the PDB & the EXE.
F15010912: lld_io_cached.PNG <https://reviews.llvm.org/F15010912>
After clearing the Windows file cache, I get this:
(the top graph is CPU usage, the bottom graph is disk IO throughput)
F15013311: lld_io_cold.PNG <https://reviews.llvm.org/F15013311>
Since "Input File Reading" & "GC" are single-threaded, the application itself is the bottleneck rather than the disk. The RAID array on the machine sustains a measured 6.2 GB/s read. Even in the phases that are multithreaded, the disk IO never reaches that figure, and "PDB Emission" takes exactly the same time regardless of the cache state. I think on a HDD the IO could be an issue, but not on modern SSDs.
  Input File Reading:            9525 ms ( 26.5%)
  GC:                           13852 ms ( 38.6%)
  Code Layout:                    982 ms (  2.7%)
  Commit Output File:              38 ms (  0.1%)
  PDB Emission (Cumulative):    11030 ms ( 30.7%)
    Add Objects:                 6442 ms ( 17.9%)
      Global Type Hashing:        889 ms (  2.5%)
      GHash Type Merging:        1349 ms (  3.8%)
      Symbol Merging:            3754 ms ( 10.4%)
    Publics Stream Layout:        620 ms (  1.7%)
    TPI Stream Layout:             51 ms (  0.1%)
    Commit to Disk:              2595 ms (  7.2%)
  --------------------------------------------------
  Total Link Time:              35930 ms (100.0%)   <-- cold cache, was 17 sec with hot cache
> Similar to inserting prefetches, I was wondering if there are APIs we could use to load the obj in phases:
>
> - reserve memory for the entire file
> - commit only the portions of the object used for symbol table, section table, and relocations
> - resolve symbols, run linker GC
> - commit section content memory for sections marked live, do not load memory for non-live sections
>
> This would be much more explicit, similar to explicit seeks and reads: we'd explicitly get the data from the FS when we need it.
Yes, that's pretty much what `PrefetchVirtualMemory` does: you give it a bunch of memory ranges, and it fetches them all in parallel for you, in the background. When a memory-mapped file is opened, nothing is committed. Prefetching the mapped pages initiates the IO and then brings the pages into the process space. Ideally, we should compute the file regions and explicitly prefetch them as early in the process as possible.
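A rough sketch of what that could look like (untested; `prefetchRegions` and the (offset, size) region list are hypothetical helpers for illustration, while `PrefetchVirtualMemory` and `WIN32_MEMORY_RANGE_ENTRY` are the actual Win32 API):

  #include <windows.h>
  #include <utility>
  #include <vector>

  // Hypothetical helper: ask the OS to page in selected regions of a
  // memory-mapped input file, in parallel and in the background. `base`
  // is the start of the mapped view; `regions` are (offset, size) pairs
  // computed from the headers, symbol table and relocations.
  static bool prefetchRegions(
      char *base, const std::vector<std::pair<size_t, size_t>> &regions) {
    std::vector<WIN32_MEMORY_RANGE_ENTRY> entries;
    entries.reserve(regions.size());
    for (const auto &r : regions)
      entries.push_back({base + r.first, r.second});
    // PrefetchVirtualMemory queues the IO for all ranges at once. It is
    // only a hint and requires Windows 8+ (_WIN32_WINNT >= 0x0602), so
    // callers must tolerate failure.
    return PrefetchVirtualMemory(GetCurrentProcess(), entries.size(),
                                 entries.data(), /*Flags=*/0) != 0;
  }

The hard part is issuing this early enough: the hint could be queued as soon as we know which parts of each file we'll touch, so the IO overlaps with other single-threaded work instead of being paid for at first fault.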
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D94267/new/
https://reviews.llvm.org/D94267