[llvm-dev] LLD: Using sendfile(2) to copy file contents

David Chisnall via llvm-dev llvm-dev at lists.llvm.org
Mon Jun 6 02:24:34 PDT 2016


On 5 Jun 2016, at 21:19, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> This is a short summary of an experiment that I did for the linker.
> 
> One of the major tasks of the linker is to copy file contents from input object files to an output file. I was wondering what's the fastest way to copy data from one file to another, so I conducted an experiment.
> 
> Currently, LLD copies file contents using memcpy (the input files and the output file are mapped to memory). mmap+memcpy is not necessarily the fastest way to copy file contents.
> 
> Linux has the sendfile system call. It takes two file descriptors and copies contents from one to the other (it used to accept only a socket as the destination, but these days it can take any file). It is usually much faster than memcpy for copying files. For example, it is about 3x faster than the cp command for copying large files on my machine (SSD/ext4).
> 
> I made a change to LLVM and LLD to use sendfile instead of memcpy to copy section contents. Here's the time to link clang with debug info.
> 
>     memcpy: 12.96 seconds
>     sendfile: 12.82 seconds
> 
> sendfile(2) was slightly faster, but not by much. However, if you disable string merging (by passing -O0 to the linker), the difference becomes noticeable.
> 
>     memcpy: 7.85 seconds
>     sendfile: 6.94 seconds
> 
> I think this is because, with -O0, the linker has to copy more content than it does otherwise: the output executable is about 2x larger than without -O0. As the amount of data the linker needs to copy grows, sendfile becomes more effective.
> 
> By the way, gold takes 27.05 seconds to link it.
> 
> Given these results, I'm not going to submit the change, for two reasons. First, the optimization seems too system-specific, and I'm not yet sure it is always effective even on Linux. Second, the current implementations of MemoryBuffer and FileOutputBuffer are not sendfile(2)-friendly because they close their file descriptors immediately after mapping the files to memory. My patch is too hacky to submit.
> 
> That being said, the results clearly show that there is room for further optimization. I think we should revisit this when we want to do low-level optimization of link speed.
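
For reference, here is a minimal sketch of the two copying schemes being compared. This is not the actual patch; the function names and parameters are illustrative only, and error handling is reduced to the bare minimum.

    #include <sys/sendfile.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstring>

    // Current scheme: both the input and the output file are mmap'd, and the
    // section contents are moved with an ordinary memcpy.
    void copyWithMemcpy(void *OutMap, const void *InMap, size_t Size) {
      std::memcpy(OutMap, InMap, Size);
    }

    // sendfile(2) scheme: the kernel copies Size bytes from InFD (starting at
    // InOffset) to OutFD without bouncing the data through userspace.
    // sendfile writes at OutFD's current file offset, so seek there first.
    bool copyWithSendfile(int OutFD, off_t OutOffset, int InFD, off_t InOffset,
                          size_t Size) {
      if (lseek(OutFD, OutOffset, SEEK_SET) < 0)
        return false;
      while (Size > 0) {
        ssize_t N = sendfile(OutFD, InFD, &InOffset, Size);
        if (N <= 0)
          return false; // error (or unexpected EOF on the input)
        Size -= static_cast<size_t>(N);
      }
      return true;
    }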

This approach is only likely to yield a speedup if you are copying more than a page, because then there is the potential for the kernel to avoid a memcpy and just alias the pages in the buffer cache (note: most systems won’t do this anyway, but at least you’re exposing an optimisation opportunity to the kernel).  Using the kernel’s memcpy in place of the userspace one is likely to be slightly slower, as kernel memcpy implementations often don’t take advantage of vector operations, in order to avoid having to save and restore FPU state for each kernel thread.  If you’re hitting cache misses then that won’t make much difference anyway (and on x86, depending on the manufacturer, you may hit a pattern that the microcode recognises and have your code replaced entirely with a microcoded memcpy).

One possible improvement would be a custom memcpy that uses non-temporal stores, as this memory is unlikely to be touched by the CPU in the near future (though on recent Intel chips the DMA unit shares the LLC with the CPU, so it will be pulled back into L3 on writeback) and probably won’t be DMA’d for another 10-30 seconds.  If the DMA happens sooner, non-temporal stores can adversely affect performance: on Intel chips the DMA controller is limited to using a subset of the cache, so having the CPU pull into cache the data that is about to be DMA’d out can actually increase performance (ironically, some zero-copy optimisations harm performance on these systems for this reason).  Non-temporal stores should reduce cache pressure, as they will all go through a single way of the (typically) 8-way associative cache.  If this is also the last time that you’re going to read the data, then using non-temporal loads may help as well.  Note, however, that non-temporal hints are advisory, and some x86 microcode implementations make quite surprising decisions.
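
As a concrete illustration of the non-temporal store idea, here is a minimal sketch using SSE2 streaming stores. Again, this is illustrative only, not proposed LLD code; it assumes the destination is 16-byte aligned and copies any remaining tail with ordinary stores.

    #include <emmintrin.h> // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence
    #include <cstddef>

    // Copy Size bytes using non-temporal (streaming) stores so that the output
    // data does not displace useful lines from the cache. Assumes Dst is
    // 16-byte aligned, as MOVNTDQ requires.
    void copyNonTemporal(void *Dst, const void *Src, size_t Size) {
      auto *D = static_cast<char *>(Dst);
      auto *S = static_cast<const char *>(Src);
      size_t I = 0;
      for (; I + 16 <= Size; I += 16) {
        __m128i V = _mm_loadu_si128(reinterpret_cast<const __m128i *>(S + I));
        _mm_stream_si128(reinterpret_cast<__m128i *>(D + I), V);
      }
      for (; I < Size; ++I) // tail
        D[I] = S[I];
      // Streaming stores are weakly ordered; fence before anything that
      // depends on the data being globally visible.
      _mm_sfence();
    }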

David


