[llvm-dev] LLD: Using sendfile(2) to copy file contents

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Mon Jun 6 14:55:16 PDT 2016

On Mon, Jun 6, 2016 at 2:24 AM, David Chisnall via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> On 5 Jun 2016, at 21:19, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org>
> wrote:
> >
> > This is a short summary of an experiment that I did for the linker.
> >
> > One of the major tasks of the linker is to copy file contents from input
> > object files to an output file. I was wondering what's the fastest way to
> > copy data from one file to another, so I conducted an experiment.
> >
> > Currently, LLD copies file contents using memcpy (input files and the
> > output file are mapped into memory). mmap+memcpy is not known to be the
> > fastest way to copy file contents.
> >
> > Linux has the sendfile system call. It takes two file descriptors and
> > copies contents from one to the other (it used to accept only a socket
> > as the destination, but these days it can take any file). It is usually
> > much faster than memcpy for copying files. For example, it is about 3x
> > faster than the cp command at copying large files on my machine (on
> > SSD/ext4).
> >
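For readers who haven't used it, a sendfile-based range copy looks roughly
like the sketch below. This is only an illustration, not the code from
Rui's patch: the helper name copy_range_sendfile and its parameter layout
are made up here, and it assumes Linux with a regular file as the
destination (not opened with O_APPEND).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes from in_fd (starting at in_off) to
 * out_fd (starting at out_off) with sendfile(2), so the data never passes
 * through a userspace buffer. Returns 0 on success, -1 on error. */
int copy_range_sendfile(int out_fd, off_t out_off,
                        int in_fd, off_t in_off, size_t len) {
    /* sendfile writes at out_fd's current file offset, so position it. */
    if (lseek(out_fd, out_off, SEEK_SET) == (off_t)-1)
        return -1;

    while (len > 0) {
        /* With a non-NULL offset pointer, the kernel reads from in_off and
         * advances in_off by the number of bytes transferred. */
        ssize_t n = sendfile(out_fd, in_fd, &in_off, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry */
            return -1;
        }
        if (n == 0) {           /* input ended before len bytes were copied */
            errno = EIO;
            return -1;
        }
        len -= (size_t)n;
    }
    return 0;
}
```

A linker would call something like this once per input section, which is
why the file descriptors have to stay open after the files are mapped.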
> > I made a change to LLVM and LLD to use sendfile instead of memcpy to
> > copy section contents. Here's the time to link clang with debug info.
> >
> >     memcpy: 12.96 seconds
> >     sendfile: 12.82 seconds
> >
> > sendfile(2) was slightly faster, but not by much. If you disable string
> > merging (by passing -O0 to the linker), however, the difference becomes
> > noticeable.
> >
> >     memcpy: 7.85 seconds
> >     sendfile: 6.94 seconds
> >
> > I think this is because, with -O0, the linker has to copy more content
> > than without it: the resulting executable is 2x larger. As the amount of
> > data the linker needs to copy grows, sendfile becomes more effective.
> >
> > By the way, gold takes 27.05 seconds to link it.
> >
> > Given these results, I'm not going to submit that change, for two
> > reasons. First, the optimization seems too system-specific, and I'm not
> > yet sure it's always effective even on Linux. Second, the current
> > implementations of MemoryBuffer and FileOutputBuffer are not
> > sendfile(2)-friendly because they close file descriptors immediately
> > after mapping them to memory. My patch is too hacky to submit.
> >
> > That said, the results clearly show that there's room for future
> > optimization. I think we want to revisit this when we do low-level
> > optimization of link speed.
>
> This approach is only likely to yield a speedup if you are copying more
> than a page, because then there is the potential for the kernel to avoid a
> memcpy and just alias the pages in the buffer cache (note: most systems
> won’t do this anyway, but at least then you’re exposing an optimisation
> opportunity to the kernel).

This assumes that the from/to addresses have the same offset modulo the
page size, which I'm not sure is ever really the case for input sections
and their location in the output.
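To make the condition concrete: page aliasing can only work if the source
and destination file offsets are congruent modulo the page size. A tiny
sketch of that check follows; the function name same_page_phase is
hypothetical, and the 4096-byte page size in the example is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if data at file offset src_off could be placed
 * at dst_off by remapping whole pages, i.e. both offsets sit at the same
 * position within their page. */
bool same_page_phase(uint64_t src_off, uint64_t dst_off, uint64_t page_size) {
    return (src_off % page_size) == (dst_off % page_size);
}
```

For example, with 4096-byte pages, an input section at offset 0x1240 that
lands at output offset 0x5240 passes the check (both are 0x240 into their
page), while one that lands at 0x5250 does not, so the kernel would have to
copy the bytes regardless.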

> Using the kernel’s memcpy in place of the userspace one is likely to be
> slightly slower, as kernel memcpy implementations often don’t take
> advantage of vector operations, to avoid having to save and restore FPU
> state for each kernel thread, though if you’re having cache misses then
> these won’t make much difference (and if you’re on x86, depending on the
> manufacturer, you may hit a pattern that the microcode recognises and have
> your code replaced entirely with a microcoded memcpy).
>
> One possible improvement would be to have a custom memcpy that used
> non-temporal stores, as this memory is likely not to be used at all on the
> CPU in the near future (though on recent Intel chips, the DMA unit shares
> LLC with the CPU, so will pull it back into L3 on writeback) and probably
> not DMA’d for another 10-30 seconds (if it’s sooner, then this can
> adversely affect performance, because on Intel chips the DMA controller is
> limited to using a subset of the cache, so having the CPU pull things into
> cache that are going to be DMA’d out can actually increase performance -
> ironically, some zero-copy optimisations actually harm performance on these
> systems).  This should reduce cache pressure, as the stores will all go
> through a single way in the (typically) 8-way associative cache.  If this
> is also the last time that you’re going to read the data, then using
> non-temporal loads may also help.  Note, however, that the interpretation
> of the non-temporal hints is advisory and some x86 microcode
> implementations make quite surprising decisions.
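For reference, a copy loop with non-temporal stores along the lines David
describes might look like the sketch below. It is an illustration only and
was not measured in this thread: the name copy_nontemporal is hypothetical,
it assumes x86 with SSE2 (falling back to plain memcpy elsewhere), and it
uses streaming stores only, not non-temporal loads.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy len bytes, writing the bulk of the data with
 * non-temporal (streaming) stores so it is hinted to bypass the cache. */
void copy_nontemporal(void *dst, const void *src, size_t len) {
#if defined(__SSE2__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* _mm_stream_si128 needs a 16-byte-aligned destination, so copy the
     * unaligned head with a regular memcpy first. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > len)
        head = len;
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Bulk: unaligned 16-byte loads, streaming 16-byte stores. */
    for (; len >= 16; d += 16, s += 16, len -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }
    _mm_sfence();       /* order the streaming stores before later stores */

    memcpy(d, s, len);  /* tail of fewer than 16 bytes */
#else
    memcpy(dst, src, len);  /* no SSE2: plain copy */
#endif
}
```

As David notes, the hint is advisory, so whether this helps at all depends
on the microarchitecture.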

I don't think that the performance problem of the memcpy here is D-cache
related (it is just a memcpy, so it should prefetch well). I measured our
memcpy to the output at < 1 GB/s of throughput (on a machine that
can do >60 GB/s of DRAM bandwidth; see http://reviews.llvm.org/D20645#440638).
My guess is that the problem here is more about virtual memory cost (the
kernel having to fix up page tables, zero-fill pages, etc.).

-- Sean Silva

> David