[llvm-dev] slow performance in llc.exe to do with large global floating point arrays
Eli Friedman via llvm-dev
llvm-dev at lists.llvm.org
Thu May 9 18:30:23 PDT 2019
Have you tried -time-passes to see where it's actually spending time? I don't think there's currently a timer that covers printing global variables to assembly, but you should be able to rule out something else.
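For example, something like this (file names here are just placeholders) will print a per-pass timing report to stderr when compilation finishes:

    llc -time-passes -filetype=obj model.bc -o model.o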
Currently, the fastest path for emitting global data into an object file is an i8 array "[1000000 x i8]"; given a module in memory, we make one extra copy over the ideal of just calling write() directly on the bits, which should be fast enough for most purposes. A "[1000000 x float]" currently takes a less efficient path that copies the values one by one, but it probably wouldn't be hard to optimize. If you're emitting something that isn't just an array of constant data, it gets less efficient still. See emitGlobalConstantDataSequential in lib/CodeGen/AsmPrinter/AsmPrinter.cpp.
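As a rough illustration (the global and function names below are made up, and the byte string assumes a little-endian target), the same four weights can be expressed either way; the i8 form is the one that hits the fast path today:

    ; slower path today: floats are copied out one element at a time
    @weights = internal constant [4 x float] [float 1.0, float 2.0, float 3.0, float 4.0]

    ; faster path today: the same bits as raw bytes (align 4 keeps float alignment)
    @weights_raw = internal constant [16 x i8] c"\00\00\80?\00\00\00@\00\00@@\00\00\80@", align 4

    ; use sites can bitcast back to float* (typed-pointer syntax, LLVM 8 era)
    define float @get_weight(i64 %i) {
      %p = bitcast [16 x i8]* @weights_raw to float*
      %q = getelementptr float, float* %p, i64 %i
      %v = load float, float* %q
      ret float %v
    }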
-Eli
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Chris Lovett via llvm-dev
Sent: Wednesday, May 8, 2019 3:52 PM
To: llvm-dev at lists.llvm.org
Subject: [EXT] [llvm-dev] slow performance in llc.exe to do with large global floating point arrays
We are building a neural network compiler using LLVM, see https://github.com/Microsoft/ELL.
We want to put the neural network weights into a bunch of global float arrays because it allows us to more easily leverage flash memory on small embedded devices. For example, it enables scenarios like this
keyword spotting demo<https://lovettchris.github.io/posts/keyword_spotting/>.
We are seeing some pretty bad compiler performance in some cases. For example, this github gist<https://gist.github.com/lovettchris/91e30bce1d18f16eddaf67306101e4e0> contains a bitcode file for a neural network compiled by ELL that carries about 30 MB of floating point data. Putting it through llc takes 262 seconds (on an Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz), but if we strip out the weights, the "code" component of our neural network inference compiles in only 2 seconds.
We've noticed a good improvement in LLVM 8.0 in this area, but we think there's still a lot more that could be done. For example,
is it possible to dump big arrays of global floating point data into a binary without invoking huge assembly writer overhead?
Perhaps what is happening is that the optimizer is trying to optimize away unused floats; we would like to disable that and just tell the compiler to dump the floats into the object file without bothering to optimize them (see the sketch below)....
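For concreteness, here is the kind of thing we mean, with a hypothetical @weights array: as far as we understand, adding a global to the standard @llvm.used list tells LLVM to assume the data is used and leave it alone:

    @weights = internal constant [4 x float] [float 1.0, float 2.0, float 3.0, float 4.0]
    @llvm.used = appending global [1 x i8*] [i8* bitcast ([4 x float]* @weights to i8*)], section "llvm.metadata"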
Any thoughts?