Hello folks,<div><br></div><div>Based on a few reports, I've been tracking down some extremely slow compiles of small, reasonable code snippets, and it turns out that most of them look exactly like PR13392. Not only does creating i1024 and i2048 variables everywhere in SROA confuse the daylights out of the ARM codegen, it also makes lots of the IR and DAG optimizers slower because it slows down ComputeDemandedBits and other APInt operations on these values.</div>

<div><br></div><div>To recap from the bug, SROA sees something like:</div><div><br></div><div>void f(...) {</div><div>  double data[16];</div><div>  ... lots of math ....</div><div>}</div><div><br></div><div>And it turns this alloca of 16 x 8 bytes into a single i1024 value. =[</div>
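<div><br></div><div>For illustration only, here is a minimal, compilable sketch of that shape of code. The function name and the math are made up (this is not the test case from PR13392), and whether a given compiler version actually forms an i1024 for it depends on pass ordering and unrolling.</div>

```c
/* Hypothetical example of the pattern: a small local array of doubles
   that is only ever accessed element by element.
   16 elements x 8 bytes = 128 bytes = 1024 bits. */
double sum_of_squares16(const double in[16]) {
    double data[16];
    double total = 0.0;

    /* Fill the local array; the bounds are compile-time constants. */
    for (int i = 0; i < 16; ++i)
        data[i] = in[i] * in[i];

    /* Read every element back; the array never escapes the function. */
    for (int i = 0; i < 16; ++i)
        total += data[i];

    return total;
}
```

<div>Ideally each element of data would end up as its own scalar value rather than as a slice of one wide integer.</div>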

<div><br></div><div>There is a very direct solution to this: we can restrict the part of SROA that converts an aggregate alloca into an alloca of a single integer (or a vector) so that it only fires when that integer type is a valid type for the target. Unfortunately, on its own this essentially turns off SROA for most arrays. The performance implications are quite bad.</div>

<div><br></div><div><br></div><div>The reason why normal SROA doesn't kick in here is pretty straightforward as well -- we have a set of thresholds that limit how large an entity SROA will process. These take the form of an element-wise limit and a size limit. Many of the cases which the above SROA is "handling" are large enough to exceed these limits, so I dug into the fundamental reason why the limits exist: <a href="http://llvm.org/PR1446">http://llvm.org/PR1446</a></div>

<div><br></div><div>It turns out that if you take that test case, update it a bit and run it through today's LLVM, it is optimized very efficiently. But it does explode the IR from 1 instruction (memcpy) to 1k instructions when it hits a large array used with a memcpy.</div>
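<div><br></div><div>A rough sketch of the kind of code that triggers this, assuming a hypothetical 1024-element aggregate (the type, the function name, and the size are invented for illustration and are not taken from PR1446):</div>

```c
#include <string.h>

/* Hypothetical large aggregate; the element count is an assumption. */
struct big {
    int values[1024];
};

/* The whole aggregate is copied with a single memcpy, but only one
   element is read afterwards. Splitting `local` into one scalar per
   element would turn that single memcpy into roughly one load/store
   pair per element, even though the code using `local` is tiny. */
int last_after_copy(const struct big *src) {
    struct big local;
    memcpy(&local, src, sizeof local);
    return local.values[1023];
}
```

<div>Here the aggregate is large while the IR that uses it is small, which is exactly the case where element-wise splitting costs the most.</div>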

<div><br></div><div>This is the fundamental thing that seems important to preserve in limiting SROA: we don't want the growth of IR due to SROA to scale with the size of the aggregate; we want it to scale with the size of the existing IR *using* that aggregate. This is a much more targeted threshold.</div>

<div><br></div><div>I've attached a very rough patch that seems to make this switch. It does three things:</div><div><br></div><div>1) Remove the default thresholds. They remain available although I question their utility....</div>

<div>2) Require, in the logic that converts an aggregate alloca to a single integer alloca, that the bitwidth fit in a legal integer for the target unless there is a vector load/store involved.</div><div>3) Add a new check to normal SROA which checks whether an element-wise access stems from a memcpy and touches an element that could not itself be converted to a vector or integer alloca.</div>
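<div><br></div><div>To make item 2 concrete, here is a small standalone model of the rule. It is not code from the patch and it does not use LLVM's API; the function names, the parameters, and the list of legal widths are all assumptions made for the sketch.</div>

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `bits` fits in at least one of the
   integer widths the target treats as legal. */
static bool fits_in_legal_integer(unsigned bits,
                                  const unsigned *legal_widths,
                                  size_t n_legal) {
    for (size_t i = 0; i < n_legal; ++i)
        if (bits <= legal_widths[i])
            return true;
    return false;
}

/* Hypothetical model of item 2: converting an aggregate alloca to a
   single integer alloca is allowed if a vector load/store is involved,
   or if the total bitwidth fits in a legal integer for the target. */
bool may_convert_to_single_integer(unsigned alloca_bits,
                                   bool has_vector_access,
                                   const unsigned *legal_widths,
                                   size_t n_legal) {
    if (has_vector_access)
        return true;
    return fits_in_legal_integer(alloca_bits, legal_widths, n_legal);
}
```

<div>With legal widths of 8, 16, 32 and 64 bits, the 1024-bit array from the example above is rejected unless a vector access is involved, while a 64-bit aggregate is still accepted.</div>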

<div><br></div><div>With #3, we should still allow forming vector loads & stores, and large sub-aggregate object loads & stores, as those don't bloat the IR (slowing down compiles) and should already be lowered efficiently in the backend. This helps ensure we still decompose aggregates as aggressively as possible in SROA.</div>

<div><br></div><div><br></div><div>I've run these changes through the nightly test suite, and so far the numbers are good. My machine is sadly quite noisy, but investigating the swings in execution times, the only test which slowed down significantly (more than 5%) seems to be the very test case in PR13392. The generated IR is *much* better now, but the register allocator makes some bad decisions and we end up with about 4x the amount of time stalled in the frontend of the CPU.</div>

<div><br></div><div>The compile times for the test case in PR13392 and for a sha1 implementation reported on the mailing list are *greatly* improved -- 2x faster in some cases.</div><div><br></div><div><br></div><div>I'm still updating the regression tests which were lacking legal integer sizes and/or were relying on illegal integer sizes in the output, but I think the code essentially "works". I'll also be doing more extensive performance testing. =]</div>

<div><br></div><div>-Chandler</div>