[PATCH][instcombine]: Slice a big load in two loads when the element are next to each other in memory.

Tue Oct 8 10:28:42 PDT 2013

On Oct 8, 2013, at 9:30 AM, Quentin Colombet <qcolombet at apple.com> wrote:

> To me, vector registers hold both values within the same register, whereas a paired load defines two different registers.

That’s not really accurate on something like AArch32, where float vector registers are just pairs of float scalar registers.  The problem generalizes to other targets with undifferentiated register files.  Scalar GPUs typically don’t differentiate between a sequence of scalar registers and a vector of that width.

> Assuming the vector registers and the regular registers are on the same bank, i.e., extracting both values from the vector register is free, it may be still interesting to use the paired load because it may be less constrained by register allocation.

That seems like a cost analysis to me.  Most CPU architectures I know of have plenty of registers in their FP/vector banks, so it seems unlikely that the small reduction in register pressure would be worth the increased pressure on the load pipe.

> Now, if the registers are not on the same bank, we may definitely not want to generate the vector load, but still be able to catch the paired load.

It depends on where the values are going to end up.  In the example you gave involving _Complex, the ultimate destination type is float, even though the load is currently written as occurring in the integer bank.  On AArch32, it seems like the optimal code would be a <2 x float> load, and the individual lanes can then be sourced directly by their users.  On X86, it’s unclear to me why it would be better to generate two scalar loads versus one <2 x float> load and a lane extract to get the high lane out.
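For illustration, here is a minimal C++ sketch of the pattern being discussed. The original _Complex example is not reproduced in this message, so the function names and the exact shape of the code are assumptions. It also assumes a little-endian target with 32-bit float. The first function mirrors a load written in the integer domain and then sliced with shift/truncate; the second reads the two adjacent floats directly.

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical illustration only; assumes little-endian and 32-bit float.

// Wide form: one 64-bit integer load, then shift/truncate to recover each half.
std::pair<float, float> load_wide(const float *p) {
    std::uint64_t bits;
    std::memcpy(&bits, p, sizeof bits);                          // single 64-bit load
    std::uint32_t lo = static_cast<std::uint32_t>(bits);         // truncate: low half
    std::uint32_t hi = static_cast<std::uint32_t>(bits >> 32);   // shift + truncate: high half
    float re, im;
    std::memcpy(&re, &lo, sizeof re);                            // reinterpret bits as float
    std::memcpy(&im, &hi, sizeof im);
    return {re, im};
}

// Sliced form: two adjacent 32-bit loads, already in the float domain.
std::pair<float, float> load_sliced(const float *p) {
    return {p[0], p[1]};
}
```

Both functions return the same pair; the question in this thread is which machine-level form (two scalar loads, a paired load, or one vector load plus lane extracts) a given target should prefer.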

>>>> Even on x86, you’re probably best off trying to turn this into a <2 x i32> vector load + lane copy rather than splitting the load.
>>> My plan here was to turn this into two split loads, then have another DAG combine merge those loads into a vector load. I thought maybe targets supporting paired loads already do this kind of target-specific combine.
>> 
>> You certainly can go that way, but it sounds challenging to implement.  My point was that it seems like you could achieve most of the same benefit with a DAG combine that turned this example into a <2 x float> vector load, with much less concern about the cost model, etc.
> My concern then is that we may introduce cross register bank copies (e.g., NEON to regular register) and in that case, the new code would be more expensive.
> My approach here was to not insert vector code. I believe the same approach was used in SROA: do not insert vector code, unless the code has already been vectorized.

Ultimately, I don’t think this is an optimization that you can do without a lot of target profitability knowledge.  With that information, you can introduce vectors on targets that want them, and avoid doing so on targets that don’t.  Even within a single target, the choice of whether to split or to vectorize depends on the ultimate type of the users of the values.  On AArch32, we’d prefer paired loads for integers (unless they’re being fed to integer NEON instructions!), but vector loads for floats.
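The AArch32 preference described above could be sketched roughly as follows. This is not an LLVM API: the enum and function names are hypothetical, and mapping the integer-NEON case to a vector load is an inference from the parenthetical rather than something stated outright. Other targets are deliberately left out, since the right choice there is not settled in this thread.

```cpp
// Hypothetical names; a sketch of a per-user-type profitability choice.
enum class UserKind { ScalarInteger, NeonInteger, Float };
enum class LoadStrategy { PairedLoad, VectorLoad };

// Picks a load form for AArch32 based on how the loaded values are consumed.
constexpr LoadStrategy chooseAArch32Strategy(UserKind kind) {
    switch (kind) {
    case UserKind::ScalarInteger:
        return LoadStrategy::PairedLoad;  // integers: prefer a paired load
    case UserKind::NeonInteger:           // assumed: integer NEON users want a vector
    case UserKind::Float:
        return LoadStrategy::VectorLoad;  // floats: prefer a vector load
    }
    return LoadStrategy::VectorLoad;      // not reached for valid enum values
}
```

The point of the sketch is only that the decision is keyed on the users of the values, not on the type the load happens to be written with.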

—Owen